The Foundation of High Availability
In distributed systems, failures are inevitable. Health checks are the primary mechanism used by load balancers and service meshes to detect failures and redirect traffic. Without them, your system suffers from the 'black hole' effect, where traffic is sent to dead or non-responsive instances.
1. Active vs. Passive Health Checks
- Active Health Checks: The load balancer proactively sends requests to backends at regular intervals. It determines health based on the response.
- Passive Health Checks (Outlier Detection): The load balancer monitors real user traffic instead of sending its own probes. If a backend fails more than a configured share of those requests (for example, by returning 5xx errors), it is temporarily ejected from the pool and readmitted after a cooldown.
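The two approaches can be combined in a single routing decision. The following is a minimal sketch, not any particular load balancer's implementation: the Backend class, the function names, and the defaults (a 20-request window, a 50% failure ratio, a 30-second ejection) are all assumptions chosen for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Optional


class Backend:
    """One backend plus the state both kinds of check need (hypothetical model)."""

    def __init__(self, name: str, window: int = 20) -> None:
        self.name = name
        self.active_healthy = True  # result of the most recent active probe
        self.recent: Deque[bool] = deque(maxlen=window)  # True = request failed
        self.ejected_until = 0.0  # timestamp until which the backend is ejected


def run_active_check(backend: Backend, probe: Callable[[str], bool]) -> None:
    """Active check: the balancer sends its own probe and records the outcome."""
    try:
        backend.active_healthy = probe(backend.name)
    except Exception:
        # A probe that raises (timeout, refused connection) counts as unhealthy.
        backend.active_healthy = False


def record_response(
    backend: Backend,
    status: int,
    now: Optional[float] = None,
    failure_ratio: float = 0.5,   # assumed threshold
    min_requests: int = 10,       # avoid ejecting on a tiny sample
    ejection_seconds: float = 30.0,
) -> None:
    """Passive check: observe a real user response and eject on too many 5xx."""
    now = time.monotonic() if now is None else now
    backend.recent.append(status >= 500)
    if len(backend.recent) < min_requests:
        return
    ratio = sum(backend.recent) / len(backend.recent)
    if ratio >= failure_ratio:
        backend.ejected_until = now + ejection_seconds
        backend.recent.clear()  # start with a clean window after readmission


def routable(backends: List[Backend], now: Optional[float] = None) -> List[Backend]:
    """A backend receives traffic only if both mechanisms consider it healthy."""
    now = time.monotonic() if now is None else now
    return [b for b in backends if b.active_healthy and now >= b.ejected_until]
```

In this sketch the active probe is passed in as a callable, so the same routing logic works whether the probe is a TCP connect or an HTTP request. The minimum request count keeps a single failed request on a quiet backend from triggering an ejection.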
2. TCP vs. HTTP Health Checks
- TCP Checks: A simple 'three-way handshake' to see if the port is open. It's fast but can be misleading; a process might accept connections but be stuck in a deadlock internally.
- HTTP Checks: The gold standard for applications. The load balancer requests a specific path (e.g., /healthz). The application can perform internal dependency checks (DB connectivity, disk space) before returning 200 OK.
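The difference between the two check types can be illustrated with Python's standard library. This is a sketch under stated assumptions: tcp_check, http_check, dependencies_ok, and HealthHandler are hypothetical names, the 100 MiB disk threshold is arbitrary, and returning 503 on failure is one common convention rather than a requirement.

```python
import http.client
import shutil
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """TCP check: healthy if the three-way handshake completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host: str, port: int, path: str = "/healthz",
               timeout: float = 1.0) -> bool:
    """HTTP check: healthy only if the application answers 200 on the path."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def dependencies_ok() -> bool:
    """Hypothetical dependency probe; a real one might also ping the database."""
    free_bytes = shutil.disk_usage("/").free
    return free_bytes > 100 * 1024 * 1024  # assumed threshold: 100 MiB free


class HealthHandler(BaseHTTPRequestHandler):
    """Application side: run dependency checks before answering /healthz."""

    checks = staticmethod(dependencies_ok)  # replaceable for testing

    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        ok = self.checks()
        body = b"ok" if ok else b"unhealthy"
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # silence per-request logging in this sketch


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for local experimentation.
    ThreadingHTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

When a dependency check fails, the server still accepts connections, so tcp_check keeps returning True while http_check returns False. That gap is the reason a TCP check alone can be misleading.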