Load Balancing and Reverse Proxies

Health Checks and Failover

Mastering Resilience: Health Checks and Failover

Learning Objectives

• Distinguish between active and passive health checking mechanisms
• Design effective health check endpoints for microservices using HTTP and TCP
• Implement the circuit breaker pattern to prevent cascading failures
• Execute zero-downtime maintenance using connection draining techniques

The Foundation of High Availability

In distributed systems, failures are inevitable. Health checks are the primary mechanism used by load balancers and service meshes to detect failures and redirect traffic. Without them, your system suffers from the 'black hole' effect, where traffic is sent to dead or non-responsive instances.

1. Active vs. Passive Health Checks

Active Health Checks: The load balancer proactively sends probe requests to each backend at regular intervals and marks a backend unhealthy when probes fail or time out, typically after several consecutive failures.
Passive Health Checks (Outlier Detection): The load balancer monitors real user traffic instead of sending probes. If a backend fails more than a set percentage of requests (e.g., by returning 5xx errors), it is temporarily ejected from the pool.
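
To make the passive approach concrete, here is a minimal sketch of outlier detection over a sliding window of recent responses. The class name `OutlierDetector` and its parameters (`window`, `max_error_rate`, `min_requests`) are illustrative assumptions, not the API of any particular load balancer, and re-admitting a backend after an ejection period is left out for brevity.

```python
from collections import deque


class OutlierDetector:
    """Passive health check: eject a backend when its recent error rate is too high."""

    def __init__(self, window=20, max_error_rate=0.5, min_requests=5):
        self.window = window                  # how many recent responses to remember
        self.max_error_rate = max_error_rate  # eject above this fraction of 5xx responses
        self.min_requests = min_requests      # avoid ejecting on too little data
        self.results = {}                     # backend -> deque of booleans (True = error)
        self.ejected = set()

    def record(self, backend, status_code):
        """Record the status of one real user request served by `backend`."""
        history = self.results.setdefault(backend, deque(maxlen=self.window))
        history.append(status_code >= 500)
        if len(history) >= self.min_requests:
            error_rate = sum(history) / len(history)
            if error_rate > self.max_error_rate:
                self.ejected.add(backend)

    def healthy_backends(self, backends):
        """Return the backends that have not been ejected."""
        return [b for b in backends if b not in self.ejected]


# Hypothetical traffic: one success followed by three 503 responses.
detector = OutlierDetector(window=10, max_error_rate=0.5, min_requests=4)
for status in (200, 503, 503, 503):
    detector.record("10.0.0.2", status)
```

After these four responses the error rate for `10.0.0.2` is 75%, so `detector.healthy_backends(["10.0.0.1", "10.0.0.2"])` returns only `10.0.0.1`.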

2. TCP vs. HTTP Health Checks

TCP Checks: The load balancer completes a TCP three-way handshake to confirm that the port is accepting connections. This is fast but can be misleading: a process might accept connections while being deadlocked internally.
HTTP Checks: The gold standard for applications. The load balancer requests a specific path (e.g., /healthz). The application can perform internal dependency checks (DB connectivity, disk space) before returning 200 OK.
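
The difference between the two check types can be sketched with the Python standard library. This is an illustration under stated assumptions, not a production implementation: `check_database` is a hypothetical placeholder, the 100 MB disk threshold and port 8080 are arbitrary, and returning 503 on failure is a common convention rather than something the lesson mandates.

```python
import json
import shutil
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


def tcp_check(host, port, timeout=1.0):
    """TCP check: succeeds if the handshake completes, says nothing about app health."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_database():
    """Hypothetical placeholder: replace with a real query such as SELECT 1."""
    return True


def check_disk(path="/", min_free_bytes=100 * 1024 * 1024):
    """Report healthy only if the volume has at least min_free_bytes available."""
    return shutil.disk_usage(path).free >= min_free_bytes


def evaluate_health(checks):
    """Run each named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    checks = {"database": check_database, "disk": check_disk}

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, results = evaluate_health(self.checks)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` exposes `/healthz`. A TCP probe against that port would pass as long as the socket is listening, while the HTTP probe only passes when every dependency check succeeds.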

3. Circuit Breaker Pattern

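When a downstream service is struggling, continuing to hammer it with requests makes things worse. A Circuit Breaker wraps calls to that service and counts failures. Once a failure threshold is reached it 'trips' (opens) and immediately fails subsequent calls without attempting them, which protects both the caller and the struggling service. After a cooldown period it transitions to 'Half-Open' and lets a trial request through: success closes the circuit again, while failure re-opens it.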
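
The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration; the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are assumptions chosen for this example, and a production version would also need locking and a limit on concurrent trial calls while half-open.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half-open"  # cooldown elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call re-opens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as a signal to fall back rather than retry.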

4. Graceful Degradation

If a health check fails or a circuit breaker trips, your application should degrade gracefully. Instead of a generic error, return cached data, default values, or a simplified UI that doesn't rely on the failing component.
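
One way to return cached data or default values is to remember the last good response per key. The sketch below assumes a hypothetical `FallbackCache` helper and a caller-supplied `loader` function; it also returns a `degraded` flag so the caller can choose a simplified UI.

```python
class FallbackCache:
    """Serve the last known good value when the live call fails."""

    def __init__(self, default=None):
        self.default = default
        self.cache = {}

    def fetch(self, key, loader):
        """Return (value, degraded). `degraded` is True when a fallback was used."""
        try:
            value = loader(key)
        except Exception:
            # Degrade: prefer stale cached data, then the default value.
            return self.cache.get(key, self.default), True
        self.cache[key] = value
        return value, False
```

For a recommendations widget, for instance, `default` might be an empty list, so that the page still renders without the failing component.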

5. Connection Draining

To perform maintenance without dropping user connections, we use Draining (also called De-registration Delay). The load balancer stops sending new requests to the node but allows in-flight requests to complete within a timeout period; once the timeout expires, any remaining connections are closed and the node can be taken offline.
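
The draining sequence can be modeled with an in-flight counter and a deadline. This is a simplified sketch under stated assumptions: the `Backend` class and its method names are hypothetical, the counter is not synchronized for concurrent use, and a real load balancer would track connections itself rather than rely on the application.

```python
import time


class Backend:
    def __init__(self, name):
        self.name = name
        self.draining = False
        self.in_flight = 0

    def accepts_new_requests(self):
        """The load balancer consults this before routing a new request here."""
        return not self.draining

    def begin_request(self):
        if self.draining:
            raise RuntimeError(f"{self.name} is draining")
        self.in_flight += 1

    def end_request(self):
        self.in_flight -= 1

    def drain(self, timeout=30.0, poll_interval=0.1,
              clock=time.monotonic, sleep=time.sleep):
        """Stop taking new requests, then wait for in-flight ones to finish.

        Returns True if the backend emptied before the timeout, False otherwise.
        """
        self.draining = True
        deadline = clock() + timeout
        while self.in_flight > 0 and clock() < deadline:
            sleep(poll_interval)
        return self.in_flight == 0
```

A deployment script could call `drain()` before stopping the process and use the return value to decide whether to wait longer or force the shutdown.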
