Network Troubleshooting and Monitoring

Network Monitoring and Alerting

Production Network Observability

Learning Objectives

• Identify key network metrics (USE/RED methods) critical for distributed system health
• Configure Prometheus exporters (Node Exporter, Blackbox Exporter) to scrape network telemetry
• Design effective Grafana dashboards to visualize latency, throughput, and packet loss
• Define network Service Level Objectives (SLOs) and implement actionable alerting strategies

Key Network Metrics for Reliability

For senior engineers, network monitoring goes beyond checking whether a host is up. We apply the USE Method (Utilization, Saturation, Errors) to resources such as interfaces and connection tables, and the RED Method (Rate, Errors, Duration) to the services that depend on them.

• Throughput: bytes sent and received per second (bandwidth utilization).
• Latency: round-trip time (RTT) and connection establishment time.
• Packet Loss & Errors: retransmissions, dropped packets, CRC errors.
• Saturation: conntrack table usage, file descriptor limits.
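
The metrics above can be sketched as Prometheus recording rules built on Node Exporter data. This is a minimal illustration, not a definitive rule set: the group name and the recorded series names are assumptions chosen for this example, and latency is left out because it usually comes from probes rather than from host counters.

```yaml
# rules/network_use.yml (hypothetical file name)
groups:
  - name: network_use_metrics   # assumed group name
    rules:
      # Utilization: throughput in bits per second, per interface (loopback excluded).
      - record: instance_device:node_network_receive_bits:rate5m
        expr: rate(node_network_receive_bytes_total{device!="lo"}[5m]) * 8
      - record: instance_device:node_network_transmit_bits:rate5m
        expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8

      # Errors: fraction of sent TCP segments that were retransmitted.
      - record: instance:node_tcp_retransmit:ratio_rate5m
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m])

      # Errors: packets dropped on receive, per second and per interface.
      - record: instance_device:node_network_receive_drop:rate5m
        expr: rate(node_network_receive_drop_total{device!="lo"}[5m])

      # Saturation: conntrack table fill level, from 0 to 1.
      - record: instance:node_nf_conntrack:usage_ratio
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit

      # Saturation: allocated file descriptors relative to the system maximum.
      - record: instance:node_filefd:usage_ratio
        expr: node_filefd_allocated / node_filefd_maximum
```

Recording these ratios once keeps dashboard and alert expressions short and ensures both use the same definition.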

Prometheus & Exporters

Prometheus is a de facto standard for metrics collection in cloud-native environments. It scrapes metrics over HTTP from exporters; for network data, we primarily rely on two:

1. Node Exporter

Exposes hardware and OS metrics provided by *NIX kernels, by default on port 9100 at /metrics.

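
A minimal sketch of a Prometheus scrape job for Node Exporter, assuming the exporter runs on its default port 9100. The host names and the 15-second interval are placeholders for illustration, not values from this lesson.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: node
    scrape_interval: 15s          # assumed interval; tune to your retention and load
    static_configs:
      - targets:
          # Hypothetical hosts, each running Node Exporter on port 9100.
          - node1.example.internal:9100
          - node2.example.internal:9100
```

On Linux, the collectors that produce the network metrics discussed here (netdev, netstat, conntrack) are enabled by default, so no extra exporter flags should be needed for them.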

Key Metrics to Watch:

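• node_network_receive_bytes_total / node_network_transmit_bytes_total (throughput per interface)
• node_network_receive_errs_total / node_network_transmit_errs_total (interface errors)
• node_network_receive_drop_total / node_network_transmit_drop_total (dropped packets)
• node_netstat_Tcp_RetransSegs (retransmitted TCP segments; a rising rate relative to node_netstat_Tcp_OutSegs points to packet loss or congestion)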

2. Blackbox Exporter

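Probes endpoints from the outside over HTTP, HTTPS, DNS, TCP, and ICMP, and reports availability (probe_success) and latency (probe_duration_seconds) as a client would see them.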

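
A minimal sketch of a Blackbox Exporter setup, in two parts. The first fragment defines probe modules in the exporter's own configuration; the module names are conventional but arbitrary, and the timeouts are assumed values.

```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
```

The second fragment tells Prometheus to scrape the exporter's /probe endpoint, passing each target as a URL parameter. The target URL and the exporter address are hypothetical; 9115 is the exporter's default port.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]          # must match a module name in blackbox.yml
    static_configs:
      - targets:
          - https://service.example.internal/healthz   # hypothetical endpoint to probe
    relabel_configs:
      # Pass the listed target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on the resulting series.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the Blackbox Exporter itself.
      - target_label: __address__
        replacement: blackbox-exporter.example.internal:9115
```

The relabeling is what lets one exporter probe many endpoints while each series still carries the probed endpoint as its instance label.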

Alerting & SLOs

Avoid alert fatigue by alerting on symptoms (user pain) rather than causes. Define an SLO (e.g., "99.9% of internal requests complete within 100 ms") and alert on the error budget burn rate: how fast the allowed 0.1% of slow or failed requests is being consumed relative to the SLO window.
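
A burn-rate alert for the example SLO could look like the sketch below. It assumes a latency histogram named http_request_duration_seconds with a bucket boundary at 0.1 seconds; that metric name, the alert name, and the severity label are hypothetical and not part of this lesson. The factor 14.4 follows the common multi-window pattern: sustained for one hour, it consumes 2% of a 30-day error budget (14.4 × 1 h / 720 h).

```yaml
# rules/network_slo.yml (hypothetical file name)
groups:
  - name: network_slo_alerts
    rules:
      - alert: LatencySloFastBurn
        # Fire only when both the long (1h) and short (5m) windows exceed
        # 14.4x the allowed bad-request ratio of 0.001 (SLO target 99.9%).
        expr: |
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="0.1"}[1h]))
              /
              sum(rate(http_request_duration_seconds_count[1h]))
            )
          ) > (14.4 * 0.001)
          and
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
              /
              sum(rate(http_request_duration_seconds_count[5m]))
            )
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Latency SLO error budget is burning at more than 14.4x the sustainable rate"
```

Requiring the short window as well means the alert stops firing soon after the problem is fixed, instead of lingering until the one-hour average recovers.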

Example Alert Rule (PromQL): High TCP Retransmission Rate

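
A minimal sketch of such a rule, built on the Node Exporter counters for retransmitted and sent TCP segments. The 2% threshold, the 10-minute hold time, and the severity label are assumed values for illustration and should be tuned against your own baseline.

```yaml
# rules/network_alerts.yml (hypothetical file name)
groups:
  - name: network_alerts
    rules:
      - alert: HighTcpRetransmissionRate
        # Share of sent TCP segments that were retransmitted over the last 5 minutes.
        expr: |
          rate(node_netstat_Tcp_RetransSegs[5m])
            /
          rate(node_netstat_Tcp_OutSegs[5m])
            > 0.02
        for: 10m                  # require the condition to hold before firing
        labels:
          severity: warning
        annotations:
          summary: "High TCP retransmission rate on {{ $labels.instance }}"
          description: "{{ $value | humanizePercentage }} of TCP segments were retransmitted over the last 5 minutes."
```

Alerting on the ratio rather than the raw retransmission count keeps the rule meaningful across hosts with very different traffic volumes.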