Network Troubleshooting and Monitoring

Network Monitoring and Alerting

Production Network Observability

Key Network Metrics for Reliability

For senior engineers, network monitoring goes beyond checking whether a host is up. We focus on the USE Method (Utilization, Saturation, Errors) for resources such as interfaces and links, and the RED Method (Rate, Errors, Duration) for the services that run over them.

  • Throughput: Bytes sent and received per second, compared against link capacity (bandwidth utilization).
  • Latency: Round-trip time (RTT) and connection establishment time.
  • Packet Loss & Errors: Retransmissions, dropped packets, CRC errors.
  • Saturation: Conntrack table usage and open file descriptors, each measured against its limit.
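
The saturation signals in the list above could be expressed as ratios against their limits. The following is a minimal sketch, assuming the metrics are collected by Prometheus from Node Exporter (both introduced in the next section) with its conntrack and filefd collectors enabled. The group name and the recorded rule names are illustrative, not part of the original lesson.

```yaml
# Hypothetical rules file, for example saturation.rules.yml.
groups:
  - name: network-saturation        # illustrative group name
    rules:
      # Fraction of the connection-tracking table in use (0.0 to 1.0).
      - record: instance:conntrack_usage:ratio
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit

      # Fraction of the system-wide file descriptor limit in use (0.0 to 1.0).
      - record: instance:filefd_usage:ratio
        expr: node_filefd_allocated / node_filefd_maximum
```

A ratio close to 1.0 means the resource is nearly exhausted; on a full conntrack table, new connections are typically dropped even though the link itself has spare bandwidth.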

Prometheus & Exporters

Prometheus is a widely adopted open-source system for collecting and querying metrics. For network data, we primarily rely on two exporters:

1. Node Exporter

Exposes hardware and OS metrics from *NIX kernels.


Key Metrics to Watch:

  • node_network_receive_bytes_total / node_network_transmit_bytes_total
  • node_network_receive_errs_total / node_network_transmit_drop_total
  • node_netstat_Tcp_RetransSegs (Critical for detecting network congestion)
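
The counters above only become useful once they are turned into per-second rates. As a rough sketch, the recording rules below do that per interface. The 5-minute window, the exclusion of the loopback device, and all rule names are assumptions made for illustration.

```yaml
# Hypothetical rules file, for example interface.rules.yml.
groups:
  - name: network-interface         # illustrative group name
    rules:
      # Receive and transmit throughput in bits per second, per interface.
      - record: instance_device:network_receive_bits:rate5m
        expr: rate(node_network_receive_bytes_total{device!="lo"}[5m]) * 8
      - record: instance_device:network_transmit_bits:rate5m
        expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8

      # Receive errors and transmit drops per second, per interface.
      - record: instance_device:network_receive_errs:rate5m
        expr: rate(node_network_receive_errs_total{device!="lo"}[5m])
      - record: instance_device:network_transmit_drop:rate5m
        expr: rate(node_network_transmit_drop_total{device!="lo"}[5m])
```

On a healthy link the error and drop rates should stay at or near zero, so any sustained non-zero value is worth investigating.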

2. Blackbox Exporter

Probes endpoints from the outside over HTTP, HTTPS, DNS, TCP, and ICMP, measuring what a client would actually see.

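
A Blackbox Exporter setup usually has two parts: probe modules defined in the exporter's own configuration, and a Prometheus scrape job that passes each target to the exporter. The sketch below shows one possible layout. The target URL, the exporter address 127.0.0.1:9115 (the exporter's default port), the timeouts, and the job name are assumptions for illustration.

```yaml
# --- blackbox.yml (exporter side): probe modules ---
modules:
  http_2xx:
    prober: http
    timeout: 5s
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
---
# --- prometheus.yml (Prometheus side): scrape job that drives the probes ---
scrape_configs:
  - job_name: blackbox_http          # illustrative job name
    metrics_path: /probe
    params:
      module: [http_2xx]             # must match a module name defined above
    static_configs:
      - targets:
          - https://example.com      # placeholder endpoint to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed endpoint as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the Blackbox Exporter.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

Each probe then reports metrics such as probe_success and probe_duration_seconds, labeled with the probed endpoint rather than the exporter's own address.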

Alerting & SLOs

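Avoid alert fatigue by alerting on symptoms (user pain) rather than causes. Define an SLO (e.g., "99.9% of requests complete within 100 ms on the internal network") and alert on the error budget burn rate, that is, how quickly the allowed 0.1% of slow or failed requests is being used up relative to the SLO window.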
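
One way to turn the example SLO into a burn-rate alert is sketched below. It assumes that Blackbox Exporter probes stand in for real requests, that probes are scraped every 30 seconds, and that the SLO is measured over 30 days. The factor 14.4 is the fast-burn multiplier suggested in the Google SRE Workbook: it corresponds to spending 2% of a 30-day budget in one hour. Rule names, the alert name, and the severity label are illustrative.

```yaml
# Hypothetical rules file, for example slo.rules.yml.
groups:
  - name: network-slo-burn-rate      # illustrative group name
    rules:
      # Fraction of probes that failed or took longer than 100 ms, per window.
      - record: probe:slow_or_failed:ratio_5m
        expr: 1 - avg_over_time((probe_success * (probe_duration_seconds < bool 0.1))[5m:30s])
      - record: probe:slow_or_failed:ratio_1h
        expr: 1 - avg_over_time((probe_success * (probe_duration_seconds < bool 0.1))[1h:30s])

      # Fire when the budget (0.1% = 0.001) burns 14.4x too fast over both
      # the long window and the short window.
      - alert: NetworkLatencySLOFastBurn
        expr: |
          probe:slow_or_failed:ratio_1h > (14.4 * 0.001)
          and
          probe:slow_or_failed:ratio_5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page             # assumed routing label
        annotations:
          summary: "Latency SLO error budget is burning fast for {{ $labels.instance }}"
```

Requiring both windows keeps the alert from firing on a brief spike and lets it clear soon after the problem stops.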

Example Alert Rule (PromQL): High TCP Retransmission Rate

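
The original rule body is not available here, so the following is a sketch of what such a rule might look like rather than the lesson's own. It expresses retransmitted segments as a share of all segments sent, using Node Exporter's netstat collector. The 5% threshold, the 10-minute hold time, and the severity label are assumptions and should be tuned to the environment.

```yaml
# Hypothetical rules file, for example tcp.rules.yml.
groups:
  - name: network-tcp                # illustrative group name
    rules:
      - alert: HighTcpRetransmissionRate
        # Retransmitted segments as a fraction of all segments sent.
        expr: |
          rate(node_netstat_Tcp_RetransSegs[5m])
            / rate(node_netstat_Tcp_OutSegs[5m]) > 0.05
        for: 10m                     # assumed hold time to ignore short spikes
        labels:
          severity: warning          # assumed routing label
        annotations:
          summary: "High TCP retransmission rate on {{ $labels.instance }}"
          description: "{{ $value | humanizePercentage }} of sent TCP segments were retransmitted over the last 5 minutes."
```

Using a ratio rather than a raw retransmission count keeps the alert comparable between busy and idle hosts.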