What is Infrastructure Monitoring?
Infrastructure monitoring tracks the health of the hosts, containers, networks, and managed cloud services that sit beneath your application code. The standard set of signals is CPU utilization, memory pressure, disk I/O and free space, network throughput and packet loss, plus service-specific metrics: queue depth on SQS, replica lag on a database, connection counts on a load balancer, target health on an ECS service. Infrastructure monitoring answers the question "is the layer below my app healthy?" before you start asking why your app itself is slow.
An infrastructure-monitoring agent (Datadog Agent, Prometheus node_exporter, AWS CloudWatch Agent, Telegraf, Beats) runs on each host or as a sidecar alongside each container. It reads OS-level counters (procfs, /sys, performance counters on Windows), polls the cloud provider's metrics API (CloudWatch, Azure Monitor, Google Cloud Monitoring), and makes the resulting time series available to a backend for storage, query, and alerting.
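As a rough illustration of the collection step, the sketch below parses the text of Linux's /proc/meminfo and turns it into one time-series sample. The function names and the shape of the sample dictionary are assumptions made for this example, not the format any particular agent uses.

```python
from __future__ import annotations

import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: numeric value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: dict[str, int]) -> float:
    """Percent of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return round(100.0 * (total - available) / total, 2)


def build_sample(host: str, meminfo_text: str, timestamp: float | None = None) -> dict:
    """Build one time-series point (hypothetical shape) to hand to a backend."""
    return {
        "metric": "memory.used_percent",
        "host": host,
        "timestamp": time.time() if timestamp is None else timestamp,
        "value": memory_used_percent(parse_meminfo(meminfo_text)),
    }
```

On a Linux host, the input text would come from reading /proc/meminfo on a fixed interval; a real agent adds batching, retries, and many more counters.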
Infrastructure monitoring vs application monitoring
Two layers, both needed, often mixed up:
- Infrastructure monitoring: the host, the container, the cloud service. CPU, memory, disk, queue depth. Answers "is the platform healthy?"
- Application monitoring (APM): the code running on top. Endpoint latency, error rate, traces. Answers "is the app behaving?" See APM.
Both layers fail in characteristic ways. CPU pegged at 100% is an infrastructure signal; you fix it by adding capacity or by removing a hot loop in code. p95 latency climbing while CPU stays flat is an application signal; the fix is usually a slow query or a downstream API. Modern observability platforms correlate the two: observability as a discipline emerged partly because alerting on isolated host metrics produced too many false positives without app context.
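The distinction can be sketched as a small triage function. This is a simplified heuristic, not how any observability platform implements correlation; the 90% CPU limit and the 1.5x latency factor are arbitrary assumptions chosen for illustration.

```python
def classify_signal(
    cpu_percent: float,
    p95_ms: float,
    baseline_p95_ms: float,
    cpu_limit: float = 90.0,      # assumed cutoff for "pegged" CPU
    latency_factor: float = 1.5,  # assumed multiple of baseline that counts as "climbing"
) -> str:
    """Guess which layer to investigate first from two correlated signals."""
    cpu_hot = cpu_percent >= cpu_limit
    latency_high = p95_ms >= latency_factor * baseline_p95_ms
    if cpu_hot:
        # Saturated host: look at capacity or a hot loop before anything else.
        return "infrastructure"
    if latency_high:
        # Latency up while CPU is flat: suspect a slow query or downstream API.
        return "application"
    return "healthy"
```

For example, `classify_signal(35, 900, 200)` points at the application layer, while `classify_signal(100, 200, 180)` points at the host.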
What infrastructure monitoring covers
- Hosts and VMs: CPU, load average, memory, swap, disk I/O, free disk, inode usage, file-descriptor count, process count, kernel events.
- Containers (Docker, Kubernetes): per-container CPU, memory, restart count, OOMKilled events, pod readiness, node pressure, image-pull failures.
- Networks: throughput in/out, packet loss, retransmissions, connection-tracking table fullness, security-group flow logs.
- Load balancers: target health, request count, 5xx rate, latency p95 at the LB layer, connection counts.
- Databases: connections used, replica lag, query throughput, slow-query log, cache hit ratio, lock contention.
- Message queues: queue depth, message age, consumer lag, dead-letter count.
- Managed cloud services: SQS depth, S3 4xx/5xx, DynamoDB throttled requests, Lambda concurrency, RDS CPU and connections.
Key infrastructure alerts
- Disk free under 15% on any host. Catches log-rotation failures and runaway temp files before they take the service down.
- CPU sustained above 80% for 10+ minutes. Tells you a host is at capacity, often before app latency spikes.
- Memory pressure or OOMKilled events on any container. Often the first symptom of a memory leak or a JVM heap that needs tuning.
- Load-balancer target unhealthy for 2+ minutes. Direct signal that traffic is being routed away from a host that should be serving.
- Queue depth above N, where N is the number of messages your consumers can process in 15 minutes. Catches consumer crashes or downstream slowness before customers notice.
- Network packet loss above 1% sustained. Usually a switch, a security group misconfiguration, or a misbehaving NIC.
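Several of the alerts above depend on a condition holding for a period of time rather than on a single reading. A minimal sketch of that "sustained above threshold" check, along with the queue-depth arithmetic, might look like the following. The function names and the (timestamp, value) sample layout are assumptions for this example; alerting backends express the same idea in their own rule syntax.

```python
from __future__ import annotations


def sustained_above(
    samples: list[tuple[float, float]], threshold: float, duration_s: float
) -> bool:
    """True if the newest samples have stayed above threshold for duration_s.

    samples: (unix_timestamp, value) pairs, sorted oldest first.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the value drops to or below the threshold.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    return breach_start is not None and latest_ts - breach_start >= duration_s


def queue_depth_threshold(
    consumers: int, msgs_per_consumer_per_s: float, window_s: float = 15 * 60
) -> int:
    """N for the queue alert: messages the consumer fleet can process in the window."""
    return int(consumers * msgs_per_consumer_per_s * window_s)
```

With one CPU sample per minute, `sustained_above(samples, 80, 600)` corresponds to the "above 80% for 10+ minutes" rule, and a single dip below the threshold resets the clock.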
How to set it up
For cloud-native stacks: enable CloudWatch (or your provider's native monitoring) as a baseline, then install the Datadog Agent, Prometheus node_exporter plus Grafana, or your APM vendor's infrastructure module on each host. For Kubernetes: kube-state-metrics plus node_exporter scraped by Prometheus is the OSS default; the Datadog Cluster Agent and the New Relic Kubernetes integration are the commercial equivalents. Add alerts incrementally; start with the six above and expand as you learn your failure modes.
Pair infrastructure monitoring with load tests to validate the alerts. Run load, spike, or capacity tests against staging and watch which host metric climbs first. If your CPU alert at 80% fires 30 seconds before your endpoint latency alert, that is the early-warning system you want in production. If it never fires, either the threshold is wrong or CPU is not the resource that saturates first.
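To see which metric breaks first during a test run, one option is to export each metric as a time series and compare the moment each one crosses its threshold. The sketch below assumes a hypothetical layout (metric name mapped to a list of (seconds since test start, value) points) and uses simple threshold crossing rather than a sustained-duration rule.

```python
from __future__ import annotations


def first_breach_times(
    series: dict[str, list[tuple[float, float]]],
    thresholds: dict[str, float],
) -> list[tuple[str, float]]:
    """Return (metric, seconds) for each metric that crossed its threshold, earliest first."""
    breaches = []
    for name, points in series.items():
        limit = thresholds[name]
        for ts, value in points:
            if value > limit:
                breaches.append((name, ts))
                break  # only the first crossing matters
    return sorted(breaches, key=lambda item: item[1])


# Illustrative, made-up readings from a staging load test.
run = {
    "cpu_percent": [(0, 40), (30, 70), (60, 85), (90, 95)],
    "p95_latency_ms": [(0, 120), (30, 130), (60, 180), (90, 650)],
    "disk_used_percent": [(0, 50), (90, 51)],
}
limits = {"cpu_percent": 80, "p95_latency_ms": 500, "disk_used_percent": 85}
order = first_breach_times(run, limits)
```

In this made-up run, CPU crosses its threshold 30 seconds before latency does, and disk never appears in the result because it never breaches.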
If your team needs infrastructure-load correlation under production-shape traffic (which host metric breaks first, at what RPS, in which region), LoadFocus offers load testing services where engineers run the matrix and produce the breakdown.