Network Troubleshooting Flow

First Published By Atif Alam

Use this page as a checklist during incidents and design reviews. Each step links to deeper material. Check observability (logs, metrics, traces) before chasing packets.

  • User-facing: timeout, HTTP status (4xx/5xx), partial outage, one region only?
  • Scope: single client, single AZ, single service, or global?
  • Timing: did it start during a change window (deploy, DNS, SG, route)?
  • Resolve the name from the same vantage point as the client (dig, against different resolvers). See Route 53.
  • TTL and caching — Stale answers after cutover.
  • Split horizon — Private vs public zone answers.
  • Security groups — Stateful: return traffic is allowed implicitly; verify LB → target and target → dependency rules. See Networking.
  • NACLs — Stateless: tight rule sets must explicitly allow ephemeral ports for return traffic.
  • Route tables: blackhole routes, wrong NAT target, IGW routes only on public subnets, TGW/peering routes for cross-VPC traffic.
  • NAT Gateway — Private subnet outbound failures (updates, APIs) often trace to NAT or route.
  • Asymmetric routing — Forward and return traffic traverse different edges; typically shows up as intermittent TCP failures.
  • VPC Flow Logs: ACCEPT vs REJECT, bytes, src/dst. See Flow logs and network RCA.
  • VPC Reachability Analyzer — Modeled path when SG/NACL/route are unclear.
  • Listening sockets: ss -tlnp (see System calls and Linux network configuration).
  • Routes and DNS client: ip route, resolver config.
  • Ping / traceroute / mtr: ICMP success does not guarantee TCP reachability to a port, and firewalls may block ICMP.
  • Packet capture — Narrow filters, short windows. See Packet capture.
  • Istio — Sidecar injection, istioctl analyze, upstream errors. See Istio.
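
The flow-log check in the list above (ACCEPT vs REJECT, bytes, src/dst) can be sketched as a small tally of rejected flows. This is a minimal illustration, assuming space-separated records in the default version 2 field order; a custom log format would need a different FIELDS tuple. The helper names (parse_record, rejected_flows) and the sample records, including the account and ENI IDs, are made up for the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


class FlowRecord(NamedTuple):
    srcaddr: str
    dstaddr: str
    dstport: int
    action: str
    byte_count: int


def parse_record(line: str) -> Optional[FlowRecord]:
    """Parse one default-format record; return None for anything unusable."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    row = dict(zip(FIELDS, parts))
    # NODATA / SKIPDATA records (and header lines) carry no flow details.
    if row["log_status"] != "OK":
        return None
    return FlowRecord(
        srcaddr=row["srcaddr"],
        dstaddr=row["dstaddr"],
        dstport=int(row["dstport"]),
        action=row["action"],
        byte_count=int(row["bytes"]),
    )


def rejected_flows(lines: Iterable[str]) -> Counter:
    """Count REJECT records per (source, destination, destination port)."""
    counts: Counter = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec is not None and rec.action == "REJECT":
            counts[(rec.srcaddr, rec.dstaddr, rec.dstport)] += 1
    return counts


# Illustrative records only; every value here is invented.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 5432 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49153 5432 6 3 180 1700000060 1700000120 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.3.7 49154 443 6 10 5200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA",
]

if __name__ == "__main__":
    for flow, count in rejected_flows(SAMPLE).most_common():
        print(flow, count)
```

A repeated REJECT toward one destination port usually narrows the search to a security group or NACL rule on that path. Only ACCEPT records for a flow that still fails suggests looking at the host or the application instead.
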
flowchart TD
S[Symptom] --> O[Logs metrics traces]
O --> D[DNS]
D --> L[Load balancer]
L --> V[VPC SG NACL routes NAT]
V --> F[Flow logs reachability]
F --> H[Host ss route tcpdump]
H --> M[Mesh Istio if used]
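
Because ICMP success does not guarantee TCP reachability to a port, the host step can include a direct TCP handshake test. The sketch below uses only the Python standard library; tcp_probe is a hypothetical helper name, and the mapping from outcome to likely cause is a rule of thumb rather than a definitive diagnosis.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP handshake and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is
        # listening on that port (or a device actively rejected it).
        return "refused"
    except socket.timeout:
        # No answer at all: often a silent drop (SG, NACL, or route).
        return "timeout"
    except OSError as exc:
        # DNS failure, no route to host, and similar errors.
        return f"error: {exc}"


if __name__ == "__main__":
    # Hypothetical target; replace with the dependency under test.
    print(tcp_probe("127.0.0.1", 443, timeout=1.0))
```

As a rough guide, "refused" points toward the host (compare with the ss -tlnp output), while "timeout" points toward something on the path dropping packets, which is where security groups, NACLs, and route tables come back into play.
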