Network Troubleshooting Flow
Use this page as a checklist during incidents and design reviews. Each step links to deeper material; do not skip observability before chasing packets.
1. Clarify the Symptom
- User-facing: timeout, HTTP status (4xx/5xx), partial outage, one region only?
- Scope: single client, single AZ, single service, or global?
- Timing: change window (deploy, DNS, SG, route)?
2. Application and Observability
- Logs and metrics — Error rate, latency, dependency timeouts. See Observability.
- Traces — Which hop adds latency? Distributed tracing.
- HTTP semantics — Who generated the status: WAF, load balancer, or app? HTTP for Operators.
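One way to approach the "who generated the status" question is to look at response headers. The sketch below is a heuristic, not an authoritative classifier. It assumes that an AWS Application Load Balancer stamps a `Server` value beginning with `awselb` on responses it generates itself (verify this in your environment), and it assumes a hypothetical `X-App-Version` header that your application would add to every response it produces. The function name and rule table are illustrative.

```python
# Ordered (header, value prefix, layer) rules; first match wins.
# The awselb prefix is an observed ALB convention, so confirm it for your setup.
# "x-app-version" is a hypothetical header the application is assumed to set.
RULES = [
    ("server", "awselb", "load balancer"),
    ("x-app-version", "", "app"),
]


def guess_status_origin(headers: dict) -> str:
    """Guess which layer produced an HTTP response, based on its headers."""
    # Header names are case-insensitive, so normalize before matching.
    normalized = {k.lower(): str(v).lower() for k, v in headers.items()}
    for name, prefix, layer in RULES:
        if name in normalized and normalized[name].startswith(prefix):
            return layer
    return "unknown"


print(guess_status_origin({"Server": "awselb/2.0"}))
print(guess_status_origin({"Server": "gunicorn", "X-App-Version": "1.2.3"}))
```

An "unknown" result is a prompt to check load balancer access logs and application logs for the same request ID rather than a conclusion.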
3. DNS
- Resolve the name from the same vantage as the client (`dig`, different resolvers). Route 53.
- TTL and caching — Stale answers after cutover.
- Split horizon — Private vs public zone answers.
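The checks above come down to comparing what different resolvers return. As a minimal sketch, assume the answers have already been collected (for example by running `dig` against each resolver) and are passed in as plain strings. The function name, the resolver labels, and the sample addresses are illustrative assumptions.

```python
import ipaddress


def compare_dns_answers(answers: dict) -> dict:
    """Summarize A/AAAA answers gathered from several resolvers.

    `answers` maps a resolver label to the list of IPs it returned.
    """
    views = {}
    for resolver, ips in answers.items():
        parsed = [ipaddress.ip_address(ip) for ip in ips]
        views[resolver] = {
            "ips": sorted(ips),
            # All-private answers suggest a private zone is being served.
            "private": bool(parsed) and all(ip.is_private for ip in parsed),
        }
    distinct_answers = {tuple(v["ips"]) for v in views.values()}
    private_flags = {v["private"] for v in views.values()}
    return {
        "views": views,
        "resolvers_disagree": len(distinct_answers) > 1,
        # One resolver private and another public hints at split horizon.
        "split_horizon_suspected": len(private_flags) > 1,
    }


report = compare_dns_answers({
    "vpc_resolver": ["10.0.1.5"],          # illustrative private answer
    "public_resolver": ["93.184.216.34"],  # illustrative public answer
})
print(report["resolvers_disagree"], report["split_horizon_suspected"])
```

Disagreement without a private/public difference is more likely a stale cache, so compare the remaining TTL on each answer before concluding that the zone data is wrong.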
4. Load Balancer (If in Path)
- Target health, listener rules, idle timeout, TLS/SNI. Elastic Load Balancing.
- Kubernetes — Ingress / `LoadBalancer` Service / AWS LB controller. Kubernetes networking.
5. VPC Path (AWS)
- Security groups — Stateful: return traffic is allowed implicitly; verify LB → target and target → dependency rules. Networking.
- NACLs — Stateless: tight rule sets must explicitly allow the ephemeral ports used by return traffic.
- Route tables — Blackhole, wrong NAT, IGW only on public subnets, TGW/peering for cross-VPC.
- NAT Gateway — Private subnet outbound failures (updates, APIs) often trace to NAT or route.
- Asymmetric routing — Forward and return traffic cross different edges; stateful devices that see only half the flow drop it, causing intermittent TCP failures.
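The NACL point is easiest to see with a toy model. The sketch below evaluates one direction of a NACL the way AWS documents it: rules are checked in ascending rule number, the first match wins, and anything unmatched is denied. It is deliberately simplified to TCP destination ports only (no CIDR or protocol matching), and the class name, rule numbers, and port values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int     # lower numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules: list, port: int) -> bool:
    """Evaluate one direction of a NACL for a destination port."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow  # first matching rule decides
    return False  # implicit final deny for anything unmatched


# Subnet of a client that calls an HTTPS dependency.
outbound = [NaclRule(100, 443, 443, allow=True)]
tight_inbound = [NaclRule(100, 443, 443, allow=True)]
fixed_inbound = tight_inbound + [NaclRule(110, 1024, 65535, allow=True)]

ephemeral_port = 51514  # illustrative source port chosen by the client

print(nacl_allows(outbound, 443))                  # request leaves the subnet
print(nacl_allows(tight_inbound, ephemeral_port))  # reply to the ephemeral port
print(nacl_allows(fixed_inbound, ephemeral_port))  # reply after widening the rule
```

The reply is addressed to the client's ephemeral port, not to 443, which is why an inbound rule set that only mirrors the outbound port silently drops return traffic.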
6. Flow Logs and Reachability
- VPC Flow Logs — ACCEPT vs REJECT, bytes, src/dst. Flow logs and network RCA.
- VPC Reachability Analyzer — Modeled path when SG/NACL/route are unclear.
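When reading flow logs by hand, a small parser helps isolate REJECT records. The sketch below assumes the default (version 2) space-separated record format; custom log formats have different fields and would need a different `FIELDS` list. The sample records, account ID, and interface ID are made up for illustration.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_record(line: str) -> dict:
    """Split one default-format record into a field-name to string mapping."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected(lines) -> list:
    """Return the parsed records whose action is REJECT."""
    records = [parse_flow_record(line) for line in lines if line.strip()]
    return [r for r in records if r["action"] == "REJECT"]


# Illustrative records: the request is accepted, the reply is rejected.
SAMPLE = [
    "2 123456789012 eni-0abc1234 10.0.1.5 10.0.2.9 51514 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc1234 10.0.2.9 10.0.1.5 443 51514 6 4 240 1700000000 1700000060 REJECT OK",
]

for record in rejected(SAMPLE):
    print(record["srcaddr"], record["srcport"], "->", record["dstaddr"], record["dstport"])
```

An ACCEPT in one direction paired with a REJECT in the other, as in the sample, usually points at a stateless control such as a NACL rather than a security group.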
7. Host and Linux Networking
- Listening sockets — `ss -tlnp` (see System calls and Linux network configuration).
- Routes and DNS client — `ip route`, resolver config.
- Ping / traceroute / mtr — ICMP success does not guarantee TCP reachability on the service port, and firewalls often block ICMP.
- Packet capture — Narrow filters, short windows. Packet capture.
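Because ICMP results say little about a specific TCP port, it can help to test the handshake directly from the affected host. The following is a minimal sketch using only the Python standard library; the function name and default timeout are illustrative choices.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refusal, timeout, unreachable network, and DNS failure.
        return False
```

For example, `tcp_port_open("db.internal.example", 5432)` (a hypothetical host name) returns `False` both when the port is filtered and when nothing is listening, so pair it with `ss -tlnp` on the target to tell the two cases apart.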
8. Service Mesh (If Applicable)
- Istio — Sidecar injection, `istioctl analyze`, upstream errors. Istio.
9. Close the Loop
- Document timeline, root cause, and corrective vs preventive actions. QA and reliability guide.
Mental Model
flowchart TD
    S[Symptom] --> O[Logs metrics traces]
    O --> D[DNS]
    D --> L[Load balancer]
    L --> V[VPC SG NACL routes NAT]
    V --> F[Flow logs reachability]
    F --> H[Host ss route tcpdump]
    H --> M[Mesh Istio if used]
Related
- TCP/IP primer — Layers and MTU.
- Azure networking — Parallel concepts outside AWS.