outage and debugging for the warring programmer

drafts, postmortems and outages

I’ve been shoring up how we handle outages and issues at work lately, so I figured I’d write up how I try to handle them. This post assumes you know what /proc and the other tools mentioned here are.

Alert comes in
Triage Mode
- Outage? Are we violating SLOs? Check dashboards and alerts.
- Crash recovery? What happens if the service crashes right now? Are in-flight jobs recoverable, or would we lose or corrupt data?
- Rollback? Deployments? Fix forward? Figure this out before moving to the next stage.
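For the SLO question, dashboards are the first stop, but a quick count over a request log can serve as a sanity check. Here’s a minimal sketch, assuming a log with one request per line and the HTTP status code in the last field. The helper names error_rate_pct and violates_slo, the log layout, and the threshold are all made up for illustration.

```shell
# error_rate_pct FILE
# Hypothetical helper: print the percentage of requests with a 5xx status.
# ASSUMPTION: one request per line, HTTP status code in the LAST field.
error_rate_pct() {
    awk '
        { total++ }
        $NF ~ /^5[0-9][0-9]$/ { errors++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * errors / total
        }
    ' "$1"
}

# violates_slo FILE MAX_ERROR_PCT
# Hypothetical helper: exits 0 when the observed 5xx rate is above
# MAX_ERROR_PCT, and non-zero otherwise.
violates_slo() {
    rate=$(error_rate_pct "$1")
    awk -v rate="$rate" -v max="$2" 'BEGIN { exit !(rate > max) }'
}
```

Something like violates_slo <LOG_FILE> 1 && echo "over budget" would then flag anything above a 1% error rate. What the real threshold should be depends on your SLO.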
Debugging Mode
1. Is the answer obvious in the dashboards? Sometimes it is obvious from your service-specific metrics; sometimes it only manifests as “more 500s or TCP errors”.
2. Flip the service to debug mode, either via debug-level logs or by checking the pprof port or whatever the equivalent is.
3. Check logs, either via the cloud logger or directly on the box (if cloud logs are too slow). Check ps aux | grep -i <SERVICE_NAME>; usually the log configuration is passed in via the command line or a special config file.
  - If not, then resort to lsof to find open files and see if anything looks like a log
  - Log analytics
    - grep -i <thing> <LOG_FILE> | cut -d " " -f <FIELD> | sort | uniq -c
4. Is the program stalled? Did the service crash and restart? Is there a core dump? Are the dependent services routable? OS error?
  - OOMKilled -> check syslogs, check dmesg
  - Crashes -> Pop the core dump and see if there’s anything useful
  - Stalled -> ps aux | grep <SERVICE_NAME> to find the PID, then strace -e trace=network -p <PID>. Maybe tcpdump to validate that your services are actually receiving traffic.
  - Dependencies -> Can you actually talk to your dependencies? This usually manifests in the logs first, but ping <DEP> or nc to check
  - OS Error -> dmesg
5. Other pathologies?
  - Deeper debugging needed: get a packet capture via tcpdump -w <FILE> and put it somewhere so you can replay or re-examine it later
  - pprof? :observer? gdb/pdb? Get something that allows you to inspect the program’s internal state. Possibly the only real way to do this without dramatically impacting performance is sampling in gdb: take a snapshot every 10ms or something.
  - If you can redeploy and test with new logs, do it. If you can’t (possibly because you don’t know the conditions), the best you can do is get the right set of data to reproduce the metric.
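The log-analytics one-liner from step 3 can be wrapped in a small shell function. This is a rough sketch, not a definitive tool: the name top_values is hypothetical, and it assumes log fields are separated by single spaces (cut does not collapse repeated spaces), so the field number you pass depends entirely on your log format.

```shell
# top_values PATTERN FIELD FILE
# Hypothetical helper: among lines of FILE matching PATTERN
# (case-insensitive), count how often each value of the given
# space-separated FIELD occurs, most frequent first.
# ASSUMPTION: fields are separated by exactly one space.
top_values() {
    grep -i -- "$1" "$3" |
        cut -d ' ' -f "$2" |
        sort |
        uniq -c |
        sort -rn
}
```

For a log whose third field is the component name, top_values error 3 <LOG_FILE> would show which components are producing the most error lines, which is often enough to narrow down where to look next.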