outage and debugging for the warring programmer
drafts, postmortems and outages
I’ve been shoring up how we handle outages and issues at work lately, so I figured I’d write up how I try to handle them. This post assumes you already know what /proc and the like are.
- Alert comes in
- Triage Mode
- Outage? Are we violating SLOs? Check dashboards and alerts
- Crash recovery? What happens if the service crashes right now? Are in-flight jobs recoverable, or would we violate data integrity?
- Rollback? Deployments? Fix forward? Figure this out before moving to the next stage
- Debugging mode
- Is the cause obvious in dashboards? Sometimes the answer is obvious from your service-specific metrics, sometimes it only manifests as “more 500s or TCP errors”.
- Flip the service to debug mode, either by enabling debug-level logs or by checking the pprof port or whatever equivalent you have
- Check logs, either via the cloud logger or directly on the box (if cloud logs are too slow). Check `ps aux | grep -i <SERVICE_NAME>`; log configuration is usually passed in via the command line or a special config file.
  - If not, then resort to `lsof -p <PID>` to find open files and see if anything looks like a log.
  - Log analytics: `grep -i <thing> <LOG_FILE> | cut -d " " -f <FIELD> | sort | uniq`
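The log-hunting steps above can be sketched as two small shell helpers. This is a rough sketch under stated assumptions, not a definitive tool: `open_logs` and `count_field` are hypothetical names, `open_logs` assumes a Linux /proc and that log paths contain the string "log", and `count_field` swaps in `uniq -c` plus a reverse numeric sort so the most frequent values come first.

```shell
# Print paths the process has open that look like logs (Linux /proc only).
# Assumption: log files have "log" somewhere in their path.
open_logs() {
  local pid="$1" fd target
  for fd in /proc/"$pid"/fd/*; do
    # Each entry is a symlink to the open file; skip ones we cannot read.
    target=$(readlink "$fd") || continue
    case "$target" in
      *[Ll][Oo][Gg]*) printf '%s\n' "$target" ;;
    esac
  done
}

# Count distinct values of one space-separated field among matching lines,
# most frequent first.
# Usage: count_field <pattern> <file> <field-number>
count_field() {
  local pattern="$1" file="$2" field="$3"
  grep -i -- "$pattern" "$file" | cut -d " " -f "$field" | sort | uniq -c | sort -rn
}
```

For example, `count_field 500 <LOG_FILE> 2` would show which second-column values appear most often on lines mentioning 500, assuming the log is space-delimited.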
- Is the program stalled? Did the service crash and restart? Is there a core dump? Are the dependent services routable? Is there an OS error?
  - OOMKilled -> check syslog, check `dmesg`
  - Crashes -> Pop the core dump and see if there’s anything useful
  - Stalled -> `ps aux | grep -i <SERVICE_NAME>` and `strace -e trace=network -p <PID>`. Maybe `tcpdump` to validate that your service is actually receiving traffic.
  - Dependencies -> Can you actually talk to your dependencies? This usually manifests in the logs first, but use `ping <DEP>` or `nc` to check.
  - OS Error -> `dmesg`
  - Other pathologies?
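Parts of the checklist above can be scripted. A minimal sketch, assuming a Linux /proc and an `nc` variant that supports `-z` and `-w` (flags differ between netcat flavors); `proc_state`, `oom_lines` and `dep_reachable` are hypothetical helper names, not anything standard.

```shell
# Print the kernel's view of the process state, e.g. "S (sleeping)" or
# "D (disk sleep)". A process stuck in D is usually blocked on I/O.
proc_state() {
  local pid="$1"
  awk -F'\t' '$1 == "State:" { print $2 }' "/proc/$pid/status"
}

# Filter kernel log text on stdin down to lines that look like OOM kills.
# Usage: dmesg | oom_lines    (reading dmesg may require root)
oom_lines() {
  grep -iE 'out of memory|oom-kill'
}

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Usage: dep_reachable <DEP> <PORT> || echo "cannot reach dependency"
dep_reachable() {
  local host="$1" port="$2"
  nc -z -w 3 "$host" "$port" >/dev/null 2>&1
}
```

None of this replaces reading the logs; it just makes the first pass over "stalled, OOM-killed, or unreachable dependency?" quicker to repeat.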
- Deeper debugging needed: get a packet capture via `tcpdump -w <FILE>` and store it somewhere you can replay or re-examine it later.
  - `pprof`? `:observer`? `gdb`/`pdb`? Get something that allows you to inspect the program’s internal state. Possibly the only real way to do this without dramatically impacting performance is sampling in `gdb`: take a snapshot every 10ms or something.
  - If you can redeploy and test with new logs, do it. If you can’t (possibly because you don’t know the conditions), the best you can do is collect the right set of data to reproduce what the metrics showed.
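The gdb sampling idea can be sketched roughly like this. Everything here is an assumption on my part rather than a tested recipe: `sample_stacks` is a hypothetical name, the `GDB` override variable exists only so a different debugger binary can be swapped in, and because attaching gdb pauses the process for each sample, a realistic interval is probably much coarser than 10ms.

```shell
# Poor man's sampling profiler: attach gdb repeatedly and dump all thread
# backtraces to numbered files.
# Usage: sample_stacks <pid> <samples> <interval-seconds> <outdir>
sample_stacks() {
  local pid="$1" samples="$2" interval="$3" outdir="$4" i=1
  mkdir -p "$outdir" || return 1
  while [ "$i" -le "$samples" ]; do
    # -batch makes gdb exit after the -ex command, detaching from the process.
    "${GDB:-gdb}" -p "$pid" -batch -ex "thread apply all bt" \
      > "$outdir/stack.$i.txt" 2>&1
    sleep "$interval"
    i=$((i + 1))
  done
}

# Packet capture to a file for later analysis (usually needs root; <IFACE>,
# <PORT> and the output path are placeholders):
#   tcpdump -i <IFACE> -w /tmp/capture.pcap port <PORT>
# Read it back later with:
#   tcpdump -r /tmp/capture.pcap
```

Frames that show up in most of the sampled backtraces are a decent hint about where the program is spending its time or where it is stuck.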