عجفت الغور

reflections on programming 5 years


I’m entering my sixth year of programming this year, and like any developer at this point, I’ve started to have all sorts of thoughts about it. This is a loose collection of the major lessons I’ve learned over the past five years. Funnily enough, I think most of this is not directly teachable; I arrived at the majority of these lessons by failing over and over.

Outage Driven Development

I’ve worked at two startups and one large company, and one of the startups was a cloud metrics company, so I’ve seen my fair share of outages. There’s plenty to be said about handling an outage (SRE runbooks et al.), but I think it’s important to think about how outages affect our development process. After the postmortem, there’s often a frantic scramble to set up systems so that the outage doesn’t happen again, but oftentimes the exact same outage never happens again anyway. Postmortems will often identify a root cause, but more often than not, a single problem is unlikely to cause a major outage. Outages happen because problems cascade into one another, spreading faster than operators expect.

The usual method of preventing future outages is to put more bureaucracy in place (feature flags! canary versions! blue/green deploys!). I’m largely in favor of these, but having seen them cut into the release process, I’ve soured on feature flags that require manual enablement, purely because they 1) double the amount of code needed and 2) are often forgotten about after the deployment is done. Removing feature flags is a non-trivial cost to the development cycle, so I end up leaning towards systems that are totally automated during the development process, such as automated sandbox tests or replay tests in CI.

I think replay tests are hugely powerful: capturing a section of live traffic, shrinking it with some form of network dialogue minimization, and running it against a sandbox environment gives you good guarantees about how your code is going to behave. Ideally, think about deploys the way you would think about a machine learning model moving from training to inference: you want to prevent distributional shift. The traffic you code against should be the traffic you run against, and replay testing goes a long way towards that. We should take care not to adopt processes that hamper developer velocity just because of previous outages.

Outages, while they are useful sources of information, often get treated by developers as whispered taboos, things that we shouldn’t do because they caused a problem in the past. I used to work with a group that absolutely refused to use RabbitMQ because of a single major outage six years earlier. Oftentimes juniors, who are unshackled by past outages, will be the ones to suggest the taboo option. We should embrace that (within reason), because outages are a great learning experience for juniors as well. I’m very fond of bringing juniors in during an outage (not on the front line, but participating in the actual outage and the postmortem) because outage handling is like riding a bike: I can tell you all the physics that goes into a bike, but unless you get on and fall, you’re not going to learn.
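
To make the replay-testing idea from this section concrete, here is a minimal sketch of what a replay check in CI might look like. Everything in it is an assumption for illustration: the capture format (one JSON object per line), the `Exchange` record, and `candidate_handler`, which stands in for the new build running in a sandbox. The `minimize` step is only a crude stand-in (dropping duplicate requests) for real network dialogue minimization, which is a much more involved technique.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Exchange:
    """One captured request plus the response production gave (hypothetical format)."""
    method: str
    path: str
    body: str
    status: int
    response: str


def load_capture(lines: Iterable[str]) -> List[Exchange]:
    """Parse a capture stored as one JSON object per line, skipping blank lines."""
    return [Exchange(**json.loads(line)) for line in lines if line.strip()]


def minimize(capture: List[Exchange]) -> List[Exchange]:
    """Crude minimization: keep only the first exchange for each distinct request."""
    seen = set()
    kept = []
    for ex in capture:
        key = (ex.method, ex.path, ex.body)
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


# A handler takes (method, path, body) and returns (status, response).
Handler = Callable[[str, str, str], Tuple[int, str]]


def replay(capture: List[Exchange], handler: Handler) -> List[Tuple[Exchange, Tuple[int, str]]]:
    """Send every recorded request to the handler and collect the mismatches."""
    mismatches = []
    for ex in capture:
        got = handler(ex.method, ex.path, ex.body)
        if got != (ex.status, ex.response):
            mismatches.append((ex, got))
    return mismatches


def candidate_handler(method: str, path: str, body: str) -> Tuple[int, str]:
    """Hypothetical stand-in for the new build running in a sandbox."""
    if method == "GET" and path == "/health":
        return 200, "ok"
    if method == "POST" and path == "/echo":
        return 200, body
    return 404, "not found"


# Hypothetical captured traffic; a real capture would come from production.
CAPTURE_LINES = [
    '{"method": "GET", "path": "/health", "body": "", "status": 200, "response": "ok"}',
    '{"method": "POST", "path": "/echo", "body": "hi", "status": 200, "response": "hi"}',
    '{"method": "GET", "path": "/health", "body": "", "status": 200, "response": "ok"}',
    '{"method": "GET", "path": "/missing", "body": "", "status": 404, "response": "not found"}',
]

if __name__ == "__main__":
    capture = minimize(load_capture(CAPTURE_LINES))
    mismatches = replay(capture, candidate_handler)
    for expected, got in mismatches:
        print(f"MISMATCH {expected.method} {expected.path}: got {got}")
    print(f"{len(capture)} exchanges replayed, {len(mismatches)} mismatches")
```

In a CI job, a non-empty mismatch list would fail the build. The point of the sketch is only the shape of the check: the new code is judged against traffic that actually happened, rather than against cases someone remembered to write down.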

Code

“Optimize for readability” is an oft-repeated phrase, and I am in pretty full-throated agreement. However long you expect your code to live, take that time and double it, because the half-life of any single piece of code is going to be longer than you expect. People who say there shouldn’t be comments are dead wrong: programming is an oral tradition, and we should use that to its full advantage. Code does not speak for itself, so we have to. I love this tweet from @mcclure111, where she says:

Since most software doesn’t have a formal spec, most software “is what it does”, there’s an incredible pressure to respect authorial intent when editing someone else’s code. You don’t know which quirks are load-bearing.
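
As a small illustration of comments carrying intent, here is a sketch of a function whose quirks would look like accidents without them. The function, the legacy client, and the 300-second cap are all invented for this example; the point is only that each comment records why a quirk exists, so the next editor knows it is load-bearing.

```python
def parse_retry_after(header_value: str) -> int:
    """Parse a retry delay into whole seconds, clamped to the range 0..300."""
    value = header_value.strip().lower()

    # WHY: a (hypothetical) legacy client sends "5s" instead of "5".
    # Dropping this branch looks like a harmless cleanup but breaks that client.
    if value.endswith("s"):
        value = value[:-1]

    seconds = int(value)

    # WHY: clamp instead of raising. In this made-up scenario a bad upstream
    # value once stalled retries for hours, so nonsense gets a sane bound.
    return max(0, min(seconds, 300))
```

Without the two WHY comments, both branches are prime candidates for deletion by a well-meaning refactor.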

Designs

Design docs rule. The goal of a design doc is not to capture accurate information about what a future system does, but rather to preserve the context in which a piece of code was written. Ideally, a good design doc tells you what state the related systems are in, what you would like the current system to become in the future, and what the potential pitfalls are. This is immensely powerful: a chain of design docs shows you the evolution of the system. Design docs not only make sure everyone is on the same page before any code is written, they also have a long-tail benefit: onboarding gets better, since most of a newcomer’s questions about quirks should already be answered in the RFCs.

Tools

PG/Kube/Terraform/Consul/Cassandra/Protobuf/Flatbuf/GraphQL/Kafka/Rabbit/etc are all the same stuff. I don’t even mean that cheekily. In the modern era, if your thing is not an RPC thing with a database and some special pipes in between, what are you even doing? I used to fight a lot about the particular tool, but now I’m more in favor of just asking “is it an RPC system, a database, or a pipe? Is any piece trying to be more than that?”

Use code search. Not the terrible GitHub code search, but code search that’s aware of syntax, like OpenGrok or Sourcegraph, or whatever your company has, or whatever your IDE provides. Code search will save you an immense amount of time during development and outages.