Chris's Blog: November 2020

In our team we try to deploy to production every sprint. We run a platform composed of a number of microservices (hey, don't we all now) and so as each component progressively moves forward, at some point we can see that a feature emerges in production.

It's in the nature of our platform that many of our features are inward facing and so these small deployments tend to be safe - even if something goes wrong it is not a big issue especially as we've put in the fundamental design groundwork to ensure that our components recover gracefully and generally without manual intervention. However just recently we've had a couple of features with dependencies on external teams and we have been holding off the production release of individual stories pending integration testing. This big-bang approach always has been painful and and these large external facing features brought the issue of backward compatibility into focus, especially as other features are in progress at the same time - having lost the momentum of regular releases, I found the question of whether a single component can be safely released becoming more and more difficult to answer with 100% certainty.

At this point, we hit an example of where we had got this spectacularly wrong - so here was a great opportunity to learn something.

Here's the background - our platform's components for the most part talk to each other by producing or consuming events. These events have to evolve from time to time so we have a 2 level version number in each event - top level for breaking changes, second level for non-breaking changes - eg 1.0. One of the key events in the system is produced by one component and consumed by three. Several months ago it had to be changed in a non-compatible way and so it went from version 1.0 to version 2.0, and then a few weeks ago we added some new properties and so it went up to 2.1. In theory the producing component could go live without affecting anyone as this is a non-breaking change. However as soon as we got into E1 we discovered this was not the case - 2.1 events could not be handled by the existing consumers. The result is that our producer is dependent on those other three components. It can't go live until their fixed versions go live. One of those components is outward facing, and as the external users timescales have slipped so has the release of that component and hence our event producer. Now we are in a world of pain. Other changes are backing up but can't be released. Each slippage requires discussions and meetings to understand the impact and the new order of releases. This is a dark place - we will emerge, but with scars!

So how did a backward compatible change become a breaking change? The answer is in one single line of code written a long time ago.

The three consumers of the event all share a common deserialiser library (okay, there's another discussion there but that's for another day). When the event went from 1.0 to 2.0 some custom code was added. The code that was written effectively said:

if (event.version.equals("2.0") {  
  // treat this as a 2.n style event  
} else {  
  // treat this as a 1.n style event  
}  

You can probably see the problem already - line 01 means that a version 2.1 event gets treated as a 1.n style event and deserialisation fails.

Now I don't know who wrote this code, nor does it matter. With hindsight it is easy to see the problem, and the fix is trivial, but months ago when this was first written, it's an easy mistake to make, and mistake is probably even a strong word - all the unit tests passed and this code worked fine for months. I'm sure I've written worse things and will do so again.

We could look at what happened, come up with a coding guideline about dealing with event versions, stick it on our internal wiki and move on, but that would be fairly futile - the same situation may not arise for months or years if ever, and it is missing the wider point.

The real thing to realise here is that although at the time of writing it was easy to miss the potential problem, it is also the case that it was entirely possible to recognise the potential problem. There are no unknowns here - looking at the line of code, the fact that it is referring to a two level version number means that implicitly we can know that at some point in the future the next thing that will happen is going to be either a minor version increment or a major version increment. So even with limited context it is clear what is inevitable, and the code could account for that without knowing exactly what the next change will be or when that will happen.

A key point is that some of the things that our code has to deal with are inevitable - for example future changes to requirements, failures when making outside calls, invalid input from outside. Every line of code that we write, however small and innocuous it may seem needs to account for the things that are inevitable, whether that involves defensive treatment of input, adding useful logging, or ensuring the way we write branching statements accounts for inevitable change. This isn't the same as over-engineering - writing complex generic code to cater for hypothetical future requirements that may never happen, this is engineering for the inevitable.

One of the main ways of achieving this is to not rush. We don't need to. Our capacity to release useful functionality is not constrained by our ability to write lines of code. Given a list of problems, any of us could write hundreds, maybe thousands of lines of code within a sprint to solve those problems. Our capacity to release functionality is more often constrained by the need to deal with the fallout from previous decisions - looking into unexpected behaviour and trying to figure out what happened and why, trying to update code that has gradually deteriorated into a ball of spaghetti, trying to figure out the order of deployments because a line of code turned out to change a backward compatible interface change into a breaking change.

What marks us out as professional engineers is not our ability to write lines of code, it is our ability to consider each line of code and understand where it sits in the present and the future - how does it live alongside existing code, how will it cope with everything that the outside world and the future throws at it. We won't always get it right but recognising this means that we can always strive to do better. We need to give those lines of code the love and attention they deserve.

Chris's Blog

09 November 2020

Every Line of Code Needs Love, Care and Attention

About Me

Blog Archive