18 July 2021

Safety Nets Are Not There to be Used

Looking at the drives for improvement that tend to be given top billing, there is often an emphasis on levels of automated testing and other safety measures as part of CI/CD. These are all important, but one thing that is not often stressed is the fundamental importance of the engineers themselves in the process.

Ultimately, whatever is running in production originated as lines of code written by one or more individual engineers. At no other point in time can any human or automated part of the process be expected to have the same degree or concentration of relevant knowledge about the code and the context in which it was written. Given this, it is essential that we invest time and money in getting the best engineers and growing juniors into the best they can be, and that we ensure they all have a real appreciation of how critical their role is. Sure, mistakes will always happen, and things will change that allow bugs to emerge unexpectedly, but concentrating too much on the downstream elements of the software delivery process is like employing more and more people to place their hands over the cracks in a dam instead of finding someone who can figure out how to strengthen the dam - you can't hold back the inevitable torrent from a mass of leaks.

And what is it that we need from good engineers? It is almost certainly not their experience of writing code - frankly, anyone can do that. If software engineering were just about writing algorithms to fulfil requirements, you wouldn't need engineers at all - it would all come down to accurate and concise specification of requirements, rules for mapping input to expected output, and that is indeed a niche filled by low-code solutions. However, these don't scale to enterprise platforms, because the real engineering craft is about seeing connections and understanding unwritten implications, and that is the skill we need to invest in. Otherwise it's like taking good bricklayers and expecting them to be able to build a house - that just won't work unless they understand about lintels and beams and the implications of not having the right ones (okay, sorry, I'll stop the analogies now).

So how do we get the best engineering teams? I'd say it's all about continuous practice at problem solving and fixing things. Ideally most of this practice is in the specific software landscape they are part of, since that is the knowledge and understanding they need to have, but other problem solving helps too: cryptic crosswords are great for practising spotting connections and seeing past the obvious interpretations, while striving to excel in sport, music and art all helps us learn to spot patterns, make decisions and reflect on how the implications of those decisions pan out (in real time).

When it comes to problem solving in software, this is a key benefit of a DevOps team. The alternative, a development team that just writes new code, will focus on what they need to know to carry out a task - but how do they know what they need to know? It is really hard to know enough about the far reaches of the system, and to become familiar with the folklore surrounding it, if you are not interacting with it regularly. When the development team is responsible for ALL production issues, each individual member is constantly being exposed to unfamiliar parts of the system, honing their skills in investigating their own and other people's code and documentation. Every team member needs to be great at tracking down and fixing bugs, because that is how we discover the connections between components, the styles and reasoning that sit behind the code, and the implications of what is running where and how it affects the code we are familiar with. Also, when a team has to deal with tedious, simple problems, they will be driven to figure out how to prevent them, to monitor for them, and to recover from them - they learn to proactively pre-empt and anticipate problems that haven't yet happened. Rather than having a first line support team working from a playbook to deal with the same questions again and again, a team being bogged down by those questions will figure out how to remove them, and become a better and more knowledgeable team in the process. (Okay, I admit that in our team we don't have to provide 24/7 first line support - if we did, I might be happy to accept a little bit of dedicated support!)

So once the code is written, what are we getting out of the other elements of our path to production - are we clear about their role? They are all important, but crucially none of them are there to compensate for insufficiently good code. When problems occur in production we might look at the safety measures that were missing from the process, but the problem to really address is always one of knowledge or understanding in the team (perhaps exacerbated by a belief that the process is there to protect us against mistakes).

Some might view unit tests as a first line of defence, but really they are a coding tool rather than a safety measure. They are there to confirm what the code says, and their role is to help future developers understand the assumptions and expectations on which existing code is built. They only assert the beliefs of the engineer who wrote them, so they cannot catch mistakes in those beliefs. You can easily have a lot of test coverage asserting that your code does precisely the wrong thing, and that's not going to help you. Unit tests are (naturally) testing a single isolated unit of code, not the context in which that unit runs - you are still relying on the developer of both the code and the unit tests to understand that context.
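
To make that concrete, here's a contrived sketch (JUnit 5, entirely hypothetical names - not from any real codebase) where the code and its test share the same mistaken belief, so the suite passes with full coverage while confidently asserting the wrong behaviour:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class PriceCalculator {
        // The engineer believes the discount applies to the gross price;
        // the unwritten business rule says it applies to the net price.
        static double discountedPrice(double gross, double discountRate) {
            return gross * (1.0 - discountRate);
        }
    }

    class PriceCalculatorTest {
        // The test encodes the same mistaken belief, so it passes -
        // 100% coverage, precisely asserting the wrong thing.
        @Test
        void appliesDiscountToGross() {
            assertEquals(90.0, PriceCalculator.discountedPrice(100.0, 0.1), 0.001);
        }
    }

Nothing inside the unit or its test can reveal the mistake - only someone who understands the context can.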

Peer review might be seen as the next line of defence. While it may catch some code that was based on incorrect assumptions or understanding, ideally those mistakes shouldn't get that far - they should be figured out, discussed and questioned before and during the writing of the code. It's why we have product owners and documentation and clearly written existing code: these things are there to help the engineer fully understand the context of what they are doing. Peer reviews can be a great opportunity for discussion and learning - especially for engendering a critical and questioning culture in the team - but if they are needed to check that engineers have understood the requirement and know the coding basics, then we might be squandering the valuable time and resources of the team.

Looking at automation, we have build tools and deployment gateways - code quality scans, plugins to enforce code rules, vulnerability scans, automated penetration tests, automated regression tests.

Code quality scans and enforcement rules are purely a development and learning tool. I've certainly been surprised by findings, and they have driven me to read up on and understand the reasoning behind them. That's really valuable for me, but I have also seen these scans used as a gateway for accepting code from external teams - this is a terrible idea and a sure way of letting technical debt accrue. It is very easy to write appalling and unmaintainable code that scores highly on a quality scan: the scan measures very specific aspects of the code, and does that well, but it knows nothing about naming, meaningful documentation, or even the most basic information about business rules and what the code is hoping to achieve.
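
As a sketch of what I mean (entirely hypothetical code), something like the following would sail through most rule-based checks I've encountered - short method, low complexity, no duplication, even doc comments - while telling a maintainer nothing at all about the business rule it implements:

    /** Processes the data. */
    public final class DataHelper2 {

        private DataHelper2() {
        }

        /** Returns the processed value. */
        public static double process(double a, double b) {
            return a * 0.2 + b; // why 0.2? what are a and b? which rule is this?
        }
    }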

Now in vulnerability scans and penetration testing we do have some really valuable safety nets - but specifically for problems that don't originate in our code, and that's fundamental. Although we want engineers who have a breadth of knowledge and who are happy to delve into open source code (or decompiled third party code) and understand it, the quantity of external code in any non-trivial system makes it impossible and undesirable to know how all of it works, or even just to keep up to date with CVEs - you can only hold onto a certain amount of knowledge, and we do want to leave some space for our own code. With penetration tests, the same is true when we are finding flaws in external code, but in this case the tests may also find flaws in our own code - and I'd argue that if you are writing code that has any kind of attack surface (i.e. all code), you need to be aware of, and implement defences against, those attacks before finding out about them in a test result. Hence, like code quality scans, these are primarily learning opportunities rather than a rubber stamp for a safe release to production - especially as you are trusting the test to be based on a complete traversal of the application. Hopefully it is, but are you sure?
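
To illustrate the sort of defence that belongs in the code long before a penetration test goes looking for it, here's a minimal sketch (hypothetical table and names) of the classic SQL injection case in Java - parameterising the query rather than concatenating input into it:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public final class AccountLookup {

        // Vulnerable: user input spliced into the SQL itself, so an owner of
        // "x' OR '1'='1" would return every row:
        //   "SELECT id, balance FROM accounts WHERE owner = '" + owner + "'"

        // Safe: the value is bound as data and is never parsed as SQL.
        public static ResultSet findByOwner(Connection conn, String owner)
                throws SQLException {
            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id, balance FROM accounts WHERE owner = ?");
            stmt.setString(1, owner);
            return stmt.executeQuery(); // caller closes via the connection
        }
    }

The point isn't this specific defence - it's that an engineer should reach for it instinctively, not after the test report arrives.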

Finally, regression tests, which are often misunderstood - engendering a belief that you can add more and more cases to the regression suite and then sleep easy. Yes, regression tests should keep your highest value and highest risk areas covered, and ring alarm bells when something slips through to compromise their smooth working, but they should not cover more than that, and cannot reliably do so. Otherwise test maintenance will gradually become the focus of all your team's resources, and you have the same fundamental issue as with the original code: the regression test code has to understand the context of what is happening in the system, and you could put an awful lot of time into regression tests for scenarios that don't quite reflect reality - or at least the current reality.
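
By way of a sketch (a hypothetical 20% VAT rounding rule, with made-up names), a focused regression test pins down one high-value behaviour - the kind a past production bug might have taught the team to protect - rather than trying to re-test every flow that happens to pass through it:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.math.BigDecimal;
    import java.math.RoundingMode;
    import org.junit.jupiter.api.Test;

    class VatRoundingRegressionTest {

        static BigDecimal addVat(BigDecimal net) {
            return net.multiply(new BigDecimal("1.20"))
                      .setScale(2, RoundingMode.HALF_UP);
        }

        // Pins one specific rule: exact half pence must round up (HALF_UP),
        // not to the nearest even digit.
        @Test
        void exactHalfPenceRoundsUp() {
            // 8.3875 * 1.20 = 10.065, which must become 10.07, not 10.06.
            assertEquals(new BigDecimal("10.07"), addVat(new BigDecimal("8.3875")));
        }
    }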

In summary, invest in automation, but always be conscious of what and how much protection each element offers - it is not a substitute for due diligence by the development team. Sure, you need to spend time and money on safety nets, and best make sure they're good ones, but don't forget to hire damn good trapeze artists and make sure they're practising every day - because there's going to be an awful mess if they miss the safety nets, crash into each other, or smack straight into the pole that's keeping your whole big top standing! (And that's the last analogy, I promise.)