18 July 2021

Safety Nets Are Not There to be Used

Looking at some of the drives for improvement that tend to be given top billing, there is often an emphasis on levels of automated testing and other safety measures as part of CI/CD. These are all important, but one thing that is not often stressed is the fundamental importance of the engineers themselves in the process.

Ultimately, whatever is running in production originated as lines of code written by one or more individual engineers, and at no other point in the process can any human or automated step be expected to have the same degree or concentration of relevant knowledge about the code and the context in which it was written. Given this, it is essential that we invest time and money in hiring the best engineers and growing juniors into the best they can be, and ensure they all have a real appreciation of how critical their role is. Sure, mistakes will always happen, and things will change that allow bugs to emerge unexpectedly, but concentrating too much on the downstream elements of the software delivery process is like employing more and more people to place their hands over the cracks in a dam instead of finding someone who can figure out how to strengthen the dam - you can't hold back the inevitable torrent from a mass of leaks.

And what is it that we need from good engineers? It is almost certainly not their experience of writing code - frankly, anyone can do that. If software engineering were just about writing algorithms to fulfil requirements, you wouldn't need engineers - it would all come down to accurate and concise specification of requirements, rules for mapping input to expected output, and that is indeed a niche filled by low-code solutions. However, these don't scale to enterprise platforms, because the real engineering craft is about seeing connections and understanding unwritten implications, and that is the skill we need to invest in - otherwise it's like taking good bricklayers and expecting them to build a house. That just won't work unless they understand lintels and beams, and the implications of not having the right ones (okay sorry, I'll stop the analogies now).

So how do we get the best engineering teams? I'd say it's all about continuous practice at problem solving and fixing things. Ideally most of this practice is in the specific software landscape they are part of, since that is the knowledge and understanding they need to build, but other problem solving helps too - cryptic crosswords are great for practising spotting connections and seeing past the obvious interpretations; striving to excel in sport, music and art all help us learn to spot patterns, make decisions and reflect, in real time, on how the implications of those decisions pan out.

When it comes to problem solving in software, this is a key benefit of a DevOps team. The alternative, a development team that just writes new code, will focus on what they need to know to carry out a task - but how do they know what they need to know? It is really hard to know enough about the far reaches of the system, and become familiar with the folklore surrounding it, if you are not interacting with it regularly. When the development team is responsible for ALL production issues, each individual member is constantly being exposed to unfamiliar parts of the system, honing their skills in investigating their own and other people's code and documentation. Every team member needs to be great at tracking down and fixing bugs - that is how we discover the connections between components, the styles and reasoning that sit behind the code, and the implications of what is running where and how it affects the code we are familiar with.

Also, when a team has to deal with tedious, simple problems, they will be driven to figure out how to prevent them, monitor for them, and recover from them - they learn to proactively anticipate and pre-empt problems that haven't yet happened. Rather than having a first line support team working from a playbook to deal with the same questions again and again, a team that is being bogged down by those questions will figure out how to remove them, and become a better and more knowledgeable team in the process. (Okay, I admit that in our team we don't have to provide 24/7 first line support - if we did, I might be happy to accept a little bit of dedicated support!)


So once the code is written, what are we getting out of the other elements of our path to production - are we clear about their role? They are all important, but crucially none of them is there to compensate for insufficiently good code. When problems occur in production we might look at the safety measures that were missed from the process, but the problem to really address is always one of knowledge or understanding in the team (maybe exacerbated by a belief that the process is there to protect us against mistakes).

Some might view unit tests as a first line of defence. However, really they are a coding tool rather than a safety measure. They are there to confirm what the code says, and their role is to help future developers understand the assumptions and expectations on which existing code is built. They only assert the beliefs of the engineer who wrote them, so they cannot catch mistakes in those beliefs. You can easily have a lot of test coverage that asserts that your code does precisely the wrong thing, and that's not going to help you. Unit tests are (naturally) testing a single isolated unit of code, not the context in which that unit runs - you are still relying on the developer of both code and unit tests to understand that context.
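
To make that concrete, here is a tiny illustration (names and rates are hypothetical): the engineer believes the tax rate is 17.5%, writes both the code and the test from that belief, and the test passes - even though, in this scenario, the business context says the rate should actually be 20%.

```java
// Hypothetical example: production code and its unit test share the same
// mistaken belief, so the test passes while the behaviour is wrong.
public class VatExample {

    // Production code, built on the belief that the rate is 17.5%.
    static double addVat(double net) {
        return net * 1.175;
    }

    // The "unit test" merely restates the engineer's belief...
    public static void main(String[] args) {
        double result = addVat(100.0);
        if (Math.abs(result - 117.5) > 1e-9) {
            throw new AssertionError("test failed: " + result);
        }
        // ...so it goes green even though, if the real rate were 20%,
        // the correct answer would have been 120.0.
        System.out.println("test passed: " + result);
    }
}
```

Full coverage, green build, wrong behaviour - the mistake lives in the shared belief, which no unit test can see past.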

Peer review might be seen as the next line of defence. While this may be a place to catch some code that was based on incorrect assumptions or understanding, ideally those mistakes shouldn't get that far - they should be figured out, discussed and questioned before and during the writing of code - it's why we have product owners and documentation and clearly written existing code - these things are there to help the engineer fully understand the context of what they are doing. Peer reviews can be a great opportunity for discussion and learning - especially to engender a critical and questioning culture in the team - but if they are needed to check that engineers have understood the requirement and know the coding basics then we might be squandering the valuable time and resources of the team.

Looking at automation, we have build tools and deployment gateways - code quality scans, plugins to enforce code rules, vulnerability scans, automated penetration tests, automated regression tests.

Code quality scans and enforcement rules are purely a development and learning tool - I've certainly been surprised by findings, and they have driven me to read up on and understand the reasoning behind them. That's really valuable for me, but I have also seen these scans used as a gateway for accepting code from external teams - this is a terrible idea and a sure way of letting technical debt accrue. It is very easy to write appalling and unmaintainable code that scores highly on a quality scan: the scan measures very specific aspects of code, and does that well, but it knows nothing about naming, meaningful documentation, or even the most basic information about business rules and what the code is hoping to achieve.

Now in vulnerability scans and penetration testing we do have some really valuable safety nets - but specifically for problems that don't originate in our code. That distinction is fundamental: although we want engineers who have a breadth of knowledge and who are happy to delve into open source code (or decompiled third party code) and understand it, the quantity of external code in any non-trivial system makes it impossible and undesirable to know how all of it works, or even just to keep up to date with CVEs - you can only hold onto a certain amount of knowledge, and we do want to leave some space for our own code. With penetration tests the same is true when we are finding flaws in external code, but these tests may also find flaws in our own code, and I'd argue that if you are writing code that has any kind of attack surface (i.e. all code) you need to be aware of, and implement defences against, those attacks before finding out about them in a test result. Hence, like code quality scans, these are primarily learning opportunities rather than a rubber stamp for a safe release to production - especially as you are trusting the test to be based on a complete traversal of the application. Hopefully it is, but are you sure?

Finally, regression tests, which are often misunderstood - engendering a belief that you can add more and more cases to the regression suite and then sleep easy. Yes, regression tests should keep your highest value and highest risk areas covered, and ring alarm bells when something slips through to compromise their smooth working. But regression tests should not cover more than that, and cannot reliably do so - otherwise test maintenance will gradually become the focus of all your team's resources, and you have the same fundamental issue as with the original code: the regression test code has to understand the context of what is happening in the system. You could put an awful lot of time into regression tests for scenarios that don't quite reflect reality, or at least the current reality.


In summary, invest in automation, but always be conscious of what and how much protection each element offers - it is not a substitute for due diligence by the development team. Sure, you need to spend time and money on safety nets, and best make sure they're good ones, but don't forget to hire damn good trapeze artists and make sure they're practising every day - because there's going to be an awful mess if they miss the safety nets, crash into each other, or smack straight into the pole that's keeping your whole big top standing! (and that's the last analogy, I promise)

15 February 2021

Dependency Management in Microservices with Shared Libraries

If you have a set of microservices using a common language, it is sometimes reasonable to extract commonly used behaviour into shared libraries. One of the consequences of this is an increase in the complexity of your dependency graph, and so there are some important decisions to make about how to manage versions of transitive dependencies on your own, and third party, libraries. This post specifically talks about Java or Kotlin with Maven.

If your code is going to remain fixed, and it doesn't depend on any libraries that could conceivably present any security risk, then it doesn't matter too much how you manage dependencies. However, in the real world there are two reasons that versions of transitive dependencies change - vulnerability patching and changes to functionality (either upgrading to a later version of an existing library for its new functionality, or bringing in a new library which depends on a different version of one of your existing dependencies).

Our goal in both cases (but particularly the first) should be to make version changing as simple as possible, and with as little risk of unforeseen consequences as possible.

There are three broad approaches:


  1. Manage versions from the bottom (shared libraries enforce versions of transitive dependencies)
  2. Manage versions from the top (microservices enforce versions of transitive dependencies)
  3. Manage versions from outside (versions are enforced in a separate common definition - for example the spring-boot parent)

 

At first glance the first approach sounds compelling, especially if you have a large number of services - any vulnerability can be addressed in one place and will simply propagate out to the services from there. However, there is a fallacy here, and a few serious dangers.

The fallacy is that managing a version in one place simplifies things for microservices. This does not hold, because the services still need to up-version their dependency on the shared library - so rather than simplifying the management, you have just added another level of indirection.

The dangers all relate to complexity. First, how can you be sure that one library is the only one managing that dependency? If two libraries share the same transitive dependency, which one should manage it? What if some microservices use one of those libraries and others use the other? You'd have to manage the version in both libraries - and then what if they get out of step?

As soon as you have multiple shared libraries with the same transitive dependencies, addressing a security vulnerability starts with a time-consuming investigation to discover what is actually defining the version. Since Maven 3, the verbose option of dependency:tree is broken, but you can deliberately invoke the older 2.10 version of the plugin, where it still works:

 

  mvn org.apache.maven.plugins:maven-dependency-plugin:2.10:tree -Dverbose=true | less

 

but while this tells you that a library has been dependency managed, it isn't necessarily trivial to discover which POM is actually defining that version:


  08:33:06,547 [INFO] |  |  +- com.github.bohnman:squiggly-filter-jackson:jar:1.3.6:compile
  08:33:06,547 [INFO] |  |  |  +- org.antlr:antlr4-runtime:jar:4.6:compile
  08:33:06,547 [INFO] |  |  |  +- (org.apache.commons:commons-lang3:jar:3.11:compile - version managed from 3.4; omitted for duplicate)
  08:33:06,548 [INFO] |  |  |  +- (com.google.guava:guava:jar:19.0:compile - omitted for conflict with 20.0)
  08:33:06,548 [INFO] |  |  |  +- (com.fasterxml.jackson.core:jackson-databind:jar:2.11.4:compile - version managed from 2.6.4; omitted for duplicate)
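
One way to continue the hunt (a sketch, not the only route) is to render the fully-resolved effective POM and search it for the artifact in question - commons-lang3 here is just the example from the tree output above:

```shell
# Write the effective POM (with all inherited and imported
# dependencyManagement sections merged in) to a file...
mvn help:effective-pom -Doutput=effective-pom.xml

# ...then search it for the dependency whose version is being managed.
grep -n -B 2 -A 2 "commons-lang3" effective-pom.xml
```

This shows you the winning managed version, though not always which ancestor POM contributed it - for that you may still end up opening the shared library POMs one by one.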

 

Another serious problem is that vulnerability patching and functionality are now tied together. What happens if your microservice depends on version 1.0.0 of a shared library, which has since evolved to version 2.0.0 with a breaking change? Now a vulnerability is discovered in a dependency, and the shared library updates the managed version, releasing this as 2.0.1. Your microservice can't up-version without addressing the backward compatibility issues, so the simple route is to dependency manage the vulnerable transitive dependency in your microservice POM - when there are multiple conflicting managed versions, Maven takes a 'closest wins' strategy. However, now you are starting to create a sprawling mess of conflicting version management, and next time you need to deal with a security vulnerability, your job will be twice as hard.
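
As a sketch (coordinates and versions are hypothetical), that emergency override in the microservice POM looks something like this - the microservice's own dependencyManagement entry is 'closer' than the one inherited through the shared library, so it wins:

```xml
<!-- Microservice pom.xml (fragment): pin the vulnerable transitive
     dependency directly. Because this entry sits in the microservice
     itself, Maven's 'closest wins' mediation picks it over the version
     managed inside the shared library. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.12.0</version> <!-- hypothetical patched version -->
    </dependency>
  </dependencies>
</dependencyManagement>
```

It works as a quick fix, but each such override is exactly the kind of scattered version management that makes the next vulnerability harder to chase down.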




The second approach is conceptually an improvement - microservices should be in control of their own versions. The reality is not so simple, of course: if multiple versions of a transitive dependency are used by classes from different shared libraries at runtime then, unless you are running with OSGi, one version will win, and you have to make sure that the winning version is compatible with all usages. That is not something that can be solved by any particular strategy - it just has to be managed - but do you want to figure that out and manage it in every one of your microservices? In the absence of breaking changes the highest version should be the right choice (hence that is the default behaviour in Gradle), which maybe simplifies the decision making, but doesn't remove the overhead.

The downside of this approach is the sheer amount of dependency management that may be required. A pragmatic 'laissez faire' variant is just to let the Maven rules untangle dependencies and dependency management for all transitive dependencies until you detect a security vulnerability, and only then add dependency management in the microservice. Even so, with a large number of microservices that can mean a lot of versions being specified in a lot of places - and where microservices are not in active development, the overhead of up-versioning various dependencies across many of them can be very time consuming.




The third approach addresses the downsides of both the other approaches, to an extent. The main advantage is that rather than an ad-hoc, wild-west set of versions, there is some central control, which can be exercised by team members with the right level of expertise; and where microservices have an intelligent set of tests, this can reduce the risk of an up-version having unexpected results because of a usage that hadn't been fully considered. This is the approach taken by spring-boot, where a core of experts and a community of collaborators take an opinionated view of which versions work together correctly in order to provide the services that Spring claims to provide. It is this combination of expertise and community collaboration that gives developers a high degree of confidence that up-versioning spring-boot will not create unexpected side-effects in any of the code within Spring's jurisdiction. For dependencies outside spring-boot, a similar approach can be taken by a team or set of teams: if versions are managed centrally, and any changes are managed and tested, then microservices can have a degree of confidence that the shared libraries they depend on will provide the services they claim, without side effects.

Spring-boot projects will tend to use a spring-boot parent, so one option is to have a team parent POM that extends it. However, this means the parent also determines the version of spring-boot, and any update will have a large blast radius. An approach that is easier to manage is the use of a team BOM - or more than one, if there are logical groupings of dependencies.
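
A minimal sketch of that setup (coordinates hypothetical): the team BOM is an ordinary POM with packaging 'pom' and a dependencyManagement section, and each microservice imports it rather than inheriting from it:

```xml
<!-- Microservice pom.xml (fragment): import the hypothetical team BOM.
     scope=import splices the BOM's dependencyManagement into this POM
     without making it the parent, so the spring-boot parent (and hence
     the spring-boot version) can be chosen and updated independently. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.example.team</groupId>
      <artifactId>team-bom</artifactId>
      <version>1.4.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```

When a vulnerability needs patching, only the team BOM changes, and each microservice's update is a single version bump on the import.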

The 'closest wins' approach of Maven means that any microservice can use BOMs but still override the imposed versioning in the (hopefully) rare cases where it is required. Ultimately managing dependencies just IS complex, but making the right decisions and ensuring you have adequate controls in your CI and CD is what will keep your team as nimble as possible in this complex landscape.