18 July 2021

Safety Nets Are Not There to be Used

Among the drives for improvement that tend to be given top billing, there is often an emphasis on levels of automated testing and other safety measures as part of CI/CD. These are all important, but one thing that is not often stressed is the fundamental importance of engineers in the process.

Ultimately, whatever is running in production originated as lines of code written by one or more individual engineers - at no other point in time can any human or automated part of the process be expected to have the same degree or concentration of relevant knowledge about the code and the context in which it was written. Given this fact it is essential that we invest time and money in getting the best engineers and growing juniors into the best they can be and ensure they all have a real appreciation of how critical their role is - sure, mistakes will always happen, and things will change that allow bugs to emerge unexpectedly, but concentrating too much on the downstream elements of the software delivery process is like employing more and more people to place their hands over the cracks in a dam instead of finding someone who can figure out how to strengthen the dam - you can't hold back the inevitable torrent from a mass of leaks.

And what is it that we need from good engineers? It is almost certainly not their experience of writing code - frankly, anyone can do that. If software engineering were just about writing algorithms to fulfil requirements you wouldn't need engineers - it would all come down to accurate and concise specification of requirements - rules for mapping input to expected output - and that is indeed a niche filled by low-code solutions. However, these don't scale to enterprise platforms, because the real engineering craft is about seeing connections and understanding unwritten implications, and that is the skill we need to invest in - otherwise it's like taking good bricklayers and expecting them to build a house - that just won't work unless they understand lintels and beams and the implications of not having the right ones (okay sorry, I'll stop the analogies now).

So how do we get the best engineering teams? I'd say it's all about continuous practice at problem solving and fixing things - ideally most of this practice is specifically in the software landscape that they are part of, since this is the knowledge and understanding they need to have, but other problem solving helps too - cryptic crosswords are great for practising spotting connections and seeing past the obvious interpretations; striving to excel in sport, music and art all help us learn to spot patterns, make decisions and reflect on how the implications of those decisions pan out (in real time).

When it comes to problem solving in software, this is where a DevOps team comes into its own. The alternative, a development team that just writes new code, will focus on what they need to know to carry out a task, but how do they know what they need to know? It is really hard to know enough about the far reaches of the system and become familiar with the folklore surrounding it if you are not interacting with it regularly. When the development team is responsible for ALL production issues, then each individual member is constantly being exposed to unfamiliar parts of the system - honing their skills in investigating their own and other people's code and documentation. Every team member needs to be great at tracking down and fixing bugs and that is how we discover the connections between components, the styles and reasoning that sit behind the code, the implications of what is running where and how it affects the code we are familiar with. Also, when a team is having to deal with tedious, simple problems, they will be driven to figure out how to prevent them, to monitor for them, to recover from them - they learn to pre-empt and anticipate problems that haven't yet happened. Rather than having a first line support team working from a playbook to deal with the same questions again and again, a team that is being bogged down by those questions will figure out how to remove them, and become a better and more knowledgeable team in the process. (Okay, I admit that in our team we don't have to provide 24/7 first line support - if we did I might be happy to accept a little bit of dedicated support!)

 


 

So once the code is written what are we getting out of the other elements of our path to production - are we clear about their role? They are all important but crucially none of them are there to compensate for insufficiently good code - when problems occur in production we might look at the safety measures that have been missed from the process but the problem to really address is always one of knowledge or understanding in the team (maybe exacerbated by a belief that the process is there to protect us against mistakes).

Some might view unit tests as a first line of defence. However really they are a coding tool rather than a safety measure. They are there to confirm what the code says, and their role is to help future developers understand the assumptions and expectations on which existing code is built. They are only asserting the beliefs of the engineer that wrote them, so they cannot catch mistakes in those beliefs. You can easily have a lot of test coverage that asserts that your code does precisely the wrong thing. That's not going to help you - unit tests are (naturally) testing a single isolated unit of code - they are not testing the context in which that unit runs - you are still relying on the developer of both code and unit tests to understand that context.
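As a hypothetical sketch of that last point (the function, the test and the JUnit 5 setup are mine, not from any real codebase): suppose the requirement was a 10% discount, but the engineer believed it was a flat 10 off. Code and test share the same mistaken belief, so coverage is full, the build is green, and the behaviour is wrong.

import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// The belief baked into the code: "discount" means 10 off
fun discountedPrice(price: Double): Double = price - 10.0

class DiscountTest {
    @Test
    fun `applies the discount`() {
        // Asserts the same mistaken belief, so it happily passes
        assertEquals(90.0, discountedPrice(100.0))
    }
}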

Peer review might be seen as the next line of defence. While this may be a place to catch some code that was based on incorrect assumptions or understanding, ideally those mistakes shouldn't get that far - they should be figured out, discussed and questioned before and during the writing of code - it's why we have product owners and documentation and clearly written existing code - these things are there to help the engineer fully understand the context of what they are doing. Peer reviews can be a great opportunity for discussion and learning - especially to engender a critical and questioning culture in the team - but if they are needed to check that engineers have understood the requirement and know the coding basics then we might be squandering the valuable time and resources of the team.

Looking at automation, we have build tools and deployment gateways - code quality scans, plugins to enforce code rules, vulnerability scans, automated penetration tests, automated regression tests.

Code quality scans and enforcement rules are purely a development and learning tool - I've certainly been surprised by findings and they have driven me to read up and understand the reasoning behind them. That's really valuable for me, but I have also seen these scans used as a gateway to acceptance of code from external teams - this is a terrible idea and a sure way of letting technical debt accrue - it is very easy to write appalling and unmaintainable code that scores high on a quality scan - it measures very specific aspects of code and does that well. It knows nothing about naming, meaningful documentation, even the most basic information about business rules and what the code is hoping to achieve.

Now in vulnerability scans and penetration testing we do have some really valuable safety nets - but specifically for problems that don't originate in our code. That's fundamental - although we want engineers who have a breadth of knowledge and who are happy to delve into open source code (or decompiled third party code) and understand it, the quantity of external code in any non-trivial system makes it impossible and undesirable to know how all of it works, or even just to keep up to date with CVEs - you can only hold onto a certain amount of knowledge and we do want to leave some space for our own code. With penetration tests the same is true when we are finding flaws in external code, but in this case the tests may also find flaws in our own code, and I'd argue that if you are writing code that has any kind of attack surface (ie all code) you need to be aware of, and implement defences against, those attacks before finding out about them in a test result. Hence, like code quality scans, these are primarily learning opportunities rather than a rubber stamp for a safe release to production - especially as you are trusting the test to be based on a complete traversal of the application. Hopefully it is, but are you sure?

Finally, regression tests, which are often misunderstood - engendering a belief that you can add more and more cases to the regression suite and then sleep easy. Yes, regression tests should keep your highest value and highest risk areas covered, and ring alarm bells when something slips through to compromise their smooth working, but regression tests should not cover more than that, and cannot reliably do so - otherwise test maintenance will gradually become the focus of all your team resources, and you have the same fundamental issue as with the original code: the regression test code has to understand the context of what is happening in the system - you could put an awful lot of time into regression tests for scenarios that don't quite reflect reality, or at least the current reality.

 


 

In summary, invest in automation but always be conscious of what and how much protection each element offers - it is not a substitute for due diligence by the development team - sure, you need to spend time and money on safety nets, and best make sure they're good ones, but don't forget to hire damn good trapeze artists and make sure they're practising every day - because there's going to be an awful mess if they miss the safety nets, crash into each other, or smack straight into the pole that's keeping your whole big top standing! (and that's the last analogy, I promise)

15 February 2021

Dependency Management in Microservices with Shared Libraries

If you have a set of microservices using a common language, it is sometimes reasonable to extract commonly used behaviour into shared libraries. One of the consequences of this is an increase in the complexity of your dependency graph, and so there are some important decisions to make about how to manage versions of transitive dependencies on your own, and third party, libraries. This post specifically talks about Java or Kotlin with Maven.

If your code is going to remain fixed, and it doesn't depend on any libraries that could conceivably present any security risk, then it doesn't matter too much how you manage dependencies. However, in the real world there are two reasons that versions of transitive dependencies change - vulnerability patching and changes to functionality (either upgrading to a later version of an existing library for its new functionality, or bringing in a new library which depends on a different version of one of your existing dependencies).

Our goal in both cases (but particularly the first) should be to make version changing as simple as possible, and with as little risk of unforeseen consequences as possible.

There are three broad approaches:


  1. Manage versions from the bottom (shared libraries enforce versions of transitive dependencies)
  2. Manage versions from the top (microservices enforce versions of transitive dependencies)
  3. Manage versions from outside (versions are enforced in a separate common definition - for example the spring-boot parent)

 

At first glance the first approach sounds compelling especially if you have a large number of services - any vulnerability can be addressed in one place and will just propagate out to services from there. However there is a fallacy here and a few serious dangers.

The fallacy is that managing a version in one place simplifies things for microservices. This does not hold, because the services still need to up-version their dependency on the shared library - so rather than simplifying the management you have just added another level of indirection.

The dangers all relate to complexity. First, how can you be sure that one library is the only one managing that dependency? If two libraries use the same transitive dependency, which one should you choose to manage it? What if some microservices use one of those libraries and others use the other? You'd have to manage it in both libraries - and then what if they are out of step?

As soon as you have multiple shared libraries with the same transitive dependencies, addressing a security vulnerability starts with a time-consuming investigation to discover what is actually defining the version. Since Maven 3, the verbose option of dependency:tree has been broken, but you can work around this by deliberately down-versioning the plugin:

 

  mvn org.apache.maven.plugins:maven-dependency-plugin:2.10:tree -Dverbose=true | less

 

but while this tells you that a library has been dependency managed, it isn't necessarily trivial to discover which POM is actually defining that version:


  08:33:06,547 [INFO] |  |  +- com.github.bohnman:squiggly-filter-jackson:jar:1.3.6:compile
  08:33:06,547 [INFO] |  |  |  +- org.antlr:antlr4-runtime:jar:4.6:compile
  08:33:06,547 [INFO] |  |  |  +- (org.apache.commons:commons-lang3:jar:3.11:compile - version managed from 3.4; omitted for duplicate)
  08:33:06,548 [INFO] |  |  |  +- (com.google.guava:guava:jar:19.0:compile - omitted for conflict with 20.0)
  08:33:06,548 [INFO] |  |  |  +- (com.fasterxml.jackson.core:jackson-databind:jar:2.11.4:compile - version managed from 2.6.4; omitted for duplicate)

 

Another serious problem is that vulnerability patching and functionality are now tied together - what happens if your microservice depends on version 1.0.0 of a shared library, which has since evolved up to version 2.0.0 with a breaking change? Now a vulnerability is discovered in a dependency and the shared library updates the managed dependency, releasing this as version 2.0.1. Your microservice can't up-version without addressing the backward compatibility issues, so the simple route is to dependency manage the vulnerable transitive dependency in your microservice POM - when there are multiple conflicting managed versions, Maven takes a 'closest wins' strategy. However, now you are starting to create a sprawling mess of conflicting version management. Next time you need to deal with a security vulnerability, your job will be twice as hard.
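As a sketch of what that override looks like (jackson-databind is just borrowed from the tree output above, and the version is illustrative), the microservice POM ends up pinning the library directly:

<!-- In the microservice POM: 'closest wins', so this takes precedence over the
     version managed by the shared library (illustrative version) -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.11.4</version>
    </dependency>
  </dependencies>
</dependencyManagement>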




The second approach is conceptually an improvement - microservices should be in control of their own versions. The reality is not so simple of course - if multiple versions of a transitive dependency are used by classes from different shared libraries at runtime, then unless you are running with OSGi one version will win, and you have to make sure that the version which wins is compatible with all usages. That is not something that can be solved by any particular strategy - it's just something that has to be managed, but do you want to figure that out and manage it in every one of your microservices? In the absence of breaking changes the highest version should be the right choice (which is why that is the default behaviour in Gradle), so that maybe simplifies the decision making, but it doesn't remove the overhead.

The downside of this approach is the sheer amount of dependency management that may be required. A pragmatic 'laissez faire' approach is just to let the Maven rules untangle dependencies and dependency management for all transitive dependencies until you detect a security vulnerability, and then dependency manage in the microservice at that point. The catch is that with a large number of microservices, that can be a lot of versions being specified in a lot of places - and where microservices are not in active development, the overhead of up-versioning various dependencies across many of them can be very time consuming.




The third approach addresses the downsides in both the other approaches to an extent. The main advantage is that rather than an ad-hoc wild-west set of versions being specified, there is some central control - where that can be managed by team members with the right level of expertise; and where microservices have an intelligent set of tests, this can reduce the risk of one up-version having unexpected results because of a usage that hadn't been fully considered. This is the approach taken by spring-boot, where a core of experts and a community of collaborators can take an opinionated view of what versions work together correctly in order to provide the services that Spring claims to provide. It is this combination of expertise and community collaboration that gives developers a high degree of confidence that up-versioning spring-boot will not create unexpected side-effects in any of the code within spring's jurisdiction. For dependencies outside spring-boot, a similar approach can be taken by a team or set of teams. If versions are managed centrally, and any changes are managed and tested then microservices should have a degree of confidence that the shared libraries they are depending on will provide the services they claim, without side effects.

Spring-boot projects will tend to use a spring-boot parent, and so one option is to have a team parent POM that extends this. However, this means the parent also determines the version of spring-boot, and any update will have a large blast area. An approach that is easier to manage is the use of a team BOM - or more than one, if there are logical groupings of dependencies.
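For illustration, a minimal sketch of importing a team BOM in a microservice POM (the BOM coordinates here are hypothetical):

<!-- In the microservice POM: pull in the team's managed versions (hypothetical coordinates) -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.example.platform</groupId>
      <artifactId>team-dependencies-bom</artifactId>
      <version>1.4.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>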

The 'closest wins' approach of Maven means that any microservice can use BOMs but still override such imposed versioning in the (hopefully) rare cases that this is required. Ultimately managing dependencies just IS complex, but making the right decisions and ensuring you have adequate controls in your CI and CD is what will keep your team as nimble as possible in this complex landscape.

09 November 2020

Every Line of Code Needs Love, Care and Attention

In our team we try to deploy to production every sprint. We run a platform composed of a number of microservices (hey, don't we all now) and so as each component progressively moves forward, at some point we can see that a feature emerges in production.

It's in the nature of our platform that many of our features are inward facing and so these small deployments tend to be safe - even if something goes wrong it is not a big issue, especially as we've put in the fundamental design groundwork to ensure that our components recover gracefully and generally without manual intervention. However, just recently we've had a couple of features with dependencies on external teams and we have been holding off the production release of individual stories pending integration testing. This big-bang approach has always been painful, and these large external facing features brought the issue of backward compatibility into focus, especially as other features are in progress at the same time - having lost the momentum of regular releases, I found the question of whether a single component can be safely released becoming more and more difficult to answer with 100% certainty.

At this point, we hit an example of where we had got this spectacularly wrong - so here was a great opportunity to learn something.




Here's the background - our platform's components for the most part talk to each other by producing or consuming events. These events have to evolve from time to time, so we have a 2 level version number in each event - top level for breaking changes, second level for non-breaking changes - eg 1.0. One of the key events in the system is produced by one component and consumed by three. Several months ago it had to be changed in a non-compatible way and so it went from version 1.0 to version 2.0, and then a few weeks ago we added some new properties and so it went up to 2.1. In theory the producing component could go live without affecting anyone, as this is a non-breaking change. However, as soon as we got into E1 we discovered this was not the case - 2.1 events could not be handled by the existing consumers. The result is that our producer is dependent on those other three components. It can't go live until their fixed versions go live. One of those components is outward facing, and as the external users' timescales have slipped, so has the release of that component and hence our event producer. Now we are in a world of pain. Other changes are backing up but can't be released. Each slippage requires discussions and meetings to understand the impact and the new order of releases. This is a dark place - we will emerge, but with scars!

So how did a backward compatible change become a breaking change? The answer is in one single line of code written a long time ago.

The three consumers of the event all share a common deserialiser library (okay, there's another discussion there but that's for another day). When the event went from 1.0 to 2.0 some custom code was added. The code that was written effectively said:

 

  1. if (event.version.equals("2.0")) {
  2.   // treat this as a 2.n style event
  3. } else {
  4.   // treat this as a 1.n style event
  5. }

 

You can probably see the problem already - line 01 means that a version 2.1 event gets treated as a 1.n style event and deserialisation fails.

Now I don't know who wrote this code, nor does it matter. With hindsight it is easy to see the problem, and the fix is trivial, but months ago when this was first written, it's an easy mistake to make, and mistake is probably even a strong word - all the unit tests passed and this code worked fine for months. I'm sure I've written worse things and will do so again.
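For the record, the trivial fix is to branch on the major version rather than on one exact version string - something along these lines (a sketch, assuming the same event shape as above):

// Look at the major version only, so 2.0, 2.1, 2.2... are all handled as 2.n style events
val majorVersion = event.version.substringBefore(".")
if (majorVersion == "2") {
  // treat this as a 2.n style event
} else {
  // treat this as a 1.n style event
}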




We could look at what happened, come up with a coding guideline about dealing with event versions, stick it on our internal wiki and move on, but that would be fairly futile - the same situation may not arise for months or years if ever, and it is missing the wider point.

The real thing to realise here is that although at the time of writing it was easy to miss the potential problem, it is also the case that it was entirely possible to recognise the potential problem. There are no unknowns here - looking at the line of code, the fact that it is referring to a two level version number means that implicitly we can know that at some point in the future the next thing that will happen is going to be either a minor version increment or a major version increment. So even with limited context it is clear what is inevitable, and the code could account for that without knowing exactly what the next change will be or when that will happen.

A key point is that some of the things that our code has to deal with are inevitable - for example future changes to requirements, failures when making outside calls, invalid input from outside. Every line of code that we write, however small and innocuous it may seem, needs to account for the things that are inevitable, whether that involves defensive treatment of input, adding useful logging, or ensuring the way we write branching statements accounts for inevitable change. This isn't the same as over-engineering - writing complex generic code to cater for hypothetical future requirements that may never happen; this is engineering for the inevitable.

One of the main ways of achieving this is to not rush. We don't need to. Our capacity to release useful functionality is not constrained by our ability to write lines of code. Given a list of problems, any of us could write hundreds, maybe thousands of lines of code within a sprint to solve those problems. Our capacity to release functionality is more often constrained by the need to deal with the fallout from previous decisions - looking into unexpected behaviour and trying to figure out what happened and why, trying to update code that has gradually deteriorated into a ball of spaghetti, trying to figure out the order of deployments because a single line of code turned a backward compatible interface change into a breaking change.

What marks us out as professional engineers is not our ability to write lines of code, it is our ability to consider each line of code and understand where it sits in the present and the future - how does it live alongside existing code, how will it cope with everything that the outside world and the future throws at it. We won't always get it right but recognising this means that we can always strive to do better. We need to give those lines of code the love and attention they deserve.

26 February 2020

Why is Pausing a Kafka Consumer so Complicated

If you have any Kafka consumers it is almost certainly a good idea to be able to pause them. There are a few reasons - for example a circuit-breaker situation where downstream processing has failed and you need to stop consuming more messages, or a rolling deployment on a horizontally scaled application where you don't want old and new code consuming messages at the same time, so a pause across the deployment is a good idea.

Fortunately the Consumer interface has both pause and resume methods, and because they are idempotent you might naturally approach pausing as follows: in my poll loop, check a flag (which can be set by another thread) to see if I should be paused. If so, call pause. If not, call resume.

Then we hit a problem - the pause and resume methods take an argument, and that argument is a Collection of TopicPartitions. I'm using automatic partition allocation, so how do I know what to pass? Ah, no problem, there's an assignment method that returns a Collection of TopicPartitions.

Okay now my code looks great:

while (true) {
  if (paused) {
    consumer.pause(consumer.assignment())
  } else {
    consumer.resume(consumer.assignment())
  }
  handleRecords(consumer.poll(timeout))
}

Lovely, all works perfectly - until a rebalance happens while paused. Now my consumers all get exactly one batch of records. What's going on?

It doesn't take much debugging to discover that after a rebalance my consumer is no longer paused when entering the while loop, and my call to assignment() returns an empty Collection so I pause nothing. Next time round the loop, assignment() returns a Collection of the partitions I'm subscribed to, as expected, so the code works from that point on, but too late to stop getting one batch.

So what happened? The first question is why my consumer was unpaused - well, this is because there is no pause flag at the consumer level - when we call pause on the consumer, that paused state is actually held by instances of TopicPartitionState indirectly referenced by the consumer, and over the course of a rebalance the existing instances associated with my consumer are ditched and new ones created - these may or may not relate to the same partitions at the broker end, but that makes no difference - they are brand new instances that know nothing about what has happened previously. Because of this, after a rebalance the Consumer and the new TopicPartitionStates have no awareness that they were meant to be paused, so it's up to application code to deal with that.

The second question is: why the empty Collection? That is simply about when the consumer knows about its assigned partitions. The call to assignment() doesn't trigger any request to the broker or consumer co-ordinator - it just reports back the current assignment. Communication with the broker really only happens in the poll method - in fact pretty much everything happens in the poll method. Unlike producers, consumer client code is single threaded [1] - when we talk about rebalance and assignment it is easy to think of these things going on in the background - but actually for any given consumer these things all happen sequentially within the poll method. Between my code calling poll, and that method returning control to my code, a lot is happening - the client needs to talk to the co-ordinator to find out if anything significant is happening in the group (eg a rebalance), it needs to make sure it has assignments, it needs to make sure it has up to date metadata (so it knows which broker is leader for each of its assigned partitions), it needs to check if it knows the offsets for each partition, and look up the last committed offset if not, it needs to see if it's time to commit previously polled records if auto-commit is on - oh, and of course it needs to get hold of some records too!

So over the course of a rebalance, one call to poll will lead to the discovery that all partitions have been revoked (removing all the associated TopicPartitionState objects), and some subsequent call to poll will discover that some partitions have been assigned. That call then goes on to fetch records from those freshly assigned partitions before returning to your application code.

Fortunately this is where the ConsumerRebalanceListener interface comes into play - if you add a ConsumerRebalanceListener when subscribing your consumer to a topic, it will get called during the execution of the poll method - when a rebalance has been initiated its onPartitionsRevoked method is called, and then after the rebalance has been completed, its onPartitionsAssigned method is called - and crucially this happens BEFORE the poll method checks if it is paused, which in turn happens before fetching records. Hence all you have to do is carry out a pause check in your RebalanceListener:

override fun onPartitionsAssigned(assignment: Collection<TopicPartition>) {
  if (paused) {
    consumer.pause(assignment)
  }
}
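For completeness, here is a sketch of how that gets wired up - assuming the consumer and the paused flag from the earlier snippet, and a topic name of your own:

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition

// The listener is registered at subscribe time and invoked inside poll()
val rebalanceListener = object : ConsumerRebalanceListener {
  override fun onPartitionsRevoked(partitions: Collection<TopicPartition>) {
    // Nothing to do - the pause state on these partitions is about to be discarded anyway
  }

  override fun onPartitionsAssigned(partitions: Collection<TopicPartition>) {
    // Re-apply the pause to the freshly created partition state
    if (paused) {
      consumer.pause(partitions)
    }
  }
}
consumer.subscribe(listOf("my-topic"), rebalanceListener)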

Simple when you know how, but a real gotcha when you don't. No surprise really, as the behaviour is just not intuitive. So why is it done that way?

Instinctively, it feels like pausing shouldn't need to know about the allocation of partitions - either my consumer is paused or it isn't. The Javadoc on KafkaConsumer gives some insight on why this may be a simplistic view. While I tend to have a different consumer for each topic, that is not a constraint of the Kafka Consumer, and it is reasonable to have a single consumer subscribed to multiple topics - this may in fact be quite common in streaming applications where input from several topics is processed and merged into a single output. In such cases it is possible that the consumption from one or more topics may lag behind others - this could lead to messages falling in different time windows. A reasonable way of handling that is by pausing consumption on the topics that are ahead, to allow others to catch up. While I've been thinking about partitions, it is of course a Collection of TopicPartitions that is used for pausing - each one a combination of TOPIC and partition (mind you, the Javadoc and variable names in KafkaConsumer consistently emphasise the partitions, pretty much ignoring the presence of topic)

However, this still doesn't explain why I can't just pause on topics - would there ever be a need to pause some partitions and not others? Admittedly if the producers used a very unbalanced key strategy then it might be necessary to allow consumption of busier partitions to catch up by pausing the sparser ones - the degree of granularity offered allows this. Maybe this is the reason for the design that holds pause state in TopicPartitionState, and that decision then inevitably leads to the loss of this state across a rebalance. But I would expect the need for this level of granularity to be uncommon, and while partition allocation varies across rebalances, topic subscription does not. It seems conceptually feasible to implement convenience pause methods that take in a Collection of topic names, or no arguments, and internally mark the topics as paused in the consumer - this state could then easily span rebalances - just being checked during poll at the same time that ConsumerRebalanceListener callbacks are invoked, and significantly less error prone for users who just want to pause all consumption, or consumption at the topic level.
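In the meantime, application code can approximate that topic-level convenience with a small helper - a sketch, and note that it only touches the currently assigned partitions, so the rebalance caveat above still applies:

import org.apache.kafka.clients.consumer.KafkaConsumer

// Pause/resume whatever partitions of the given topics are currently assigned.
// The state still lives in TopicPartitionState, so it does NOT survive a
// rebalance - you would still re-apply it in a ConsumerRebalanceListener.
fun <K, V> KafkaConsumer<K, V>.pauseTopics(topics: Set<String>) =
  pause(assignment().filter { it.topic() in topics })

fun <K, V> KafkaConsumer<K, V>.resumeTopics(topics: Set<String>) =
  resume(assignment().filter { it.topic() in topics })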

Although I'm considering a pull request, I'm slightly uncomfortable about having two different types of pause state, transient in the TopicPartitionState and more durable at the Consumer or Fetcher level. In principle that may not be so problematic as it is internal, but worse would be the consumer interface which would appear to have 3 overloaded pause methods differentiated only by granularity, when actually the partition level method would behave significantly differently in the event of rebalance. Such semantics could be made clear in the Javadoc, but it sounds like I'd be replacing an existing unintuitive interface with one that is inconsistent. Which is the lesser of two evils?

[1] actually it isn't quite single threaded - if it was it would be pretty inefficient, as the consumer would hang about doing nothing while your code processes a batch of records, and then your code would hang about doing nothing while the consumer received records from the brokers. Instead the background heartbeat thread is also pre-fetching continuously, so that when the poll() method asks for the latest records it is just getting the ones that have already been fetched since the last poll. However this detail doesn't have any impact on the sequence of processing within poll, and so was glossed over in this post.

23 February 2020

Kafka Acks Configuration - an Apology

I like Apache Kafka - it's a real engineer's tool. Like an Austin Maxi [1], it's built on simple, solid, reliable foundations, it will keep running smoothly as long as you're prepared to roll up your sleeves every now and then and have a tinker under the bonnet, but even then it can still surprise you at times. 

Because I like Kafka, I want other people to enjoy using it (as long as it's the right tool for them), so I try to be helpful whenever I can, especially if they have just recently picked it up and have lots of questions. Well, I thought I was being helpful, but actually it turns out that for a while I've been casting false accusations, and lying through my teeth - I'm actually not the nice helpful person I thought I was. 

First the lies - the problem is that I see lots of teams who adopt Kafka and they really, really want to override configuration - "what should I set my consumer connections.max.idle.ms to?", "What should I use for my producer request.timeout.ms?" The trouble is, NONE of the producer or consumer configuration properties should be touched without understanding what they do and what the implications are of changing them, especially as some really need to be adjusted together - like flying a helicopter, and for that reason I ALWAYS tell people to go with the defaults - they're sensible, don't change them unless you need to. And then the accusations - I hear people say "I've lost messages, what happened? Why did Kafka lose them?". To those people I ALWAYS say - "hey, Kafka doesn't lose messages, I'll bet the problem is in your consumer code - you're consuming but then silently discarding messages or not processing them for one reason or another." 

Then our team started losing messages, and I discovered that I was wrong on both counts, for the usual reasons - assumptions and a failure to properly understand the documentation. 

My first incorrect assumption is that Kafka always guarantees at least once delivery - if you publish a message it will always be consumed at least once, regardless of what happens. The second, and closely related assumption, is that default configuration will always serve you well. 

The problem is that Kafka is flexible: if you can afford to lose a few messages, then you can use Kafka for efficient and phenomenally high throughput; if you can't afford to lose any messages, then Kafka can handle that too, but at the cost of lower throughput. Because of this flexibility, default configuration CAN'T serve both purposes - and as I discovered to my cost, the default producer configuration DOES NOT guarantee at least once delivery. The reason is one small, innocuous configuration - 'acks'.

There are 3 different values (strictly 4 but one is an alias), and unlike many configurations where an understanding of client code is enough, to understand the implications of your choice on acks you really have to understand what is happening at the broker end. 

Now if you are short of time or don't like reading, I'll cut to the chase - guaranteed delivery? choose acks=all, not too fussed if you lose a few? choose acks=1, really don't care too much but just want to shove through as many messages per second as your weedy little server can possibly handle? choose acks=0. 

However, feel free to read on to look at what these really mean in a little bit more detail, and to understand the implications of your choice. 

Now when I started losing messages, after a bit of due diligence on our client code, it became clear that the problem was probably at the producer end, and acks might be the culprit. We had gone with the default setting of 1, and the documentation states - "This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost." 

My initial naive interpretation concentrated on the first half of the sentence and took it to mean that as long as the leader writes to disk we are all okay, the message isn't lost, it's on disk. So, I started to look a bit at what 'written to local log' really means. Kafka just delegates to the Operating System and Operating Systems cache data rather than persisting it straight away - flushing to disk when it makes sense. So was that the issue? The leader being shut down unexpectedly with data still in memory? Then it dawned on me that I was focusing on the wrong bit of that sentence in the documentation, writes to disk are not the important point here - the key is replication - it doesn't really matter one bit if the leader has stored the messages persistently on disk. 

So how does replication work in Kafka? Unless you perversely choose no replication (which you are free to do), each partition is hosted by multiple brokers, and however many replicas you have (3 is a pretty standard number) one broker is leader and the rest are followers. When we send messages to Kafka we only ever talk to the leader of the partition - the producer client code figures out which partition a message belongs to, and uses cached metadata to find out which broker is the leader for that partition. The followers are only there as a safety net, and they are essentially just Kafka consumers - they are continuously polling the leader, and like any consumer, each time they fetch they tell the leader their current offset (or rather the next message they want to receive). In this way the leader knows where all the followers have got to, and hence which ones are keeping up (in-sync) - by default the definition of keeping up is that they are effectively less than 10 seconds behind the leader. So, the leader knows which messages have definitely been replicated, and which ones may not have been, and as the leader is the only broker that consumer clients talk to, it can ensure that they only see messages that have been replicated. This is the definition of the high watermark - the high watermark is NOT the head of the partition, as I used to think. There are likely to be other messages ahead of it, but it is the latest message that is visible to consumers. 

Now, if a follower dies at any point, the cluster just carries on - the in-sync replica set shrinks until that broker comes back (or the partition is reassigned) and catches up again - in normal times followers aren't that important. If on the other hand the leader dies then one of the in-sync followers will be elected leader, and start servicing producers and consumers. BUT hang on - that follower might be up to 10 seconds behind the old leader, and that can cover an awful lot of messages that are no longer available. What happens about that?

This is where acks come in - the messages have gone as far as the cluster is concerned, but acks determines whether the producers get to know about it. Actually, with no acks, the producers don't even know if their messages reached the leader - the servers may have all caught fire but those producers will cheerfully keep throwing messages into the flames. With just one ack, the producers don't care about replication, they are happy when the leader has the message. In this scenario, lots of producers may be sending messages without a care in the world, not knowing that behind the scenes followers have started to lag - maybe there's a bit of temporary network latency, or maybe the cluster is always like that. If at that point a server gets restarted - maybe it's time for some patching - then one of the in-sync followers is going to take over as leader, but the leader that has now gone may have successfully stored 9 seconds worth of messages that have not yet been fetched by the new leader, so they are gone. No consumer will ever see them, and no producer knows about it. Even if the old leader comes back to life, that 9 seconds worth of messages is useless - the world has moved on, they can't just be made available as if nothing had happened. 

With all acks, this situation won't cause any loss - the messages you sent will not be acknowledged until all followers[2] have sent a fetch request with an offset later than those messages, and if followers or the leader die, the producer client will time out before getting that acknowledgement, and retry. The downside is that even when everything is fine, if the followers are 9 seconds behind the leader then that's how long your producer will wait for an acknowledgement. So now you might be asking incredulously if your application is going to be regularly blocked for up to 10 seconds (or more if there are elections going on at the broker end). Relax, it's not your application code that is blocked - unlike consuming, producing is asynchronous for this very reason: the producer send method returns immediately, and a separate thread continuously sends batches of messages to the brokers from a buffer. A temporary issue at the broker end will just cause the buffer to fill up a bit. If everything is really grinding to a halt, then the buffer will fill up, and send method calls will fail immediately, so your application will know about it. And what to do about that? Well, now you have a problem to solve - if there's a temporary spike that has overwhelmed the server end, you probably want to open a circuit breaker at this point and fail over to some other strategy - maybe store pending messages somewhere until the situation eases.

So how likely is any of this? Are we worrying about something that probably won't happen? Well, the question is really - do you EVER restart a server? For resilience, Kafka spreads partitions across brokers on different servers and racks, and so each server will host a mix of different partitions, acting as follower for some and leader for others - or to put it another way, EVERY server is a leader for at least one partition, and so restarting a server WILL almost certainly lose messages ahead of the high watermark, whether they exist on disk or not. That is why acks=1 will pretty much inevitably cause message loss.
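So, for what it's worth, here is a minimal sketch of a producer configured for guaranteed delivery (the bootstrap address and serializers are illustrative, and min.insync.replicas still needs to be set sensibly at the topic or broker end - see [2]):

import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.common.serialization.StringSerializer
import java.util.Properties

val props = Properties().apply {
  put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
  put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java.name)
  put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java.name)
  // Wait for replication, not just the leader's local log
  put(ProducerConfig.ACKS_CONFIG, "all")
  // Retry rather than silently give up on transient broker problems
  put(ProducerConfig.RETRIES_CONFIG, Int.MAX_VALUE)
}
val producer = KafkaProducer<String, String>(props)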

[1] My first car was a beautiful Austin Maxi - I remember rebuilding the starter motor from the remains of two starter motors I found at the scrap yard - the days when cars were made to be messed about with. I also remember having to push the thing to a petrol station when the fuel gauge decided to play up - they weigh a ton! 

[2] A slight detail here is that we don't always have to wait for all the followers - min.insync.replicas is a broker end configuration that defines what is good enough for all acks, in the case where one or more followers have dropped out of sync. With 3 replicas (1 leader and 2 followers) it is common to have a minimum of 2 in sync. This means that when both followers are in-sync the leader will wait for both (even if one of them is 1 second behind and the other is 9 seconds behind). But if one follower has fallen behind far enough to be considered out of sync, the leader doesn't wait for the partition to be picked up by a new follower, and so the producer doesn't wait either - it effectively trusts that the leader and still in-sync follower won't both fail before everything gets back to normal - here lies the only risk if min.insync.replicas is lower than the total replicas.