Event sourcing makes errors worse

Mark Green
Campaign Monitor Engineering
5 min read · Oct 11, 2016


So you’re microservicing all the things, and event stores are the all-in-one persistence and communication mechanism that’s going to make it all come together? Great! I want to talk a bit about what happens when you want to process an event, but you can’t.

Perhaps some downstream service is down, or rejecting requests due to high load. Worse, maybe the event is malformed or contains invalid data.

It’s worth a quick chat about why we would use events as our source of truth. The experts often sum it up as “turning the database inside out”. If that doesn’t make sense, take a moment to listen to Martin Kleppmann; he does a great job of explaining the concept.

What’s important here is that we’re intentionally not using a general purpose database (which under the covers contains a general purpose event store — the transaction log) so that we can implement a special purpose event store — containing events pertaining to our business domain.

The benefit is that we can build and optimise for the special case, and we don’t need to make concessions to the general case.

The downside is that we have to explicitly cover the edge cases of our specific case, rather than having them covered off through more general-purpose constructs like deadlocks, primary key violations etc.

Your shiny new event-sourced microservice system is effectively a big, distributed, special purpose database with its own special purpose transaction log.

I want to chat quickly about commands before we get to the crux of this little chat.

It might seem that the only difference between commands and events is how you word them.

Commands are in the imperative: upgrade_customer_to_VIP.
Events are in past tense: customer_upgraded_to_VIP.

There’s more to it than that though. The key thing about events as opposed to commands is that you can’t say no. That thing has happened, you have to deal with it now.
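That distinction can be made concrete in a few lines of code. This is a minimal sketch, not from the article; the type names follow the upgrade-to-VIP example above, and the eligibility check is an illustrative assumption:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UpgradeCustomerToVIP:   # command: imperative, may be rejected
    customer_id: str

@dataclass
class CustomerUpgradedToVIP:  # event: past tense, a fact that already happened
    customer_id: str

def handle_command(cmd: UpgradeCustomerToVIP,
                   eligible: set) -> Optional[CustomerUpgradedToVIP]:
    # A command handler is allowed to say no.
    if cmd.customer_id not in eligible:
        return None  # rejected: nothing happened, no event is raised
    return CustomerUpgradedToVIP(cmd.customer_id)

def handle_event(evt: CustomerUpgradedToVIP, vip_customers: set) -> None:
    # An event handler cannot say no: the upgrade already happened,
    # so all it can do is update its own state to reflect that fact.
    vip_customers.add(evt.customer_id)
```

Note the asymmetry: the command handler has a rejection path, while the event handler's only job is to catch up with reality.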

Now that I’ve brainwashed you into thinking about the value of event sourcing the same way as me, let’s discuss error handling ;)

I’ll start with the worst cases: Events which are malformed or contain invalid data. In the same way that SQL rejects primary key violations, or nulls where they shouldn’t be, or strings where integers should be — you need to validate inputs. That means the event must be valid when it’s raised (within the eventual consistency constraints you build into whatever UI/API is accepting requests and generating events — that subject could be a whole separate post…).
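As a sketch of what “validate when it’s raised” might look like, here is a hand-rolled check guarding an in-memory event log. The field names and the dict-based event shape are assumptions for illustration; a real system might use JSON Schema or typed event classes instead:

```python
# Fields every event of this (hypothetical) type must carry.
REQUIRED_FIELDS = {"customer_id", "order_id", "amount"}

def raise_event(store: list, event: dict) -> None:
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        # Reject at the door -- the event-store equivalent
        # of a NOT NULL constraint.
        raise ValueError(f"invalid event, missing fields: {sorted(missing)}")
    if not isinstance(event["amount"], int) or event["amount"] < 0:
        # ...and the equivalent of a type/check constraint.
        raise ValueError("amount must be a non-negative integer")
    store.append(event)  # only valid events ever reach the log
```

Once an event passes this gate and lands in the store, it is a fact; everything after this point is the subscriber's problem.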

What if it’s no longer valid a short time later, when it’s processed by a subscriber? Great question, I’m glad you asked. Let’s remember — events have already happened, you can’t say no. That means you have a few options on the table at this point:

  1. Your subscriber effectively stops all processing on the stream of events, throws up its hands and says “I need a human please”. Not ideal at 2am.
  2. You decide you can live with out-of-order handling for this edge case, put that event aside, and say “I need a human please”, while continuing to process events.
    This has the problem that you can no longer process that event — but that’s OK because a human can apply compensating actions (probably via raising more events).
  3. You raise another event which says that an edge case has happened, and some other subscriber fires off a bunch of events or commands to compensate.

Compensating actions could be refunding a sale we can’t process (perhaps we had stock when the order arrived, but now we don’t), cancelling an account for a credit card we couldn’t validate, etc.
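Option 3 can be sketched as a subscriber that never rejects an event, only raises new ones. This is a minimal illustration under assumed names (the event types, the in-memory stock dict, and the `order_placed`/`order_fulfilment_failed` vocabulary are all invented for the example):

```python
def process_order_placed(event: dict, stock: dict) -> dict:
    """Handle an order_placed event; always returns a follow-up event."""
    item, qty = event["item"], event["qty"]
    if stock.get(item, 0) < qty:
        # We can't say no to order_placed -- it already happened.
        # Instead, record the failure as a new fact, for a compensating
        # subscriber (e.g. one that issues refunds) to act on.
        return {"type": "order_fulfilment_failed", "item": item,
                "qty": qty, "reason": "insufficient_stock"}
    stock[item] -= qty
    return {"type": "order_shipped", "item": item, "qty": qty}
```

The same shape works for option 2: just make the compensating subscriber a page to a human instead of another service.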

I would argue that option 1 above is basically off the table; there is no good way to make it work that doesn’t involve modifying events (a big no-no, they already happened!).

Options 2 and 3 are effectively the same; the only difference is whether a human or another microservice is doing the compensating. Starting with humans is always cheaper and gives you a better chance of building in the correct compensating actions. It does require you to build into your system some way to find out when a human is needed.

A quick note — this “I need a human” function might sound a lot like the dead letter queue provided by many message broker systems. There is a difference! Nothing can be moved back into the “queue” (the event stream), and you won’t be “moving” the problematic event off to the side either, as other subscribers may have handled it with no problems, and order is crucial in event sourcing. Instead, you’re electing not to do what you’d normally do for that particular event. The only way forward from an error where events are concerned is to raise more events.

Now that we’ve got the worst case out of the way, it’s quite easy to discuss downstream systems/infrastructure being unavailable.

You’ve really got similar options here — the question is how you identify “transient” errors (errors which will go away eventually, for the same operation), and how long you keep trying before you decide compensating actions are the appropriate response.

The key thing to keep in mind here is that you want to be able to distinguish between transient errors for this specific event, and transient errors where processing any event that uses this downstream system is going to fail. That’s where circuit breakers and extensive telemetry come in handy. You’re effectively riding a fine line between slowing down the whole system and hitting your edge-case handling with a much higher volume than it normally would see. Your solution will be very specific to your domain and infrastructure (as you might expect, given that you’re building a distributed, purpose-built database).
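To make the per-event vs systemic distinction concrete, here is a crude sketch combining retries with a consecutive-failure circuit breaker. The threshold, the retry count, and the “deferred”/“compensate” outcomes are assumptions for illustration (and real backoff between retries is elided):

```python
class CircuitBreaker:
    """Trips after N consecutive failed events -- a sign the downstream
    itself is unhealthy, not just one event."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

def process_with_retry(call, breaker: CircuitBreaker, retries: int = 3):
    if breaker.open:
        # Many events in a row have failed: the downstream looks broken,
        # so pause the stream rather than compensating every event.
        return "deferred"
    for _ in range(retries):
        try:
            result = call()
            breaker.record(success=True)
            return result
        except Exception:
            pass  # backoff between attempts elided for the sketch
    # Retries exhausted for this one event: treat it as the edge case
    # and hand it to the compensating path (more events, or a human).
    breaker.record(success=False)
    return "compensate"
```

A handful of compensations is fine; a flood of them means the breaker threshold is doing its job of telling you the problem isn't the events at all.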

You may be thinking to yourself at this point:

Geez, this makes event sourcing sound quite expensive to apply, I have to model and build far beyond the happy path.

You are correct. Advocates for DDD are always quick to add “but only do it where you derive competitive advantage”. This little article is merely a glimpse into why it’s not worth doing for parts of your system which don’t drive competitive advantage.

If you want to lower your investment — use commands, or the publish/subscribe pattern, and relax constraints like ordering and persistence. You can then take advantage of general purpose infrastructure for the edge cases (message brokers, relational DBs etc), and spend your engineering effort shoehorning your requirements into that general purpose infrastructure instead.

Really, this is a long-winded way of making a single point:

When events are your source of truth — the only way out of a problem is forward, with more events. In the words of the fine folks behind geteventstore.com: there is no delete.


Software Architect @CampaignMonitor. Software craftsman, motorcyclist, gamer, father.