Event-Driven Dumpsterfire
I started my career on a backend team building services that communicated through queues. While some had SOAP endpoints (if anyone remembers those), and eventually REST and gRPC endpoints, the bulk of my time was spent on all the asynchronous message passing.
I loved it. Async messaging comes with many benefits: your servers don’t need to be highly available, you can retry messages, and you can easily protect your dependencies by putting backpressure on your message queue or bus. Then of course—my personal favorite—you usually don’t need to wake up at 3am if something breaks because the messages will still be there to process in the morning!
We were using queue technologies like RabbitMQ, and then eventually SQS, to send tasks from one system to another. Since they were intended for one consumer and were telling the consumer what to do, you could think of them like a distributed implementation of the command pattern.
Moving on to my next job, I got the chance to expand my async messaging repertoire. We were using Kafka and SNS to send messages, but with a small twist. Instead of queueing them up for one receiver, we were creating multiple streams to multiple receivers. And along with that, the receivers no longer defined the message shape and schema; instead, the publishers dictated the events, and subscribers chose which ones to listen to. I was thrilled for the chance to work in this powerful new paradigm, which is usually referred to as event-driven architecture (EDA).
Of course, power begets powerful footguns. EDA is wonderful for all the aforementioned reasons that async messaging is wonderful. It’s also great at decoupling components, which brings a myriad of other benefits. But it’s not straightforward to just start using EDA. The company made some mistakes early on that continued to bite us a few years down the road. The next company I went to was just starting to tinker with EDA, and they were making the same mistakes. After watching this unfold, contributing to the mess, then helping clean it up, I think I know why. The reason is subtle, but understanding it can help eliminate footguns before they’re loaded.
What makes EDA hard?
So what makes EDA difficult? I’ve watched two companies stumble over it, while three companies I’ve seen got command queueing mostly right, even though both are forms of async message passing. Why is that? I think there are two main issues an engineer has to deal with in EDA: accounting for arbitrary message ordering, and developing the right message schema.
First, message ordering is troublesome because there are no guarantees about timing. Events are published and consumed asynchronously, and if there are many events ahead of your event in line, it could take a while for the subscriber to receive the message. So subscribers should always expect events will take an arbitrary amount of time to arrive. Unfortunately, that means a system subscribed to multiple event streams can’t assume any message ordering between those streams. Within a single stream publishers can enforce an order that subscribers can rely on, but if there are separate streams from separate services, no ordering can be enforced. That introduces the possibility of race conditions if a subscriber needs data from both streams for a particular behavior.
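To make that race concrete, here’s a minimal sketch of a subscriber that needs an event from each of two independent streams before it can act. All the names and payload fields are hypothetical; the point is that because either event may arrive first, the handler has to hold partial state rather than assume an order:

```python
# Hypothetical subscriber joining two streams ("order created" from one
# service, "payment received" from another). Neither stream's timing is
# guaranteed, so each handler stores whatever half it has and only acts
# once both halves are present.

pending = {}  # order_id -> partial state accumulated so far


def handle_order_created(event):
    state = pending.setdefault(event["order_id"], {})
    state["order"] = event
    return try_fulfill(event["order_id"])


def handle_payment_received(event):
    state = pending.setdefault(event["order_id"], {})
    state["payment"] = event
    return try_fulfill(event["order_id"])


def try_fulfill(order_id):
    state = pending[order_id]
    if "order" in state and "payment" in state:
        del pending[order_id]
        return f"fulfilled {order_id}"
    return None  # still waiting on the event from the other stream
```

Note that real subscribers also need to persist this partial state and handle duplicates; even this toy version shows how much bookkeeping the arbitrary ordering forces onto the consumer.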
Second, since publishers don’t know their subscribers, there aren’t clear constraints on the event schema. It’s easy to make a schema that doesn’t actually help any subscribers take action. If it doesn’t have enough fields, or the right fields, the engineers working on the subscribing services usually come knocking on the publishing service team’s door asking for changes. And since being asked to add fields is an annoying interruption, teams start including too much data to avoid being asked to add more. Event schemas almost always end up bloated, and worse than that, they frequently expose internal details that should’ve stayed encapsulated.
But wait a second, you might be thinking, don’t these issues apply to command queueing as well? There aren’t guarantees about timing with async commands, and you still need to design a message schema, so surely it could be done wrong. How is EDA harder?
That’s a good question, and it’s one that takes us into the subtle heart of the matter. The real difficulty isn’t the async nature of the messages: it’s the fact that events subvert our deeply-ingrained understanding of communication interfaces. Commands are easier because they’re very similar to a traditional client/server API model. Events, however, flip the traditional model on its head.
Commands and events as APIs
Look closely at command queueing: the commanding service knows about the service it wants to send a command to, similar to how a client knows about a server. The server has an API it exposes, allowing clients to send various requests with particular parameters, the same way a service might expose commands. The control flow points the same direction as the dependency: the client (commanding service) depends on the server (service being commanded) and the client tells the server what to do according to the API the server has provided.
With events though, the control flow points in the opposite direction of the dependency. While the server (publisher) still exposes an API of sorts—i.e. a set of event schemas representing actions on the server—and the client (subscriber) still depends on the server and consumes the API, the server is actually dictating what the client receives and therefore what it does. The client is not telling the server what to do in EDA.
Coming back to our problems—accounting for arbitrary message ordering and designing a good message schema—we can start to see why these tend to be mitigated in command queueing, but are big issues for EDA.
With the traditional client/server model, request ordering is the client’s responsibility. This can be handled easily for commands with a FIFO queue, allowing the command publisher to guarantee that messages are received in the order they’re sent. It can also be handled without a FIFO queue, by simply rejecting commands with an error if there’s something incompatible about their ordering. This allows other commands through, and then by retrying failed ones, eventually the correct ordering can be achieved. Note that both of these strategies are very similar to how we’d deal with request ordering in the synchronous client/server context; either the client waits for success responses before sending the next request (i.e. FIFO), or the client tries other requests then retries failed ones when faced with errors like 400s, 404s, or 409s from the server.
With EDA however, request ordering might still be the client’s (aka the subscriber’s) responsibility—in the sense that only the client knows what the ordering needs to be—but the client cannot dictate how messages arrive. Publishers might send events in order, or they might not. And if there are multiple publishers involved, there are certainly no ordering guarantees. Subscribers can’t respond with errors when messages are out of order, since the publishers don’t know who the subscribers are and aren’t listening for responses from them. This forces subscribers to track the events they’ve received, or otherwise handle ingesting partial data in an arbitrary order, which is a much more complicated chore that doesn’t have simple, established patterns.
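One common coping mechanism is for the subscriber to track a version (or timestamp) per entity and discard events that arrive stale. A minimal sketch, with hypothetical field names:

```python
# Subscriber-side staleness check: each event carries a monotonically
# increasing version assigned by the publisher. The subscriber remembers
# the last version applied per entity and skips anything older, so
# out-of-order or duplicate deliveries can't clobber newer state.

latest = {}  # entity_id -> last event applied


def apply_event(event):
    """Apply an event unless a newer one for this entity was already seen."""
    current = latest.get(event["entity_id"])
    if current is not None and current["version"] >= event["version"]:
        return False  # stale or duplicate; skip it
    latest[event["entity_id"]] = event
    return True
```

This only works when the publisher can supply a version, and it handles staleness within one stream, not coordination across streams—which is exactly why the cross-stream case remains the hard part.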
Apart from ordering issues, there’s also the problem of making a good message schema. For a normal API, or for commands, this is simple: the action the client wants to take can be parameterized in certain ways, therefore those parameters need to be exposed. The server doesn’t need more info than that. So a command schema is simple: if the command is “Create Object”, the schema only needs the fields necessary to make that object.
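That “Create Object” command might look something like this (the specific fields are hypothetical): nothing in the schema beyond the parameters the server needs to perform the action.

```python
from dataclasses import dataclass

# A command schema only needs the parameters of the action it names.
# "CreateObject" follows the article's example; the fields are made up.
@dataclass(frozen=True)
class CreateObject:
    object_id: str
    name: str
    owner: str  # every field here is something the server needs to act
```

The client knows exactly what it wants done, so the schema writes itself.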
That simplicity goes away when we look at it from the EDA perspective. Thanks to the inversion of control, the publishers have to supply the data any subscriber might need to take meaningful action. That leads to some hard choices about what to include, and the easiest way out of hard choices is to default to including everything and the kitchen sink in the message schemas. At first glance this also looks like the right thing to do for the sake of the subscribers, since it gives them maximal data and context. In reality though, this tends to break encapsulation, making it very hard to change the internal details of the publisher down the road without introducing breaking changes to the message schema. So we end up in a place where teams either need to make hard choices about what to include in the schema, or they avoid the hard choices and then end up with a hard migration down the road. Either way it can amount to a lot more work.
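To illustrate the tradeoff with two hypothetical schemas: compare a kitchen-sink event that dumps the publisher’s internal record wholesale with a leaner one that names the fact that occurred and only the identifiers subscribers need.

```python
# Kitchen-sink version: exposes the publisher's internal row layout, so
# renaming an internal column becomes a breaking change for subscribers.
fat_event = {
    "type": "OrderPlaced",
    "db_row": {"ord_id": 42, "cust_fk": 7, "status_enum": 3, "shard": "us-1"},
}

# Leaner version: states what happened plus stable identifiers, keeping
# the publisher's internals encapsulated and free to change.
lean_event = {
    "type": "OrderPlaced",
    "order_id": "42",
    "customer_id": "7",
    "placed_at": "2021-06-01T12:00:00Z",
}
```

The lean schema may send subscribers back for more data occasionally, but it’s the one the publishing team can still evolve a few years later.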
What do we do about it?
So, we see the problems: it’s hard to ingest events from multiple streams without race conditions, and it’s hard to publish events with a schema that balances providing enough data while also keeping internal details internal. We also see the problem under the problems: events subvert our typical mental model of service communication because the direction of dependence no longer determines the direction of control. It’s flipped, and so all of our preexisting strategies and patterns for solving these problems won’t work. If we rush headlong into EDA without realizing that crucial but subtle fact, we’re going to end up with a dumpsterfire a few years down the road.
I mentioned I worked at a couple places using event-driven architecture. At the first, I was definitely guilty of causing a few race conditions, and I certainly penned a bad message schema or two. But because I also witnessed the problems this causes—as I dealt with very difficult-to-reproduce bugs, and dragged my feet through a publishing service refactor—I was able to understand the real problem and come up with strategies to handle it. I helped build shared libraries for message publishing and consumption, and was privileged to co-author the company-wide RFCs for both event and command high-level schemas. That experience came with me to my next job, where I was able to help head off certain mistakes before they went too far.
I’ll cover those experiences, and the resultant strategies and patterns, in two follow-up articles: one dealing with event ordering, and another with event schemas. Stay tuned!