Don’t Fail Publishing Events!

Consistency is critical when working with an event-driven architecture. You must ensure that when you make state changes to your database, the relevant events you want to publish are published. You can’t fail publishing events. Events are first-class citizens, and when events drive workflows and business processes, they rely on this consistency between state changes and published events.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Atomicity

Here’s an example of placing an order. At the very bottom of this method, we publish an OrderPlaced event and then save our state changes to the database.
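Roughly, a sketch of the problematic version might look like this (minimal TypeScript with hypothetical Database and MessageBroker interfaces, not the actual code from the post):

```typescript
// Hypothetical placeholder interfaces for illustration.
interface Order { id: string; }
interface Database { save(order: Order): Promise<void>; }
interface MessageBroker { publish(event: object): Promise<void>; }

async function placeOrder(db: Database, broker: MessageBroker, order: Order): Promise<void> {
  // ... validation and other business logic ...

  // Problem: the event is published BEFORE the state change is persisted.
  await broker.publish({ type: 'OrderPlaced', orderId: order.id });
  await db.save(order);
}
```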

These last two lines are problematic. The first reason this is an issue is that we have a race condition: we could process the event before the state is saved to the database. This is even more likely if there is any code between publishing the event and saving the database changes.

To illustrate this, first we publish the event to the message broker.

Race Condition

Once the event is published to the broker, we could have a consumer immediately process that message.

Race Condition

If this consumer were within the same logical boundary as the publisher, say it’s responsible for sending out the confirmation email, it might reach out to the database to get the order details.

Race Condition

But since we haven’t yet saved our database changes, the order won’t exist yet in our database.

Finally, the last line saves the order to our database.

Race Condition

But what happens if saving the order to our database fails? Now we’ve published an event (which is a fact) stating that an order was placed, but really, an order wasn’t placed because we didn’t save it.

If we have downstream services that are a part of the workflow, this is misleading and could have many different implications and failures in different services.

We need to flip the two lines around to avoid a race condition and not publish an event without saving the order.
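Reusing the hypothetical interfaces from the sketch above, the corrected ordering looks like this:

```typescript
async function placeOrder(db: Database, broker: MessageBroker, order: Order): Promise<void> {
  // Save first, then publish: no consumer can react to an order
  // that doesn't exist in the database yet.
  await db.save(order);
  await broker.publish({ type: 'OrderPlaced', orderId: order.id });
}
```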

We still have an issue. Suppose we save the order and fail to publish the OrderPlaced event to our broker. If we have downstream services that are part of a workflow, they’ll never know an order was placed.

We can’t fail to publish an event if we have a state change.

Fallbacks

One solution is to have a fallback. If we can’t publish to the message broker, we have another place to store the event durably.

In my post on McDonald’s Journey to Event-Driven Architecture, I covered how they used a fallback for exactly this.

Fallback Storage

In the example with McDonald’s, they used DynamoDB as their fallback storage. So if they could not publish to their message broker, they would save the event in DynamoDB. I also reviewed Wix.com – 5 Event Driven Architecture Pitfalls, where they used AWS S3 to save events in case of this type of failure.

From there, you’d have some retry mechanism that would pull the events from your durable storage and try to publish them to your broker.

Fallback Retry

As an example, you could use a retry and fallback policy.
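Here’s a rough sketch of such a policy (the DurableStore interface is a hypothetical placeholder for something like DynamoDB or S3):

```typescript
// Hypothetical durable store (DynamoDB, S3, etc.) used as the fallback.
interface DurableStore { save(event: object): Promise<void>; }

async function publishWithFallback(
  broker: MessageBroker,
  fallback: DurableStore,
  event: object,
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await broker.publish(event);
      return; // published successfully
    } catch {
      // broker unavailable; retry
    }
  }
  // All retries failed: park the event in durable storage so a separate
  // process can retry publishing it later.
  await fallback.save(event);
}
```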

The downside with a fallback is you have no guarantee that you’ll even be able to save that event to durable storage if there’s a failure to publish the event to your broker. There’s no guaranteed consistency.

Outbox Pattern

Another solution, which I’ve talked about before, is the Outbox Pattern: Reliably Save State & Publish Events.

This allows you to make state changes to your database and save the event to the same database within the same transaction. Your event data would be serialized into an “outbox” table/collection in the same database as your business data.
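As a minimal sketch, assuming a hypothetical transactional database API:

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical transactional database API.
interface Tx { insert(table: string, row: object): Promise<void>; }
interface TxDatabase { transaction(work: (tx: Tx) => Promise<void>): Promise<void>; }

async function placeOrder(db: TxDatabase, order: Order): Promise<void> {
  await db.transaction(async (tx) => {
    // Business state and the serialized event are committed atomically.
    await tx.insert('orders', order);
    await tx.insert('outbox', {
      id: randomUUID(),
      type: 'OrderPlaced',
      payload: JSON.stringify({ orderId: order.id }),
      publishedAt: null,
    });
  });
}
```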

Outbox

Then you have a separate process that reads that “outbox” table/collection and deserializes it into an event.

Then it can publish that event to the message broker. If there are any failures in publishing the event, the publisher would simply keep retrying.

Outbox Publisher

Once the event is successfully published, the publisher must update the database to mark that event in the outbox table as being published.

Outbox Publisher

If there is a failure to update the outbox table, this will result in the publisher publishing the same event more than once, which requires consumers to be idempotent.
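Putting the publisher side together, a rough sketch (again with hypothetical helpers) might look like this:

```typescript
// Hypothetical outbox access helpers.
interface OutboxRow { id: string; payload: string; }
interface OutboxDb {
  fetchUnpublished(limit: number): Promise<OutboxRow[]>;
  markPublished(id: string): Promise<void>;
}

async function runOutboxPublisher(db: OutboxDb, broker: MessageBroker): Promise<void> {
  while (true) {
    for (const row of await db.fetchUnpublished(10)) {
      try {
        await broker.publish(JSON.parse(row.payload));
      } catch {
        continue; // broker unavailable; the row stays unpublished and is retried
      }
      // If this update fails after a successful publish, the event gets
      // published again on the next poll: at-least-once delivery.
      await db.markPublished(row.id);
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000)); // poll interval
  }
}
```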

The downside to the outbox pattern is that you’re adding more load to your primary database, since the publisher is continually polling (and updating) the outbox table.

Workflows

There are also workflow engines that provide guarantees around executing the parts of a workflow. Let’s say we have a workflow with three distinct activities: Create Order, Publish OrderPlaced Event, and Send Confirmation Email.

Each of these activities executes independently and in isolation. The first to execute creates and saves our order to the database.

Workflow

After the Create Order activity completes, the Publish Event activity will execute to publish the OrderPlaced event to our broker.

Workflow

Now, if there’s a failure to publish the event, this activity could retry, or there are various other ways to handle the failure depending on your tooling. Once the activity succeeds, the workflow moves to the next activity, which sends out the confirmation email.

Workflow

The key is that each activity is guaranteed to run. If the Create Order activity completes, our Publish Event activity will execute. This eliminates the need for a fallback or an outbox.
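As a rough illustration of the idea (the WorkflowContext API is hypothetical; engines like Temporal provide their own equivalents):

```typescript
// Hypothetical workflow-engine API for illustration only.
interface WorkflowContext {
  // Runs the activity with retries and durably records its completion,
  // so it isn't re-executed after a crash or restart.
  execute(name: string, activity: () => Promise<void>): Promise<void>;
}

declare function saveOrder(order: Order): Promise<void>;
declare function publishOrderPlaced(orderId: string): Promise<void>;
declare function sendConfirmationEmail(orderId: string): Promise<void>;

async function placeOrderWorkflow(ctx: WorkflowContext, order: Order): Promise<void> {
  await ctx.execute('CreateOrder', () => saveOrder(order));
  await ctx.execute('PublishOrderPlaced', () => publishOrderPlaced(order.id));
  await ctx.execute('SendConfirmationEmail', () => sendConfirmationEmail(order.id));
}
```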

Your Mileage May Vary

Two simple lines of code can have a large impact on the consistency of your system. Don’t fail publishing events! As you can see, there are different ways to reliably publish events and save state changes, and which you choose will depend on your context. Hopefully, you can see the trade-offs of each and which will fit best for you.

Join!

Developer-level members of my YouTube channel or Patreon get access to a private Discord server to chat with other developers about Software Architecture and Design and access to source code for any working demo application I post on my blog or YouTube. Check out my Patreon or YouTube Membership for more info.

Follow @CodeOpinion on Twitter

Software Architecture & Design

Get all my latest YouTube Videos and Blog Posts on Software Architecture & Design

Wix.com – 5 Event Driven Architecture Pitfalls!

Wix.com migrated from a request-reply RPC-style system to an event-driven architecture and, not surprisingly, ran into a few issues. One of their developers wrote a blog post outlining five event-driven architecture pitfalls they experienced. Here’s my review of that post, which hopefully sheds more light on their problems and solutions.


Reliable Publishing

When using an event-driven architecture, you’ll be publishing events as a communication mechanism to other parts of your system. You’re telling other parts of your system that something occurred. That “something” is generally that some state change or side effect has occurred.

Other parts of your system can then become dependent on events being published when certain things occur, mainly if various parts of your system are used in a workflow or business process driven by events.

For example, when a payment is processed in the Payment Service, a PaymentCompleted event is published to Kafka. The Inventory Service consumes the PaymentCompleted event and decreases inventory levels based on the Order.

What happens if you make a state change to MySQL, but fail to publish an event to Kafka?

In their example, they process a payment and persist it in MySQL, but it fails to publish the PaymentCompleted event. This means that now the inventory is inconsistent with paid Orders.

One solution to this is using the Outbox Pattern. I’ve covered it in another blog post, but the gist is that you persist your events with your business state in the same transaction into your primary database. Then separately, often in another process or thread, you publish the event. If the event is published successfully, you then delete that event from your primary database.

Another option, which they chose, is to have separate durable storage for the events in case of a failure to publish to Kafka. Then you publish the events from that fallback durable storage. It’s a similar concept, except it’s not guaranteed, since saving your state and your event to separate durable storage isn’t atomic (there’s no distributed transaction).

Event Sourcing

One widespread misconception is that Event Sourcing involves using the events as a mechanism for state and for communicating with other service boundaries. Conflating these two ideas can cause a whole lot of complexity.

Event Sourcing is about using events as a way to persist state, using events that represent state transitions. This has nothing to do with publishing these events as a mechanism for communication with other services.

Events in Event Sourcing are implementation details within a single service boundary. They are internal.

Event Driven Architecture Pitfalls

This means you can choose to use event sourcing and not publish events for other services to consume.

You could also choose not to use event sourcing for any service and publish events for other service boundaries to decouple.

Don’t conflate the two concepts of state and communication.

Distributed Tracing

Another challenge, which has been getting better over recent years, especially with OpenTelemetry, is visualizing a workflow in an event-driven architecture.

It isn’t easy to understand all the different services involved when you’re decoupled through publish/subscribe. The entire point is decoupling, which makes it difficult to see the causation and correlation. You have services consuming events and publishing events.

When event choreography is involved, it can be challenging to see the start and end of a workflow. What if something failed midway through? How do you know some business process isn’t completed or is in a “hanging” state? You need visibility. Check out my post on Distributed Tracing using OpenTelemetry and Zipkin.

Claim Check

Large messages aren’t good. They can be a problem because they can overwhelm your broker or event log, such as Kafka. You don’t want to transfer large message payloads over the wire for every consumer of the broker. Generally, you want to keep event/message payloads small, but how do you do that if you have a message that contains a large image?

The Claim Check Pattern solves this by having the message/event reference where the full contents are stored.

As an example, a large image may be persisted in blob storage. The event/message will contain an identifier that the consumer will use to know where to locate the file in blob storage. This way, the consumer can retrieve the large payload (image) from blob storage rather than from the message itself.
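A minimal sketch of the pattern, with a hypothetical BlobStorage client and reusing the hypothetical MessageBroker interface from the earlier sketches:

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical blob-storage client.
interface BlobStorage {
  put(key: string, data: Buffer): Promise<void>;
  get(key: string): Promise<Buffer>;
}

async function publishImageUploaded(blobs: BlobStorage, broker: MessageBroker, image: Buffer): Promise<void> {
  const blobKey = `images/${randomUUID()}`;
  await blobs.put(blobKey, image); // the large payload goes to blob storage
  await broker.publish({ type: 'ImageUploaded', blobKey }); // the event stays small
}

async function onImageUploaded(blobs: BlobStorage, event: { blobKey: string }): Promise<void> {
  const image = await blobs.get(event.blobKey); // redeem the "claim check"
  // ... process the image ...
}
```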

Check out my post on the Claim Check Pattern for more.

Idempotent Consumers

Duplicate events will occur. This means that consumers need to be prepared to consume the same event more than once. There are various reasons for this, including a producer publishing a different event with the same payload. Another reason can be the Outbox Pattern mentioned above.

Using my outbox pattern example, if the PaymentCompleted event is consumed by the Inventory service more than once, it will deplete the inventory levels more than they should.

You want your consumers to be idempotent. You want to handle the same event without having a negative side effect.

How you implement this greatly depends on the types of events you publish. If you’re publishing Change Data Capture (CDC) or “Entity Changed” events, you’d want a version on each event that indicates which version of the entity it was when the event was published. This way, consumers can keep track of which version they have and only process the event if it’s newer than their current version.

I generally try to avoid these styles of events and focus more on domain events involved in a workflow. A unique ID associated with every event can be tracked to know if you’re processing an event more than once.
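As a minimal sketch of that ID-based approach (ProcessedStore is a hypothetical placeholder; ideally recording the ID and applying the side effect happen in the same transaction):

```typescript
// Hypothetical store of processed event IDs.
interface ProcessedStore {
  // Records the ID; returns false if it was already recorded (a duplicate).
  tryRecord(eventId: string): Promise<boolean>;
}

declare function decreaseInventory(orderId: string): Promise<void>;

async function onPaymentCompleted(
  store: ProcessedStore,
  event: { eventId: string; orderId: string },
): Promise<void> {
  if (!(await store.tryRecord(event.eventId))) {
    return; // duplicate event: skip the side effect
  }
  await decreaseInventory(event.orderId); // runs once per unique event ID
}
```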

Check out my post on creating Idempotent Consumers for more.

Event Driven Architecture Pitfalls

While Event-Driven Architecture is a great way to build a robust, decoupled system, it has a lot of gotchas and pitfalls that you need to be aware of. Hopefully, this post provides some more insights so you don’t have to figure it all out on your own! All of the problems you’ll run into have well-established solutions/patterns that have been around for a long time.


Do you want to use Kafka? Or do you need a Queue?

Do you want to use Kafka? Or do you need a message broker and queues? While they can seem similar, they have different purposes. I’m going to explain the differences, so you don’t try to brute force patterns and concepts in Kafka that are better used with a message broker.


Partitioned Log

Kafka is a log. Specifically a partitioned log. I’ll discuss the partition part later in this post and how that affects ordered processing and concurrency.

Kafka Log

When a producer publishes new messages, generally events, to a log (a topic in Kafka), they are appended to the end of the log.
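For example, producing to a topic with the kafkajs client looks roughly like this:

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'order-service', brokers: ['localhost:9092'] });
const producer = kafka.producer();

async function publishOrderPlaced(orderId: string): Promise<void> {
  await producer.connect();
  // The message is appended to the end of one of the topic's partitions;
  // messages with the same key land on the same partition.
  await producer.send({
    topic: 'orders',
    messages: [{ key: orderId, value: JSON.stringify({ type: 'OrderPlaced', orderId }) }],
  });
}
```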

Kafka Log

Events aren’t removed from a topic unless defined by the retention period. You could keep all events forever or purge them after a period of time. This is an important aspect to note in comparison to a queue.

With an event-driven architecture, you can have one service publish events and have many different services consuming those events. It’s about decoupling. The publishing service doesn’t know who is consuming, or if anyone is consuming, the events it’s publishing.

Consumers

In this example, we have a topic with three events. Each consumer works independently, processing messages from the topic.

Consumers

Because events are not removed from the topic, a new consumer could start consuming the first event on the topic. Kafka maintains an offset per topic, per consumer group, and partition. I’ll get to consumer groups and partitions shortly. This allows consumers to process new events that are appended to the topic. However, this also allows existing consumers to re-process existing messages by changing the offset.
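As a sketch with kafkajs, a consumer joins a consumer group and can start from the beginning of the topic:

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'email-service', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'email-service' });

async function run(): Promise<void> {
  await consumer.connect();
  // fromBeginning: a brand-new consumer group starts at the earliest
  // event still retained in the topic.
  await consumer.subscribe({ topic: 'orders', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      // The offset is tracked per consumer group and partition; consuming
      // here doesn't remove the event or affect other groups.
      console.log(partition, message.offset, message.value?.toString());
    },
  });
}
```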

Just because a consumer processes an event from a topic does not mean that they cannot process it again or that another consumer can’t consume it. The event is not removed from the topic when it’s consumed.

Commands & Events

A lot of the trouble I see with using Kafka revolves around applying various patterns or semantics typical with queues or a message broker and trying to force it with Kafka. An example of this is Commands.

There are two kinds of messages. Commands and Events. Some will say Queries are also messages, but I disagree in the context of asynchronous messaging.

Commands are about invoking behavior. There can be many producers of a command. There is a required single consumer of a command. The consumer will be within the logical boundary that owns the definition/schema of the command.

Sending Commands

Events are about notifying other parts of your system that something has occurred. There is only a single publisher of an event. The logical boundary that publishes an event owns the schema/definition. There may be many consumers of an event or none.

Publishing Events

Commands and events have different semantics. They have very different purposes, which also affects how they relate to coupling.

Commands vs Events

By this definition, how can you publish a command to a Kafka topic and guarantee that only a single consumer will process it? You can’t.

Queue

This is where a queue and a message broker differ.

When you send a command to a queue, there’s going to be a single consumer that will process that message.

Queue Consumer

When the consumer finishes processing the message, it will acknowledge back to the broker.

Queue Consumer Ack

At this point, the broker will remove the message from the queue.

Queue Consumer Ack Remove Message

The message is gone. The consumer cannot consume it again, nor can any other consumer.
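For comparison, a rough sketch of consuming and acknowledging from a queue with amqplib (RabbitMQ):

```typescript
import amqp from 'amqplib';

async function consumePlaceOrder(): Promise<void> {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('place-order');

  await channel.consume('place-order', (msg) => {
    if (msg === null) return;
    // ... handle the command ...

    // The ack tells the broker we're done; the broker removes the message
    // from the queue, so no consumer will ever see it again.
    channel.ack(msg);
  });
}
```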

Consumer Groups & Partitions

Earlier, I mentioned consumer groups and partitions. A consumer group is a way to have multiple consumers consume from the same topic. It’s a way to scale and process more messages from a topic concurrently, known as the competing consumers pattern.

A topic is divided into partitions. Events are appended to a partition within a topic. There can only be one consumer within a consumer group that processes messages from a partition.

This means you will process messages from a partition sequentially, which allows for ordered processing.

As an example of the competing consumers pattern, let’s say we have two partitions in a topic, each currently holding a single event. We have two consumers in a single consumer group. We’ve defined that the top consumer will consume from the top partition, and the bottom consumer will consume from the bottom partition.

Kafka Partition

This means that each consumer within our consumer group can process each message concurrently.

Kafka Partition

If we publish another message to the top partition, this means the top consumer again is the one responsible for consuming it. If it was busy processing another message, the bottom consumer, even if it’s available, will not consume it. Only the top consumer is associated with the top partition.

This allows you to consume messages in order, so long as you associate them with the same partition.

In contrast, the competing consumers pattern with a queue works slightly differently, as we don’t have partitions.

Say we have two messages in a queue and two consumers within a single consumer group.

Competing Consumers

Messages are consumed by any free/available consumer. Because there are two free consumers, both messages will be consumed concurrently.

Competing Consumers

Even though messages are distributed FIFO (First-in-First-Out), that doesn’t mean we will process them in order.

Why does this matter? With Kafka partitions, you can process messages in order. Because there is only a single consumer within a consumer group associated with a partition, you’ll process them one by one. This isn’t possible with queues. The downside is that if you publish messages to a partition faster than you can consume them, you can end up in a backlog disaster.

Kafka or Message Broker Queues & Topics?

Hopefully, this post (and video) illustrated some of the differences. The primary issue I’ve come across is people using Kafka but trying to apply patterns and concepts (commands, competing consumers, dead letter queues) that are typical with a message broker using queues and topics, but it just doesn’t fit.

Typically, when you’re creating an asynchronous workflow, you’re consuming events and sending commands. While you technically can create a topic for commands, you can’t guarantee there won’t be more than a single consumer. Is this a big deal? To me, semantics matter. If you’re already using Kafka and don’t want to introduce another piece of infrastructure like a queue/message broker, then I understand the reasoning for doing so.

It’s also important to understand the differences in how the competing consumers pattern works. If you’re not configured correctly and are publishing to a single partition, then you can’t increase throughput by adding another consumer to the consumer group.
