BEWARE of Consumer Lag!

An essential aspect of Event-Driven Architecture is understanding how your system is performing (throughput). Everything is running smoothly, services are publishing and consuming events, and then out of nowhere one service starts failing or its throughput drops significantly, which wreaks havoc on your system. Let me explain some of the reasons this happens and why having metrics and alarms will allow you to proactively make changes to keep everything running smoothly.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Consumer Lag / Queue Length

If you’re using a typical queue-based broker, the queue length is the number of pending messages in your queue that need to be processed. The point of a queue is to have some length to it; however, things go wrong when you’re producing more messages than you can consume.

Queue

In the example above, a single consumer handles messages from a queue. It can only process one message at a time; it must complete the processing of one message before it can consume the next. If you produce messages faster than you consume them, you’ll backlog the queue.

Consumer Lag

There are peaks and valleys to processing messages. You may produce more messages in bursts, which is what queues are great for. However, over a longer duration, if you continue, on average, to produce more messages than you can consume, you’ll never be able to catch up and will backlog the queue.

Scale-Out

To increase throughput and reduce consumer lag, one option is to scale out using the competing consumers pattern. This means adding more instances of your consumer so you can process messages concurrently.

Competing Consumers

In the example above, there is a consumer that is free to process the next message in the queue, even though the other consumer is processing a message.

Competing Consumers

This means that with two consumer instances, we can process two messages concurrently, doubling our throughput.
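As a minimal in-process sketch of the competing consumers pattern, here's C# using System.Threading.Channels as a stand-in for a real broker: two consumers read from the same channel, and each message is handled by exactly one of them.

```csharp
using System.Threading.Channels;

// In-process stand-in for a queue.
var queue = Channel.CreateUnbounded<string>();

// Two competing consumer instances; each message is consumed by exactly one.
var consumers = Enumerable.Range(1, 2).Select(id => Task.Run(async () =>
{
    await foreach (var message in queue.Reader.ReadAllAsync())
    {
        Console.WriteLine($"Consumer {id} processing {message}");
        await Task.Delay(100); // simulate 100ms of work
    }
})).ToArray();

// Producer: publish a burst of messages.
for (var i = 0; i < 10; i++)
    await queue.Writer.WriteAsync($"message-{i}");

queue.Writer.Complete();
await Task.WhenAll(consumers);
```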

One thing to notice here is that the queue was acting as a bottleneck, which is the point: we need to maintain enough throughput so we don’t end up never being able to catch up. However, when we add more instances of our consumers, we can affect other downstream services or infrastructure. In this case, it’s our database.

Downstream Services

This means we are also going to increase the load on our database, which could have a negative impact on it. Be aware of moving the bottleneck and its implications for the rest of your system.

Processing Time

Another option to increase throughput is reducing the processing time. Meaning, if processing a message currently takes 200ms and we make optimizations that bring the total processing time down to 100ms, we’ve just doubled our throughput.

The exact opposite can happen, however, which can cause our processing time to increase, resulting in our queue length growing. One common reason for this is external systems.

External Services

Let’s say we need to make an HTTP call to an external service. On average, it takes 100ms to complete the request. What if, all of a sudden, there are availability issues and our call fails? How are we handling that? What if the external service is timing out, we don’t have any timeouts in place, and the call takes 60 seconds?

External Services

This would be a total disaster and cause our queue length to grow alarmingly if our production rate was, say, 5 messages per second: while one consumer is stuck for 60 seconds, roughly 300 new messages pile up behind it.
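A minimal sketch of guarding against that in C#: set an explicit timeout on the HttpClient so a hung dependency fails fast instead of holding the consumer for 60 seconds. The timeout value, URL, and failure handling here are illustrative.

```csharp
// Fail fast rather than letting a hung external call block the consumer.
var httpClient = new HttpClient
{
    Timeout = TimeSpan.FromSeconds(2) // tune to your SLA; illustrative value
};

try
{
    var response = await httpClient.GetAsync("https://external-service.example/api/orders/123");
    response.EnsureSuccessStatusCode();
}
catch (TaskCanceledException)
{
    // The timeout elapsed. Decide deliberately: retry, dead-letter, or degrade,
    // but don't let the message sit and block the queue.
}
catch (HttpRequestException)
{
    // Availability issue on the external service: same decision applies.
}
```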

Partitions and Ordering

One of the reasons I find people use Kafka is because of partitions. Kafka is designed to have a single consumer within a consumer group associated with a partition. This means it can process messages from a partition in order. It also means you cannot apply the competing consumers pattern within a partition. In other words, if you have 2 partitions and 3 consumers, only 2 of them will actually be doing any work.

Kafka Partitions

In the example above, my topic has two partitions (P1 and P2). I’ve created two consumers within a consumer group (P1 Consumer and P2 Consumer).

When a message is placed on P1, only the P1 Consumer will process it. As you might guess, what happens when any of the above scenarios occur? We’re going to experience consumer lag because more messages are being published to P1 than can be consumed.

If you’re using Kafka and partitions specifically because of ordered processing, this means if you have a poison message that you can’t consume, you’ll experience consumer lag.

Kafka Partitions

At this point, you can either discard the message, if possible, and continue, or you’ll have to block and wait until you can resolve the issue with the poison message.
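Here's a sketch of the "discard and continue" option using the Confluent.Kafka client: copy the poison message to a dead-letter topic and commit past it so the partition isn't blocked. The topic names and the Process method are illustrative.

```csharp
using Confluent.Kafka;

using var consumer = new ConsumerBuilder<string, string>(new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "orders-consumer",
    EnableAutoCommit = false // commit manually so we control when we move past a message
}).Build();
consumer.Subscribe("orders");

using var producer = new ProducerBuilder<string, string>(new ProducerConfig
{
    BootstrapServers = "localhost:9092"
}).Build();

while (true)
{
    var result = consumer.Consume(CancellationToken.None);
    try
    {
        Process(result.Message.Value);
    }
    catch (Exception)
    {
        // Poison message: copy it to a dead-letter topic so it can be inspected and fixed...
        producer.Produce("orders.dead-letter", result.Message);
    }

    // ...and commit past it either way so the partition keeps flowing.
    // If you truly need ordered processing, you'd stop here instead of skipping.
    consumer.Commit(result);
}

void Process(string payload) { /* your handling logic (illustrative) */ }
```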

Monitoring

Monitoring is critical to message-driven systems, whether you’re using a queue-based broker or an event-streaming platform. Have metrics around your throughput and alarm when you start experiencing any kind of consumer lag or queue backlog.

Queue length, processing times, throughput, and critical time are all metrics that should be monitored, with appropriate alarms when thresholds are reached. Be proactive and understand how you plan on increasing throughput.
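As one example of the kind of metric to watch, here's a rough consumer-lag check with the Confluent.Kafka client: lag is the high watermark minus the committed offset. The topic, group, and threshold are illustrative; in practice you'd export this through your metrics library (or use the broker's own lag metrics) and alarm on it.

```csharp
using Confluent.Kafka;

using var consumer = new ConsumerBuilder<Ignore, Ignore>(new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "orders-consumer"
}).Build();

var partition = new TopicPartition("orders", 0);

// High watermark = offset of the latest message produced to the partition.
var watermarks = consumer.QueryWatermarkOffsets(partition, TimeSpan.FromSeconds(5));

// Last offset this consumer group has committed for the partition.
// Note: if the group hasn't committed anything yet, the offset will be Unset.
var committed = consumer.Committed(new[] { partition }, TimeSpan.FromSeconds(5)).Single();

var lag = watermarks.High.Value - committed.Offset.Value;
if (lag > 1_000)
    Console.WriteLine($"ALARM: consumer lag on {partition} is {lag} messages");
```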

Troubleshooting Kafka with 2000 Microservices

Event-Driven Architecture is excellent for loose coupling between services. But as with everything, it comes with its challenges and complexity. One of which is Troubleshooting. I will dive into issues Wix.com experienced when using Kafka with over 2000 microservices. While this might not be your scale, you’ll likely encounter the same situations when working with Kafka or Event-Driven Architecture.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Tracing

One of the common issues people face with Event-Driven Architecture is understanding the flow of events and interactions that occur. Because of the decoupling of producers and consumers, it can be difficult to understand how a business process or workflow is executed. This is especially true if you’re using Event Choreography. If you’re unfamiliar with Event Choreography, check out my post Event Choreography for Loosely Coupled Workflow.

Consumers are processing events and publishing their own events. As you can expect, this can be difficult to debug if there are any issues along the way. That’s why distributed tracing is an important aspect if you have a lot of long-running business processes that are driven by events.

Because of the asynchrony of an event-driven architecture, typically, you’d want to see when an event was published, which other services consumed it, and when. Also, did they publish their own events, and who consumed those? In other words, you want to correlate all events from the very beginning and also have causation from one event to the next.

Over the last several years, distributed tracing has become a lot easier because of tools like OpenTelemetry. This allows multiple different services to contribute tracing data for an entire business process.
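In .NET, wiring this up is typically a few lines at startup. A sketch, assuming the OpenTelemetry.Extensions.Hosting package plus the relevant instrumentation and exporter packages; the service name and ActivitySource are illustrative.

```csharp
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("Sales"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()  // inbound HTTP requests
        .AddHttpClientInstrumentation()  // outbound HTTP calls
        .AddSource("Sales.Messaging")    // your own ActivitySource around publish/consume
        .AddZipkinExporter());           // or OTLP, Jaeger, etc.

var app = builder.Build();
app.Run();
```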

As an example, here’s an inbound request that starts from ASP.NET Core and then flows through a message broker to various other services (Sales, Billing, Shipping). You can also see internal tracing such as database or HTTP requests that occurred.

Zipkin

Check out my post on Distributed Tracing to discover a Distributed BIG BALL of MUD for more.

Ops

Continuing with visualization, another common need is visibility into your topics: looking at the messages published to them and understanding where a given consumer is within a topic. While Wix.com uses Kafka, this is just as true if you’re using a queue-based broker such as RabbitMQ. You want to be able to see what messages are in a topic and what their payloads are.

If an event has a bad/incorrect payload, either bad data or bad schema, you may need to manually fix that data so it can be consumed correctly. Having a dashboard, UI, and tooling around that is often required.

There are different tools depending on the messaging platform you’re using; however, I find this space to still be pretty limited. This is why it seems most end up creating their own internal visualization, as Wix.com did.

Consumer Lag

Consumer Lag occurs when events are being produced faster than you can consume them. Often there are peaks and valleys for producing events. During the valleys is when consumers can catch up so they are at or close to the most recent event published. Using a queue-based broker, you would think of this just as a queue backlog. This can happen for many reasons, one of them being the consumer cannot process an event.

The first aspect of this is knowing your topic/queue depth, processing times, latency, and other metrics. Having these types of metrics allows you to pre-emptively know when something might be going wrong and alarm.

If you’re producing 10 messages per second, and for whatever reason, you can only consume 5 messages per second, you’ll develop consumer lag that you’ll never catch back up.

With Kafka, a single partition is associated with a single consumer within a consumer group. The benefit is that it allows you to process messages from a partition in order. The downside is that you don’t have competing consumers.

In the example below, there are two partitions (P1 & P2) and three consumers. However, each partition is assigned to only one consumer within the group. This means that one consumer isn’t associated with any partition.

Kafka Partitions

The benefit here is that, because there’s a single consumer per partition, you get ordered processing. However, the trade-off is that you can’t scale out with competing consumers.

If you produce a disproportionate number of messages to a single partition, that’s another reason you may develop consumer lag.
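That skew usually comes from the message key, since the key determines the partition. A small producer sketch with the Confluent.Kafka client; the topic, key choice, and order shape are illustrative.

```csharp
using System.Text.Json;
using Confluent.Kafka;

using var producer = new ProducerBuilder<string, string>(new ProducerConfig
{
    BootstrapServers = "localhost:9092"
}).Build();

var order = new { OrderId = Guid.NewGuid(), CustomerId = "customer-42" }; // illustrative

// The key is hashed to pick the partition. Key by something that spreads load
// while still preserving the ordering you actually need; keying everything by
// one hot value funnels traffic into a single partition that only one consumer
// in the group can drain.
await producer.ProduceAsync("orders", new Message<string, string>
{
    Key = order.CustomerId,
    Value = JsonSerializer.Serialize(order)
});
```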

Failures

As mentioned, consumer lag can also be caused by failing to process a message. If this occurs, what do you do with the message you can’t process? In a queue-based broker, you may move the message to a dead letter queue. With Kafka, you’d have to copy the message to a “Dead-Letter Topic”, and then move your cursor/index past the failing event in the topic.

However, if you’re using ordered processing, you might not be able to skip an event in the topic. Because of this, I often suggest looking at why you think you needed ordered processing. Check out my post Message Ordering in Pub/Sub or Queues, which elaborates on this more and gives examples of asynchronous workflows.

Troubleshooting Kafka

Regardless of whether you’re troubleshooting Kafka or a queue-based broker, or whether you have 10 services or 2000 microservices, you’ll likely run into most of the situations described in the Wix.com post and this blog if you’re using an Event-Driven Architecture. These are all very typical situations when working with event-driven systems, so hopefully this sheds more light on what you’ll run into.

Don’t Fail Publishing Events!

Consistency is critical when working with an event-driven architecture. You must ensure that when you make state changes to your database, the relevant events you want to publish are published. You can’t fail publishing events. Events are first-class citizens, and when events drive workflows and business processes, they rely on this consistency between state changes and published events.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Atomicity

Here’s an example of placing an order. At the very bottom of this method, we publish an OrderPlaced event and then save our state changes to the database.
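The original shows this as a code screenshot; roughly, the method has the shape below. The types, broker client, and ORM here are illustrative.

```csharp
public async Task PlaceOrder(PlaceOrderRequest request)
{
    var order = new Order(request.OrderId, request.Items);
    _dbContext.Orders.Add(order);

    // The two problematic lines:
    await _messageBroker.PublishAsync(new OrderPlaced(order.Id)); // event published first...
    await _dbContext.SaveChangesAsync();                          // ...state saved after
}
```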

These last two lines are problematic. The first issue is that we have a race condition: a consumer could process the event before the state is saved to the database. This is even more likely if there is any code between publishing the event and saving the database changes.

To illustrate this, on line 24, we publish the event to the message broker.

Race Condition

Once the event is published to the broker, we could have a consumer immediately process that message.

Race Condition

If this consumer was within the same logical boundary as the publisher, let’s say to send out the confirmation email, it might reach out to the database to get the order details.

Race Condition

But since we haven’t yet saved our database changes, the order won’t exist yet in our database.

Finally, line 25 is when the order is saved to our database.

Race Condition

But what happens if saving the order to our database (line 25) fails? Now we’ve published an event (which is a fact) that an order was placed, but really, an order wasn’t placed because we didn’t save it.

If we have downstream services that are part of the workflow, this is misleading and could have many different implications and failures in different services.

We need to flip the two lines around to avoid a race condition and not publish an event without saving the order.

We still have an issue. Suppose we save the order and fail to publish the OrderPlaced event to our broker. If we have downstream services that are part of a workflow, they’ll never know an order was placed.

We can’t fail to publish an event if we have a state change.

Fallbacks

One solution is to have a fallback. If we can’t publish to the message broker, we have another place to store the event durably.

In my post on McDonald’s Journey to Event-Driven Architecture, they used a fallback for this.

Fallback Storage

In the example with McDonald’s, they used DynamoDB as their fallback storage. So if they could not publish to their message broker, they would save the event in DynamoDB. I also reviewed Wix.com – 5 Event Driven Architecture Pitfalls, where they used AWS S3 to save events in case of this type of failure.

From there, you’d have some retry mechanism that would pull the data from your durable storage and have it try and publish to your broker.

Fallback Retry

As an example, you could use a retry and fallback policy.
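For example, with Polly in .NET you could compose a retry policy with a fallback: try publishing a few times, and if that still fails, write the event to durable storage. This is a sketch; PublishAsync and SaveToFallbackStorageAsync are illustrative stand-ins for your broker client and fallback store.

```csharp
using Polly;

var retry = Policy
    .Handle<Exception>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var fallback = Policy
    .Handle<Exception>()
    .FallbackAsync(ct => SaveToFallbackStorageAsync(orderPlaced, ct)); // e.g. DynamoDB, S3, a table

// Fallback wraps retry: only if all retries fail does the event go to durable storage.
await fallback.WrapAsync(retry)
    .ExecuteAsync(ct => PublishAsync(orderPlaced, ct), cancellationToken);
```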

The downside with a fallback is you have no guarantee that you’ll even be able to save that event to durable storage if there’s a failure to publish the event to your broker. There’s no guaranteed consistency.

Outbox Pattern

Another solution, which I’ve talked about before, is the Outbox Pattern: Reliably Save State & Publish Events.

This allows you to make state changes to your database and save the event to the same database within the transaction. Your event data would be serialized in an “outbox” table/collection within the same database as your business data.
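A sketch of the write side using EF Core: the order and its serialized OrderPlaced event are saved in the same transaction, so neither can exist without the other. OutboxMessage, OrderPlaced, and the DbContext shape are illustrative.

```csharp
public async Task PlaceOrder(Order order)
{
    _dbContext.Orders.Add(order);

    _dbContext.OutboxMessages.Add(new OutboxMessage
    {
        Id = Guid.NewGuid(),
        Type = nameof(OrderPlaced),
        Payload = JsonSerializer.Serialize(new OrderPlaced(order.Id)),
        OccurredOn = DateTime.UtcNow
    });

    // One SaveChanges = one transaction covering both the state change and the event.
    await _dbContext.SaveChangesAsync();
}
```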

Outbox

Then you have a separate process that reads that “outbox” table/collection and deserializes it into an event.

Then it can publish that event to the message broker. If there are any failures in publishing the event, the publisher would simply keep retrying.

Outbox Publisher

Once the event is successfully published, the publisher must update the database to mark that event in the outbox table as being published.

Outbox Publisher

If there is a failure to update the outbox table, this will result in the publisher publishing the same event more than once, which requires consumers to be idempotent.
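The publisher side might look like the sketch below: publish pending outbox rows, then mark them as published. If marking fails after a successful publish, the row is picked up again on the next pass and the event goes out more than once, which is exactly why consumers must be idempotent. Again, the types and broker API are illustrative.

```csharp
public async Task PublishPending(CancellationToken ct)
{
    var pending = await _dbContext.OutboxMessages
        .Where(m => m.PublishedOn == null)
        .OrderBy(m => m.OccurredOn)
        .Take(100)
        .ToListAsync(ct);

    foreach (var message in pending)
    {
        // If this throws, the host retries the whole pass; nothing is lost.
        await _messageBroker.PublishAsync(message.Type, message.Payload, ct);

        // If this update fails, the same event will be published again later (at-least-once).
        message.PublishedOn = DateTime.UtcNow;
        await _dbContext.SaveChangesAsync(ct);
    }
}
```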

The downside to the outbox pattern is that you’re adding more load to your primary database, since the publisher is continually reading from and updating the outbox table.

Workflows

There are also workflow engines that provide guarantees around the execution of each part of a workflow. Let’s say we have a workflow with 3 distinct activities: Create Order, Publish OrderPlaced Event, and Send Confirmation Email.

Each of these activities executes independently, in isolation, and the first to execute would create and save our order to the database.

Workflow

After the Create Order activity completes, the Publish Event activity will execute to publish the OrderPlaced event to our broker.

Workflow

Now, if there’s a failure to publish the event, this activity could retry or handle the failure in various ways, depending on your tooling. Once the activity succeeds, it moves to the next, which could send out the confirmation email.

Workflow

The key is that each activity is guaranteed to run. If the Create Order activity is completed, our Publish Event will execute. This eliminates the need for a fallback or an outbox.
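A sketch against a hypothetical durable-workflow API (durable-execution tools such as Temporal or Azure Durable Functions follow this general shape; IWorkflowContext and ExecuteActivity are invented for illustration): each activity is persisted and retried until it completes, so publishing can't be silently skipped.

```csharp
public async Task PlaceOrderWorkflow(IWorkflowContext context, PlaceOrderRequest request)
{
    // Activity 1: create and persist the order.
    var orderId = await context.ExecuteActivity(() => CreateOrder(request));

    // Activity 2: guaranteed to run (with retries) once CreateOrder has completed.
    await context.ExecuteActivity(() => PublishOrderPlaced(orderId));

    // Activity 3: runs once the event has been published.
    await context.ExecuteActivity(() => SendConfirmationEmail(orderId));
}
```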

Your Mileage May Vary

Two simple lines of code can have a large impact on the consistency of your system. Don’t fail publishing events! As you can see there are different ways to handle reliably publishing events and saving state changes, and which you choose will depend on your context. Hopefully, you can see the trade-offs for each and which will fit best for you.
