Untangling the many aspects of EDA

With the popularity of Microservices, Kafka, and Event Sourcing, the term “Event” has become pretty overloaded, causing much confusion about what EDA (Event-Driven Architecture) actually is. This confusion leads to conflating different concepts, which in turn leads to unneeded technical complexity. I will shed some light on different aspects of EDA, such as Event Sourcing, Event-Carried State Transfer, and Events for Workflows.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

“Event”

If you ask people how they apply event-driven architecture, you’ll likely get many different answers. You need specifics about how they use events, because events serve many different purposes.

People may be referring to EDA when they use events for state persistence, data distribution, or notifications. Broken down further, those relate to Event Sourcing, Event-Carried State Transfer, Domain Events, Integration Events, and Workflow Events.

Let’s dig into all this to answer the question: “What do you mean by event?”

Event Sourcing

Event sourcing is about using events as a way of persisting state. Full stop. It has nothing to do with communication between service boundaries. It’s about state.
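To make “events as state” concrete, here’s a minimal sketch in Python. The ShoppingCart, ItemAdded, and ItemRemoved names are hypothetical and not tied to any particular event store; the point is that the append-only stream of events is the source of truth, and current state is rebuilt by replaying it.

```python
from dataclasses import dataclass, field

# Hypothetical events for a shopping cart stream. The source of truth is
# the ordered list of events, not a row in a table.
@dataclass
class ItemAdded:
    sku: str
    quantity: int

@dataclass
class ItemRemoved:
    sku: str

@dataclass
class ShoppingCart:
    items: dict = field(default_factory=dict)

    def apply(self, event):
        # Current state is derived by folding events, never stored directly.
        if isinstance(event, ItemAdded):
            self.items[event.sku] = self.items.get(event.sku, 0) + event.quantity
        elif isinstance(event, ItemRemoved):
            self.items.pop(event.sku, None)
        return self

# "Persisting" is appending to the stream; "loading" is replaying it.
stream = [ItemAdded("ABC", 2), ItemAdded("XYZ", 1), ItemRemoved("XYZ")]
cart = ShoppingCart()
for event in stream:
    cart.apply(event)

print(cart.items)  # {'ABC': 2}
```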

Check out my post Event Sourcing Example & Explained in plain English for more of a primer, or if you think you understand Event Sourcing.

Greg Young posted a snippet of a book he’s working on, in which he wanted a simple and clear definition to explain Event Sourcing.

Event Sourcing

Event Sourcing is often confused with Event Streaming, or with the idea that you must use Event Sourcing in order to communicate with other service boundaries via events. That confusion often results in the events persisted by Event Sourcing being used as a form of data distribution (more on that below).

Don't integrate at the DB

How you persist state is an internal implementation detail of your service boundary. You provide public APIs or contracts to expose any state within your service boundary. Other services should not reach directly into your service boundary’s database to query or write data. We don’t do this. We expose APIs as contracts for this and version them accordingly. State persistence is an internal implementation detail, so if you use Event Sourcing to persist state, the same rules still apply: other service boundaries cannot query your event store directly.

Data Distribution

Events are often used as a way to distribute data to other services. This is called Event-Carried State Transfer, as the event payload contains entity-related data. While this can have utility, I’m often very concerned about distributing data.

Why do you need data from another service boundary? Most often, the answer is for query or UI composition purposes. If that’s the case, check out my post The Challenge of Microservices: UI Composition.

If you need data from another service boundary to perform a command/operation, then realize that if you keep a local cache, you’re working with stale data and you won’t have any consistency guarantees.

Why would we want a local cache to begin with? The route most people take to land here is that they first publish events to notify other service boundaries that some entity state has changed. Other service boundaries (consumers) then process these events by making a synchronous RPC call back to the publisher to get all the current data related to the entity that changed.

Because this callback to the publisher has implications, such as latency, increased traffic, and availability concerns, the next logical step is to include the data in the event itself so that no callback to the publisher is required. This is why it’s termed Event-Carried State Transfer.

You may have noticed that I used “Entity” a few times. This is because these types of events are often more entity-centric. As an example, they might be ProductChanged or ProductPriceChanged.

Event Carried State Transfer

This is often caused by the service itself being CRUD-driven rather than task-based. If you are consuming these CRUD/entity-type events, you do not really know why something changed, just that data related to an entity changed. If you consume a ProductUpdated event, why was it updated? Was there a price increase? You would need to infer the reason for the change without any certainty.
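As a sketch (a hypothetical payload, not from a real system), an Event-Carried State Transfer event drags a snapshot of the entity along with it so consumers can maintain a local cache, but it says nothing about why the data changed:

```python
# Hypothetical ProductChanged event used for Event-Carried State Transfer.
# The payload is a snapshot of the entity so consumers can update a local
# cache without calling back to the publisher.
product_changed = {
    "type": "ProductChanged",
    "productId": "prod-123",
    "name": "Widget",
    "description": "A standard widget",
    "price": 24.99,       # Did the price change? The name? We can't tell.
    "onSale": False,
    "occurredAt": "2023-01-15T10:00:00Z",
}

# Consumers typically just upsert this into their local copy of the product.
local_product_cache = {}
local_product_cache[product_changed["productId"]] = product_changed
```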

Check out my post Event Carried State Transfer: Keep a local cache! for more on where this is applicable and where you should avoid it.

Notifications

Events used for notifications within EDA are generally more explicit. They notify other service boundaries (or your own) that a business event has occurred. Events as notifications generally do not contain much data other than identifiers. They are used in Event Choreography or Orchestration to execute long-running business processes or workflows.

These are the types of events that relate to business concepts and are often driven by a task-based UI. As examples, ProductPriceIncreased, ProductDiscontinued, or FlashSaleScheduled. By looking at these event names, you can tell explicitly what they are and what occurred; ProductChanged does not tell you that. These events explicitly define what has occurred and why, as they are directly related to the business capabilities your system provides.
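For contrast, here’s a sketch of events used as notifications, again with hypothetical names and fields: they state the business fact explicitly and carry little more than identifiers and the values that describe what happened.

```python
# Hypothetical notification events: explicit about what happened and why,
# carrying identifiers rather than the full entity state.
product_price_increased = {
    "type": "ProductPriceIncreased",
    "productId": "prod-123",
    "oldPrice": 19.99,
    "newPrice": 24.99,
    "occurredAt": "2023-01-15T10:00:00Z",
}

product_discontinued = {
    "type": "ProductDiscontinued",
    "productId": "prod-123",
    "occurredAt": "2023-02-01T09:30:00Z",
}

# A consumer can react to the business fact directly, no guessing required.
def on_product_price_increased(event):
    print(f"Re-price open quotes for {event['productId']}: now {event['newPrice']}")

on_product_price_increased(product_price_increased)
```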

Events used as notifications come in a couple of forms: Domain Events and Integration Events. Personally, I prefer to call these Inside Events and Outside Events.

Inside Events (Domain) are within your service boundary. Outside Events (Integration) are for other service boundaries to consume. Why the distinction? Because inside events are internal implementation details about how you communicate within a boundary, and they can be versioned much differently than outside (integration) events. Once you publish an event for other service boundaries to consume, and they rely on that event and its schema, you have to deal with versioning. With inside events, your versioning strategy is much different because you control the consumers; with outside (integration) events, you may have little control over the consumers.

EDA Tooling

What you’re using events for in EDA determines what type of tooling you need. Are you event sourcing? Then you’ll want a database such as Event Store that is built around event streams and provides optimistic concurrency, subscription models for projections, and more.

Are you using events as notifications for workflows and long-running business processes? You likely want a queue-based broker like RabbitMQ and a messaging library like NServiceBus that facilitates the many messaging patterns involved in using events as notifications.

Are you using events to distribute data? For this purpose, you might want to look at event streaming platforms like Kafka.

While some tools claim they can provide all the functionality outlined here, trying to mimic it is sometimes forcing the issue: a square peg, round hole kind of situation.

All these different uses for events are not an either-or. You could be using Event Sourcing without anything else. You could be using events as notifications without Event Sourcing. You could be doing both. They all have different purposes, and understanding that will help you avoid shooting yourself in the foot by conflating different concepts.

Troubleshooting Kafka with 2000 Microservices

Event-Driven Architecture is excellent for loose coupling between services. But as with everything, it comes with its own challenges and complexity, one of which is troubleshooting. I will dive into issues Wix.com experienced using Kafka with over 2000 microservices. While this might not be your scale, you’ll likely encounter the same situations when working with Kafka or Event-Driven Architecture.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Tracing

One of the common issues people face with Event-Driven Architecture is understanding the flow of events and interactions that occur. Because of the decoupling of producers and consumers, it can be difficult to understand how a business process or workflow is executed. This is especially true if you’re using Event Choreography. If you’re unfamiliar with Event Choreography, check out my post Event Choreography for Loosely Coupled Workflow.

Consumers are processing events and publishing their own events. As you can expect, this can be difficult to debug if there are any issues along the way. That’s why distributed tracing is an important aspect if you have a lot of long-running business processes that are driven by events.

Because of the asynchrony of an event-driven architecture, you typically want to see when an event was published, which other services consumed it, and when. Did those consumers publish their own events, and who consumed those? In other words, you want to correlate all events back to the very beginning and also have causation from one event to the next.
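One common way to get that is to carry correlation and causation identifiers in every event’s metadata. Here’s a minimal, hypothetical sketch of the idea; in practice, messaging libraries and standards like OpenTelemetry propagate this kind of context for you.

```python
import uuid

def new_event(event_type, data, caused_by=None):
    """Build an event envelope with correlation/causation metadata."""
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "data": data,
        # The correlation ID ties every event in the workflow back to the
        # original trigger; the causation ID points at the direct parent.
        "correlationId": caused_by["correlationId"] if caused_by else str(uuid.uuid4()),
        "causationId": caused_by["id"] if caused_by else None,
    }

# OrderPlaced starts the workflow; downstream consumers keep the chain intact.
order_placed = new_event("OrderPlaced", {"orderId": "42"})
payment_taken = new_event("PaymentTaken", {"orderId": "42"}, caused_by=order_placed)
order_shipped = new_event("OrderShipped", {"orderId": "42"}, caused_by=payment_taken)

assert payment_taken["correlationId"] == order_placed["correlationId"]
assert order_shipped["causationId"] == payment_taken["id"]
```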

Over the last several years, distributed tracing has become a lot easier because of tools like OpenTelemetry. This allows multiple different services to contribute tracing data for an entire business process.

As an example, here’s an inbound request that starts from ASP.NET Core and then flows through a message broker to various other services (Sales, Billing, Shipping). You can also see internal tracing such as database or HTTP requests that occurred.

Zipkin

Check out my post on Distributed Tracing to discover a Distributed BIG BALL of MUD for more.

Ops

Continuing with visualization, another common need is visibility into your topics: looking at the messages published to a topic and understanding where a given consumer is within it. While Wix.com uses Kafka, this applies just as much if you’re using a queue-based broker such as RabbitMQ. You want to be able to see what messages are in a topic or queue and what their payloads are.

If an event has a bad/incorrect payload, either bad data or bad schema, you may need to manually fix that data so it can be consumed correctly. Having a dashboard, UI, and tooling around that is often required.

There are different tools depending on the messaging platform you’re using; however, I find this space to still be pretty limited. This is why it seems most teams end up creating their own internal visualization, as Wix.com did.

Consumer Lag

Consumer lag occurs when events are produced faster than you can consume them. Often there are peaks and valleys in producing events; during the valleys, consumers can catch up so they are at, or close to, the most recent event published. With a queue-based broker, you would think of this simply as a queue backlog. Lag can happen for many reasons, one of them being that the consumer cannot process an event.

The first aspect of this is knowing your topic/queue depth, processing times, latency, and other metrics. Having these metrics allows you to know pre-emptively when something might be going wrong and to alert on it.

If you’re producing 10 messages per second and, for whatever reason, you can only consume 5 messages per second, you’ll develop consumer lag that you’ll never catch up from.
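As a rough sketch of the metric itself (hypothetical offsets rather than a real broker/admin API): per-partition lag is simply the broker’s latest offset minus the consumer group’s committed offset, and you alert when it exceeds a threshold or keeps growing.

```python
# Hypothetical end offsets reported by the broker and offsets committed by a
# consumer group. Real values would come from your broker's admin/metrics API.
end_offsets = {"orders-0": 10_500, "orders-1": 9_800}
committed_offsets = {"orders-0": 10_450, "orders-1": 7_200}

lag = {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}
print(lag)  # {'orders-0': 50, 'orders-1': 2600}

# Alert when any partition falls too far behind (or when lag keeps growing).
THRESHOLD = 1_000
for partition, behind in lag.items():
    if behind > THRESHOLD:
        print(f"ALERT: {partition} is {behind} messages behind")
```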

With Kafka, a single partition is assigned to a single consumer within a consumer group. The benefit of this is that it allows you to process messages from a partition in order. The downside is that you don’t have competing consumers.

In the example below, there are two partitions (P1 & P2) and three consumers. However, each partition can only be assigned to one consumer within the group, which means one of the consumers isn’t assigned any partition at all.

Kafka Partitions

The benefit here is that because there’s a single consumer per partition, you get ordered processing. However, the trade-off is that you can’t have competing consumers to scale out processing of that partition.

If you disproportionately produce more messages to a single partition, that’s another reason why you may develop consumer lag.
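Skew usually comes down to the partition key. Here’s a conceptual sketch (Kafka’s default keyed partitioner uses a murmur2 hash modulo the partition count; Python’s plain hash() stands in for it here): if one key dominates, one partition, and its single consumer, takes nearly all the traffic.

```python
from collections import Counter

NUM_PARTITIONS = 2

def partition_for(key: str) -> int:
    # Conceptual stand-in for a keyed partitioner (hash of the key mod partitions).
    return hash(key) % NUM_PARTITIONS

# If one key dominates (say, one big tenant or one hot product), its partition
# gets most of the traffic and that partition's single consumer can't keep up.
keys = ["big-tenant"] * 90 + ["tenant-a"] * 5 + ["tenant-b"] * 5
print(Counter(partition_for(k) for k in keys))
# e.g. Counter({0: 95, 1: 5}) -- the hot key's partition takes nearly all messages
```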

Failures

As mentioned, consumer lag can also be caused by failing to process a message. If this occurs, what do you do with the message you can’t process? With a queue-based broker, you may move the message to a dead letter queue. With Kafka, you’d have to copy the message to a “Dead-Letter Topic” and then move your offset (cursor) past the failing event in the topic.
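A sketch of that pattern, with hypothetical consume/produce/commit calls standing in for your Kafka client: on failure, copy the event to a dead-letter topic, then commit the offset so the partition isn’t blocked.

```python
# Hypothetical sketch: 'consumer' and 'producer' stand in for your Kafka
# client objects; 'message' is a consumed record with a .value payload.
def handle(message):
    ...  # business logic that may raise on a bad payload

def process(consumer, producer, message):
    try:
        handle(message)
    except Exception as error:
        # Copy the failing event to a dead-letter topic for inspection/repair...
        producer.produce("orders.dead-letter", value=message.value,
                         headers={"error": str(error)})
    finally:
        # ...and commit the offset either way, so the partition isn't blocked
        # behind the poison message.
        consumer.commit(message)
```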

However, if you’re using ordered processing, you might not be able to skip an event in the topic. Because of this, I often suggest looking at why you think you needed ordered processing. Check out my post Message Ordering in Pub/Sub or Queues, which elaborates on this more and gives examples of asynchronous workflows.

Troubleshooting Kafka

Regardless of whether you’re troubleshooting Kafka or a queue-based broker, or whether you have 10 services or 2000 microservices, you’ll likely run into most of the situations described in the Wix.com post and this blog if you’re using an Event-Driven Architecture. These are all very typical situations when working with an event-driven architecture, so hopefully this sheds more light on what you’ll run into.

Scaling a Monolith with 5 Different Patterns

Want strategies for scaling a monolithic application? You have many options if you have a monolith that you need to scale. If you’re thinking of moving to microservices specifically for scaling, hang on. Here are a few things you can do to make your existing monolith scale. You’d be surprised how far you can take this.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Scaling Up

When referring to scaling, we’re talking about being able to do more work in a given period. For example, process more requests per second.

The most common approach to scaling is scaling up by increasing the resources available to our system. For simplicity’s sake, let’s say we have two aspects of our system that are physically independent: our application (compute) and our database (data storage).

Depending on where the bottleneck is, we could scale up by increasing the resources (CPU and memory) on our app compute.

Scaling Up Compute

However, with any bottleneck, once we alleviate it, we may affect downstream resources, in this case our database. If we can now handle more requests in our app, we might now be overwhelming our database, in which case we would need to scale it up as well.

Scaling Up Database

All of this depends on the context of your system. Maybe it’s CPU-intensive, and you need to scale up your app while your database is fine with fewer resources. Maybe you’re more database-driven, and the database is what needs to be scaled up. It really depends on the type of system you have and where its needs are.

Scaling Out

Another common approach is to scale out by adding more instances of your app behind a load balancer. Typically you’d scale out your app compute, because scaling out your database is usually more difficult, depending on which type of database you’re using.

Scaling Out Compute

This means you have multiple instances of the same app running, and the load balancer, typically via round robin, distributes incoming requests to the different instances. While this helps with scale, it also helps with availability.

Another aspect of scaling out that isn’t mentioned as much, related to availability, is directing specific traffic to specific instances of your app.

Scaling Compute Out Per Grouping

You may choose to have different resources (CPU & memory) for different app segments. In the diagram above, the top two instances of the app may handle specific inbound requests defined by rules within the load balancer. Those instances may have different CPU & memory requirements than the instance at the bottom that handles a different set of requests.

Queues

Often when we need to do more work, that doesn’t mean it needs to be done immediately as the request comes in. Often we can perform the work asynchronously, and a good method for doing this is leveraging queues. There are probably many places within an existing system where you can move work out of process and do it asynchronously using a queue.

When an inbound request comes in, we can place a message on our queue and return to the client. The enqueued message would contain all the relevant information from the initial request.

Queues

Asynchronously, the same process, or a separate one, can then pull that message from the queue and perform the work based on the contents of the message.

Workers

A really common example of this is anywhere you might generate and send an email in your system. Instead of sending the email when some action occurs, enqueue a message and do it asynchronously. Most often, you can return the initial request to the client without also sending the email within that same request; it can be done asynchronously.
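Here’s a minimal sketch of the idea using Python’s standard-library queue as a stand-in for a real broker such as RabbitMQ: the request handler enqueues a message and returns immediately, and a background worker sends the email out of band.

```python
import queue
import threading

work_queue = queue.Queue()

def place_order(order_id):
    # Handle the request, enqueue the slow work, and return to the client
    # without waiting for the email to actually be sent.
    work_queue.put({"type": "SendOrderConfirmationEmail", "orderId": order_id})
    return "order accepted"

def email_worker():
    # In a real system this would be a separate process consuming from a broker.
    while True:
        message = work_queue.get()
        print(f"sending confirmation email for order {message['orderId']}")
        work_queue.task_done()

threading.Thread(target=email_worker, daemon=True).start()
print(place_order("42"))  # returns immediately
work_queue.join()         # demo only: wait for the background work to finish
```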

Read Replica

Depending on the type of database you use, scaling may be more difficult. A common approach is to add read replicas that you can perform queries against. Since most applications perform more reads than writes, introducing read replicas allows you to scale out your database reads.

Read Replicas

Often, read replicas are eventually consistent, and there can be a lag in replicating data from the primary database to your replicas. You need to be aware of this and handle it appropriately in code if you expect to read your own writes.
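One way to handle it is to route any read that must see your own write to the primary, and everything else to a replica. A sketch of that routing rule, with hypothetical connection names:

```python
import random

class ConnectionRouter:
    """Hypothetical routing between a primary database and read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def for_write(self):
        return self.primary

    def for_read(self, must_read_own_write=False):
        # Replicas are eventually consistent; if the caller just wrote data it
        # needs to read back, route the query to the primary instead.
        if must_read_own_write or not self.replicas:
            return self.primary
        return random.choice(self.replicas)

router = ConnectionRouter(primary="primary-db", replicas=["replica-1", "replica-2"])
print(router.for_write())                          # primary-db
print(router.for_read())                           # replica-1 or replica-2
print(router.for_read(must_read_own_write=True))   # primary-db
```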

Materialized Views

Similar to read replicas is generating separate read models that are specialized specifically for queries. This involves pre-computing values and persisting them to a specialized read model. If you’re familiar with Event Sourcing, this is what Projections are.

Materialized views, since they are pre-computed, give you specialized views specifically for queries. As mentioned, since most systems are more read-intensive than write-intensive, this lets you optimize complex data composition ahead of time rather than doing it at query time.

Materialized Views
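Here’s a sketch of a projection building such a read model (hypothetical events and shape): the denormalized summary is updated as events happen, so a query is just a lookup rather than a join or aggregation at runtime.

```python
# Hypothetical order events feeding a pre-computed "order summary" read model.
events = [
    {"type": "OrderPlaced", "orderId": "42", "customer": "Jane"},
    {"type": "ItemAdded", "orderId": "42", "price": 10.00},
    {"type": "ItemAdded", "orderId": "42", "price": 5.50},
]

order_summaries = {}

def project(event):
    # Update the materialized view incrementally instead of composing it per query.
    if event["type"] == "OrderPlaced":
        order_summaries[event["orderId"]] = {"customer": event["customer"], "total": 0.0, "items": 0}
    elif event["type"] == "ItemAdded":
        summary = order_summaries[event["orderId"]]
        summary["total"] += event["price"]
        summary["items"] += 1

for e in events:
    project(e)

print(order_summaries["42"])  # {'customer': 'Jane', 'total': 15.5, 'items': 2}
```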

Caching

First, caching isn’t easy. Check out my post The Complexity of Caching before you go down this path. Caching is useful for reducing the load on your read replicas or primary database from those pesky queries I keep mentioning. Similar to materialized views, you can choose to cache values that are pre-computed or in a shape that is more appropriate for a query in a given context.

Caching
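As a sketch, here’s the common cache-aside pattern with an in-memory dict and TTL standing in for Redis or similar; the hard parts (invalidation and staleness) are what the linked post is about.

```python
import time

cache = {}
CACHE_TTL_SECONDS = 60

def expensive_query(product_id):
    # Stand-in for the pesky query that would hit the primary or a read replica.
    return {"productId": product_id, "name": "Widget", "price": 24.99}

def get_product(product_id):
    entry = cache.get(product_id)
    if entry and time.monotonic() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]                            # cache hit
    value = expensive_query(product_id)            # cache miss: load...
    cache[product_id] = (time.monotonic(), value)  # ...and populate
    return value

print(get_product("prod-123"))  # miss, hits the database
print(get_product("prod-123"))  # hit, served from the cache
```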

Multi-Tenant

If you have a multi-tenant SaaS application, data can be siloed into its own database per tenant. Compute can be pooled or siloed in the same way. Or you can do both and create lanes for tenants that have their own dedicated compute and databases. There are many different options to consider. Check out my post Multi-tenant Architecture for SaaS.

Multi-Tenant
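A sketch of the siloed-data idea, with a hypothetical tenant-to-database mapping: resolve the tenant per request and route to that tenant’s database, falling back to a pooled database for the rest.

```python
# Hypothetical mapping of tenants to their own siloed databases.
TENANT_DATABASES = {
    "acme": "postgres://db-acme/app",
    "globex": "postgres://db-globex/app",
}
SHARED_DATABASE = "postgres://db-pooled/app"

def connection_string_for(tenant_id):
    # Dedicated ("siloed") databases for some tenants, a pooled one for the rest.
    return TENANT_DATABASES.get(tenant_id, SHARED_DATABASE)

print(connection_string_for("acme"))   # tenant-specific database
print(connection_string_for("newco"))  # falls back to the pooled database
```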

Mix & Match

You have a lot of options when it comes to scaling a monolith. It’s not just about scaling up; you can also scale out, both your compute and your underlying database. You can move work out of process using a queue, and create materialized views or caches alongside read replicas. Depending on your context and the size of your system, you may choose to employ some of these techniques or possibly all of them.
