Taavi's Blog


Event Architecture, a first take

2014-06-21T16:15:00-0400 | categories: programming

Events

Things don't matter unless they happen. At its core, everything is reactive.

This isn't new. Martin Fowler's written about event-based systems, and LinkedIn has a fantastic blog post, "The Log: What every software engineer should know about real-time data's unifying abstraction". What follows is just my take on it, based on a RESTful web API I worked on that evolved this way. We didn't even get as far as distributed logs or a message broker; we just had an in-memory list of events. Yet even that most basic implementation made our lives easier and our documentation (for stakeholders) better.

An example in payments

More than just "things happening", it makes sense to think of a resource's life cycle. This can be embodied by a "status" field which defines its current state, which also implies any pending work.

Let's pretend we're doing credit card payments. They go through various processes: an auth, maybe later a capture, and possibly a refund. Really, we have 4 resources then:

  1. Payment (a container for the other 3)
  2. Auth (a payment probably always has an auth that either failed or succeeded)
  3. Captures (zero or more; a capture could fail upstream for some reason)
  4. Refunds (zero or more; same as captures)
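To make the "status implies pending work" idea concrete, here is a minimal sketch of the four resources. The field names (amount_cents, status), the Status values and the pending_work helper are assumptions made for illustration, not the original system's schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    CREATED = "created"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Auth:
    status: Status = Status.CREATED


@dataclass
class Capture:
    amount_cents: int
    status: Status = Status.CREATED


@dataclass
class Refund:
    amount_cents: int
    status: Status = Status.CREATED


@dataclass
class Payment:
    """Container for the other three resources."""
    amount_cents: int
    auth: Optional[Auth] = None
    captures: List[Capture] = field(default_factory=list)
    refunds: List[Refund] = field(default_factory=list)

    def pending_work(self):
        """Children still in 'created' are work that has not finished yet."""
        children = [*self.captures, *self.refunds]
        if self.auth is not None:
            children.insert(0, self.auth)
        return [c for c in children if c.status is Status.CREATED]
```

Anything still sitting in "created" is, by definition, work nobody has finished, which is what makes the status field useful for spotting stuck resources.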

When the user creates a Payment, it should contain all the information needed to continue processing to a sensible resting place. In an imperative situation, perhaps you'd accept a payload, then talk to your upstream payment provider, and then eventually save their response to a local database.

LMETFY (let me eventify that for you)

In an event model, all a controller has to do is accept the Payment request, leave it as created, and let the cascade of events and handlers do the rest. On creating the Payment (and persisting it to disk, possibly as part of an uncommitted database transaction), you'd emit a PaymentCreated event. As far as an HTTP controller is concerned, its job is now done, except for allowing any interested handlers to do their jobs before replying with the final result.

We could have a PaymentCreatedHandler which listens for PaymentCreated and creates an Auth. The Auth would emit an AuthCreated event, which the AuthCreatedHandler would receive and then actually go out to the payment provider and do the auth. If it fails to return, we can notice that the Auth has been sitting in the created state for a while, and we can try to figure out why. But under normal circumstances the upstream auth would succeed or fail. When the handler gets this response, it persists it into the Auth record, which would emit an AuthSucceeded or AuthFailed event. In this simple system there probably isn't anything listening for either event, so no further work occurs. Maybe the Payment wants to update an aggregate status though, so it could listen for AuthSucceeded and AuthFailed.
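The cascade described above can be sketched with nothing more than an in-memory list and a handler registry. This is a rough illustration, not the original code: EventBus, its subscribe/emit methods and fake_provider_auth are hypothetical names, persistence is left out entirely, and the handlers emit events directly to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str          # e.g. "PaymentCreated"
    resource_id: int


class EventBus:
    """In-memory list of events plus a registry of handlers (hypothetical API)."""

    def __init__(self):
        self.log = []                       # every event emitted, in order
        self.handlers = defaultdict(list)   # event name -> list of callables

    def subscribe(self, name, handler):
        self.handlers[name].append(handler)

    def emit(self, event):
        self.log.append(event)
        for handler in self.handlers[event.name]:
            handler(event)


def fake_provider_auth(payment_id):
    """Stand-in for the upstream payment provider; always approves."""
    return "succeeded"


bus = EventBus()


def payment_created_handler(event):
    # Creating the Auth is what triggers the next step in the cascade.
    bus.emit(Event("AuthCreated", event.resource_id))


def auth_created_handler(event):
    result = fake_provider_auth(event.resource_id)
    name = "AuthSucceeded" if result == "succeeded" else "AuthFailed"
    bus.emit(Event(name, event.resource_id))


bus.subscribe("PaymentCreated", payment_created_handler)
bus.subscribe("AuthCreated", auth_created_handler)

# The controller's only job: record the Payment and announce it.
bus.emit(Event("PaymentCreated", 1))
names = [e.name for e in bus.log]
```

After the single emit from the controller, the log holds PaymentCreated, AuthCreated and AuthSucceeded in that order, even though the controller knows nothing about auths.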

So far I've described events and handlers, but there's a third useful abstraction lurking, and we ended up calling it an Archiver. You might also know it as a Repository (from Eric Evans' book on Domain Driven Design) or as a Data Access Object (DAO). The Archiver's responsibility is to ensure that the domain objects' state changes appropriately, and that the relevant events are emitted on those changes.
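As a sketch of what an Archiver might look like, assuming a dictionary in place of a real database and a plain callable for emitting events: the AuthArchiver class, its transition table and the InvalidTransition exception are all illustrative names, not taken from the original system.

```python
class InvalidTransition(Exception):
    """Raised when a status change is not allowed."""


class AuthArchiver:
    # Allowed status changes (assumed for illustration).
    TRANSITIONS = {
        None: {"created"},
        "created": {"succeeded", "failed"},
    }
    EVENT_NAMES = {
        "created": "AuthCreated",
        "succeeded": "AuthSucceeded",
        "failed": "AuthFailed",
    }

    def __init__(self, emit):
        self.emit = emit      # callable taking (event_name, auth_id)
        self.statuses = {}    # auth_id -> status; stands in for a database table

    def set_status(self, auth_id, new_status):
        current = self.statuses.get(auth_id)
        if new_status not in self.TRANSITIONS.get(current, set()):
            raise InvalidTransition(f"{current!r} -> {new_status!r}")
        self.statuses[auth_id] = new_status
        # The event is emitted only after the state change has been accepted.
        self.emit(self.EVENT_NAMES[new_status], auth_id)


emitted = []
archiver = AuthArchiver(lambda name, auth_id: emitted.append((name, auth_id)))
archiver.set_status(7, "created")
archiver.set_status(7, "succeeded")
```

Because every change goes through set_status, nothing can move an Auth from succeeded back to failed, and no state change can happen without its event.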

In its simplest form, this event architecture boils down to:

+---------+
|         |
|         V
|  +-------------+
|  E    Event    E
|  +-------------+
|         |
|         V
|  +-------------+
|  H   Handler   H
|  +-------------+
|         |
|         V
|  +-------------+   +--------------+
|  A  Archiver   A<--C  Controller  C
|  +-------------+   +--------------+
|         |
+---------+

Event
  Represents a thing that happened, usually a state change in a domain object.
Handler
  Listens for one or more kinds of events and decides what to do about them. Handlers should only listen for Events, and talk to Archivers.
Archiver
  Responsible for managing domain objects: what state changes are allowed, and emitting the appropriate events when those state changes occur.
Controller
  Could be a typical web app controller, but could also be a cron job or other batch processor. Should only talk to Archivers. In the case of a cron job, that might mean having a domain object to represent a run of a given job, which would fire an event when it's created, thus kicking off a cascade of work…

This seems like a lot of work for such a simple example. But as usual, the Real World has a few twists in store for us (and we never know them all up-front). It turns out our upstream provider sometimes does fraud checks on auths, so they could return pending_fraud_check. In that case, the AuthCreatedHandler would just persist the pending response, emit an AuthPendingFraud event, and things would likely be done. At some time later, though, the upstream payments provider will come back with an affirmative or negative answer to our request to do an auth. That would come from a new request, but our events don't care. When the Auth status is updated, it would emit an AuthSucceeded or AuthFailed event just as if it had occurred in the synchronous case. And therein lies the magic.

Our payments won't exist in a vacuum. There might be a shipment that needs to wait for the payment to go through, or this might be a payment preceding a disbursement of funds. Regardless of why an Auth went on to succeed, most parts of the system only care that it happened.
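A small sketch of why the asynchronous fraud-check path needs no special treatment downstream. All names here (update_auth, provider_callback, the shipped list) are hypothetical, and a module-level dictionary stands in for storage; the point is only that both paths go through the same status update and therefore emit the same event.

```python
from collections import defaultdict

handlers = defaultdict(list)   # event name -> list of handlers
auth_status = {}               # auth_id -> status
shipped = []                   # auth ids whose shipment was released

EVENT_NAMES = {
    "succeeded": "AuthSucceeded",
    "failed": "AuthFailed",
    "pending_fraud_check": "AuthPendingFraud",
}


def emit(name, auth_id):
    for handler in list(handlers[name]):
        handler(auth_id)


def update_auth(auth_id, status):
    """The single place where Auth status changes and events are emitted."""
    auth_status[auth_id] = status
    emit(EVENT_NAMES[status], auth_id)


def auth_created_handler(auth_id, provider_response):
    # Synchronous path: persist whatever the provider said, including "pending".
    update_auth(auth_id, provider_response)


def provider_callback(auth_id, final_status):
    # Asynchronous path: a later request from the provider with the verdict.
    update_auth(auth_id, final_status)


# A downstream listener that only cares that the auth succeeded.
handlers["AuthSucceeded"].append(shipped.append)

auth_created_handler(1, "succeeded")             # immediate success
auth_created_handler(2, "pending_fraud_check")   # nothing ships yet
shipped_before_callback = list(shipped)
provider_callback(2, "succeeded")                # verdict arrives later
```

The shipment listener never learns which path an Auth took; it is registered once for AuthSucceeded and fires for both.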

We found that breaking work up into smaller, event-triggered pieces of logic made each bit easier to understand and test. They also tended to map directly onto our business stories, which led to a few stories taking a lot less time than we'd expected. "Oh, that…just works!"

As a bonus of wiring up all the communications in the system in such a structured way, we extracted graphs of how an event could cascade through the system, and generated Graphviz diagrams in the documentation to reflect what the code could actually do. Because a handler could take one of several actions based on an event (it could even choose to ignore it), these diagrams don't show what the system will do, just what it could do. We didn't get as far as providing "before" and "after" graphs showing the effect of a code change, but it was on my mind.
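One way such a diagram could be produced, sketched under the assumption that each handler declares which events it listens for and which events it might cause. The HANDLERS table and to_dot function are illustrative and are not the tooling described above; the output is plain Graphviz DOT text.

```python
# Hypothetical declaration: handler -> (events listened for, events it may cause).
HANDLERS = {
    "PaymentCreatedHandler": (["PaymentCreated"], ["AuthCreated"]),
    "AuthCreatedHandler": (
        ["AuthCreated"],
        ["AuthSucceeded", "AuthFailed", "AuthPendingFraud"],
    ),
}


def to_dot(handlers):
    """Render the could-happen graph as Graphviz DOT source."""
    lines = ["digraph events {"]
    for handler, (listens, may_cause) in sorted(handlers.items()):
        lines.append(f'  "{handler}" [shape=box];')
        for event in listens:
            lines.append(f'  "{event}" -> "{handler}";')
        for event in may_cause:
            # Dashed: the handler could cause this event, but might not.
            lines.append(f'  "{handler}" -> "{event}" [style=dashed];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(HANDLERS)
```

The resulting text can be saved to a file and rendered with the dot command-line tool, for example dot -Tpng events.dot -o events.png.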