Taavi's Blog


Event Architecture, a first take

2014-06-21T16:15:00-0400 | categories: programming

Events

Things don't matter unless they happen. At its core, everything is reactive.

This isn't new. Martin Fowler's written about event-based systems, and LinkedIn has a fantastic blog post about The Log: What every software engineer should know about real-time data's unifying abstraction. What follows is just my take on it from a RESTful web API that I worked on which evolved this way. We didn't even get as far as distributed logs or even having a message broker; we just had an in-memory list with events. Yet even that most basic implementation made our lives easier and documentation (for stakeholders) better.

An example in payments

More than just "things happening", it makes sense to think of a resource's life cycle. This can be embodied by a "status" field which defines its current state, which also implies any pending work.

Let's pretend we're doing credit card payments. They go through various processes: an auth, maybe later a capture, and possibly a refund. Really, we have 4 resources then:

  1. Payment (a container for the other 3)
  2. Auth (a payment probably always has an auth that either failed or succeeded)
  3. Captures (zero or more; a capture could fail upstream for some reason)
  4. Refunds (zero or more; same as captures)

When the user creates a Payment, it should contain all the information needed to continue processing to a sensible resting place. In an imperative situation, perhaps you'd accept a payload, then talk to your upstream payment provider, and then eventually save their response to a local database.

LMETFY (let me eventify that for you)

In an event model, all a controller has to do is accept the Payment request, leave it as created, and let the cascade of events and handlers do the rest. On creating the Payment (and persisting it to disk, possibly as part of an uncommitted database transaction), you'd emit a PaymentCreated event. As far as an HTTP controller is concerned, its job is done now, except for allowing any interested handlers to do their jobs before replying with the final result.

We could have a PaymentCreatedHandler which listens for PaymentCreated and creates an Auth. The Auth would emit an AuthCreated event, which the AuthCreatedHandler would receive; that handler would actually go out to the payment provider and do the auth. If the call fails to return, we can notice that the Auth has been sitting in the created state for a while, and we can try to figure out why. But under normal circumstances the upstream auth would succeed or fail. When the handler gets this response it persists it into the Auth record, which would emit an AuthSucceeded or AuthFailed event. In this simple system there probably isn't anything listening for either event, so no further work occurs. Maybe the Payment wants to update an aggregate status though, so it could listen for AuthSucceeded and AuthFailed.

So far I've described events and handlers, but there's a third useful abstraction lurking, and we ended up calling it an Archiver. You might also know it as a Repository (from Eric Evans' book on Domain Driven Design) or as a Data Access Object (DAO). The Archiver's responsibility is to ensure that the domain objects' state changes appropriately, and that the relevant events are emitted on those changes.

In its simplest form, this event architecture boils down to:

+---------+
|         |
|         V
|  +-------------+
|  E    Event    E
|  +-------------+
|         |
|         V
|  +-------------+
|  H   Handler   H
|  +-------------+
|         |
|         V
|  +-------------+   +--------------+
|  A  Archiver   A<--C  Controller  C
|  +-------------+   +--------------+
|         |
+---------+

Event: Represents a thing that happened, usually a state change in a domain object.

Handler: Listens for one or more kinds of events and decides what to do about them. Handlers should only listen for Events, and talk to Archivers.

Archiver: Responsible for managing domain objects: what state changes are allowed, and emitting the appropriate events when those state changes occur.

Controller: Could be a typical web app controller, but could also be a cron job or other batch processor. Should only talk to Archivers. In the case of a cron job, that might mean having a domain object to represent a run of a given job, which would fire an event when it's created, thus kicking off a cascade of work…
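
To make the four roles concrete, here is a minimal Python sketch of the payment example. It is not the code from the system described above: EventBus, the Archiver constructor arguments, the transitions table, and fake_provider_auth are hypothetical names invented for illustration, and the "provider" is a stub that simply declines amounts over 100.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str         # e.g. "PaymentCreated"
    resource_id: int


class EventBus:
    """In-memory event list plus a handler registry (hypothetical)."""

    def __init__(self):
        self.log = []                      # every event emitted, in order
        self.handlers = defaultdict(list)  # event name -> handlers

    def subscribe(self, name, handler):
        self.handlers[name].append(handler)

    def emit(self, event):
        self.log.append(event)
        for handler in self.handlers[event.name]:
            handler(event)


class Archiver:
    """Owns one kind of domain object: allowed state changes and their events."""

    def __init__(self, kind, transitions, bus):
        self.kind = kind                # "Payment", "Auth", ...
        self.transitions = transitions  # status -> set of allowed next statuses
        self.bus = bus
        self.records = {}

    def create(self, **fields):
        record_id = len(self.records) + 1
        self.records[record_id] = dict(fields, status="created")
        self.bus.emit(Event(self.kind + "Created", record_id))
        return record_id

    def set_status(self, record_id, status):
        record = self.records[record_id]
        if status not in self.transitions.get(record["status"], set()):
            raise ValueError(f"{record['status']} -> {status} not allowed")
        record["status"] = status
        self.bus.emit(Event(self.kind + status.capitalize(), record_id))


def fake_provider_auth(amount):
    """Stand-in for the upstream payment provider (assumption, not a real API)."""
    return "succeeded" if amount <= 100 else "failed"


bus = EventBus()
payments = Archiver("Payment", {}, bus)
auths = Archiver("Auth", {"created": {"succeeded", "failed"}}, bus)


def payment_created_handler(event):
    # Handlers listen for events and talk only to Archivers.
    amount = payments.records[event.resource_id]["amount"]
    auths.create(payment_id=event.resource_id, amount=amount)


def auth_created_handler(event):
    auth = auths.records[event.resource_id]
    auths.set_status(event.resource_id, fake_provider_auth(auth["amount"]))


bus.subscribe("PaymentCreated", payment_created_handler)
bus.subscribe("AuthCreated", auth_created_handler)

# The "controller" only talks to an Archiver; the cascade does the rest.
payment_id = payments.create(amount=42)
event_names = [event.name for event in bus.log]
print(event_names)
```

Dispatch here is synchronous and in-process, so the whole cascade has finished by the time create() returns. A message broker or distributed log would change that, but the roles would stay the same.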

This seems like a lot of work for such a simple example. But as usual, the Real World has a few twists in store for us (and we never know them all up-front). It turns out our upstream provider sometimes does fraud checks on auths, so they could return pending_fraud_check. In that case, the AuthCreatedHandler would just persist the pending response, emit an AuthPendingFraud event, and things would likely be done. At some time later, though, the upstream payments provider will come back with an affirmative or negative answer to our request to do an auth. That would come from a new request, but our events don't care. When the Auth status is updated, it would emit an AuthSucceeded or AuthFailed event just as if it had occurred in the synchronous case. And therein lies the magic.

Our payments won't exist in a vacuum. There might be a shipment that needs to wait for the payment to go through, or this might be a payment preceding a disbursement of funds. Regardless of why an Auth went on to succeed, most parts of the system only care that it happened.
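
A small sketch of that point, assuming a single Archiver-style function that every status change goes through. The names set_auth_status, EVENT_FOR_STATUS, and the shipped list are hypothetical; only the status and event names come from the description above.

```python
events = []      # in-memory event list
listeners = {}   # event name -> list of callables
auths = {}       # auth id -> current status
shipped = []     # auth ids whose shipments were released

# Hypothetical mapping from a stored status to the event it emits.
EVENT_FOR_STATUS = {
    "pending_fraud_check": "AuthPendingFraud",
    "succeeded": "AuthSucceeded",
    "failed": "AuthFailed",
}


def set_auth_status(auth_id, status):
    """Persist the new status, then emit the matching event to any listeners."""
    auths[auth_id] = status
    name = EVENT_FOR_STATUS[status]
    events.append((name, auth_id))
    for listener in listeners.get(name, []):
        listener(auth_id)


# A downstream handler that only cares that the auth succeeded.
listeners["AuthSucceeded"] = [shipped.append]

# Synchronous case: the provider answers immediately.
set_auth_status(1, "succeeded")

# Asynchronous case: the provider says "pending" first, and a later,
# separate request delivers the final answer.
set_auth_status(2, "pending_fraud_check")
set_auth_status(2, "succeeded")

print(shipped)
```

Both auths end up in shipped, and the shipment listener never has to know which path produced the AuthSucceeded event.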

We found that breaking work up into smaller, event-triggered pieces of logic made each bit easier to understand and test. They also tended to map directly from our business stories, which led to a few stories taking a lot less time than we'd expected. "Oh, that…just works!"

As a bonus to wiring up all the communications in the system in such a structured way, we extracted graphs of how an event could cascade through the system, and generated graphviz diagrams in the documentation to reflect what the code could actually do. Because a handler could have one of several actions based on an event (it could even choose to ignore it), these diagrams don't show what the system will do, just what it could do. We didn't get as far as providing "before" and "after" graphs showing the effect of a code change, but it was on my mind.
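
How those graphs were extracted isn't covered here, so the following is only one possible approach. It assumes each handler declares the events it listens for and the events its actions may cause; the HANDLERS table and its "listens" and "may_emit" keys are hypothetical. The result is Graphviz DOT source describing what could happen, not what will.

```python
# Hypothetical declarations; handler and event names follow the payment example.
HANDLERS = {
    "PaymentCreatedHandler": {
        "listens": ["PaymentCreated"],
        "may_emit": ["AuthCreated"],
    },
    "AuthCreatedHandler": {
        "listens": ["AuthCreated"],
        "may_emit": ["AuthSucceeded", "AuthFailed", "AuthPendingFraud"],
    },
}


def cascade_edges(handlers):
    """Return (source, target) pairs: event -> handler and handler -> event."""
    edges = []
    for handler, spec in sorted(handlers.items()):
        for event in spec["listens"]:
            edges.append((event, handler))
        for event in spec["may_emit"]:
            edges.append((handler, event))
    return edges


def to_dot(edges):
    """Render the edges as a Graphviz digraph in DOT syntax."""
    lines = ["digraph cascade {"]
    for source, target in edges:
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


edges = cascade_edges(HANDLERS)
dot_source = to_dot(edges)
print(dot_source)
```

The printed text can be saved to a file and rendered with Graphviz's dot tool. Diffing the edge lists from two revisions of the declarations would be one way to approach the "before" and "after" graphs.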


Egg Peggery, Shoulds, and Reality

2011-08-09T00:21:56-0400 | categories: debugging, programming

Last Saturday I attended a learn-in for Lernanta. Lernanta is the new project that runs p2pu.org (forked from Batucada which runs Mozilla's drumbeat.org). After getting the development environment set up (it's easy in an Ubuntu VM), I picked my first ticket: allow password resets via username as well as email address.

After looking at the source to see how password resets work, and asking on IRC regarding how it should work (Are usernames and email addresses disjoint sets? No.), I went to write a failing test. But on running the test suite, a full half of the tests broke with fairly catastrophic errors:

Traceback (most recent call last):
  File "/mnt/host/lernanta/../lernanta/apps/users/tests.py", line 63, in test_unauthenticated_redirects
    response = self.client.get(full)
…
  File "/mnt/host/lernanta/../lernanta/urls.py", line 5, in <module>
    admin.autodiscover()
  File "/home/taavi/lernanta-env/lib/python2.7/site-packages/django/contrib/admin/__init__.py", line 26, in autodiscover
    import_module('%s.admin' % app)
  File "/home/taavi/lernanta-env/lib/python2.7/site-packages/django/utils/importlib.py", line 35, in import_module
    __import__(name)
  File "/mnt/host/lernanta/../lernanta/apps/drumbeat/admin.py", line 21, in <module>
    Tag, Resource, Vote, Site])
  File "/home/taavi/lernanta-env/lib/python2.7/site-packages/django/contrib/admin/sites.py", line 112, in unregister
    raise NotRegistered('The model %s is not registered' % model.__name__)
NotRegistered: The model Group is not registered

Searching for NotRegistered didn't turn up anything useful. Most errors people tended to see had to do with doubly-registering models, and those didn't appear related to the problem at hand.

Looking at the admin modules in Lernanta turned up something interesting. The drumbeat app was trying to unregister models like Group. Apparently this is because—in the context of drumbeat—those bits of django.contrib.auth aren't interesting and just contribute visual clutter. But the admin worked via the web interface, just not in tests. Why would things be defined properly in production use, but not in test? I searched for information about INSTALLED_APPS ordering, but the only messages I could find indicated that order shouldn't matter. But I had a feeling it did anyway. How could it not, given the code actually in the various admin.py files? Paul confirmed that order matters.

So I started dumping the order of loading the various admin modules in django.contrib.admin.__init__.autodiscover:

diff --git a/django/contrib/admin/__init__.py b/django/contrib/admin/__init__.py
index 2597414..26db254 100644
--- a/django/contrib/admin/__init__.py
+++ b/django/contrib/admin/__init__.py
@@ -27,6 +27,8 @@ def autodiscover():
     from django.utils.importlib import import_module
     from django.utils.module_loading import module_has_submodule

+    import pprint
+    pprint.pprint(settings.INSTALLED_APPS)
     for app in settings.INSTALLED_APPS:
         mod = import_module(app)
         # Attempt to import the app's admin module.

Lernanta's python manage.py test command uses nose under the covers, which captures logging and standard out while running tests and will print the contents on failure. I also used the -x flag to stop on first failure, so I didn't have to wait or wade through dozens of failures.

Through this I found that the order of settings.py was being changed! Once we had that figured out, Zuzel quickly pinpointed the problem in django-nose introduced on July 19th where INSTALLED_APPS is cast into a set(). I'm a bit embarrassed that I didn't find that reference myself (I'd been suspecting a rogue call to set()), but my experience with pip (used to install Lernanta's dependencies) is limited at this point, and I never expected to find code in lernanta-env/src!
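
To illustrate why the cast matters: a Python set keeps its members but makes no promise about their order, so code that depends on the order of INSTALLED_APPS (such as an admin.py that unregisters a model another app registers) can break. The app list below is illustrative, not Lernanta's real settings. dedupe_keep_order is a hypothetical helper, and it assumes the goal was to remove duplicates, which is a guess: the reason django-nose called set() isn't stated here.

```python
# Illustrative app list (assumption), ordered so that django.contrib.auth's
# admin registers Group before a later app's admin tries to unregister it.
INSTALLED_APPS = (
    "django.contrib.auth",
    "django.contrib.admin",
    "drumbeat",
    "users",
)

# Same members, but iteration order is no longer guaranteed to match.
as_set = set(INSTALLED_APPS)


def dedupe_keep_order(apps):
    """Drop duplicates while keeping the first occurrence of each app in place."""
    seen = set()
    ordered = []
    for app in apps:
        if app not in seen:
            seen.add(app)
            ordered.append(app)
    return tuple(ordered)


deduped = dedupe_keep_order(INSTALLED_APPS + ("users",))
print(deduped)
```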

Rolling back to an older version of django-nose fixed the test failures. And at this point there's a new version of django-nose that doesn't suffer from this problem.

Which brings me to egg peggery. If you're writing a Python library, you very probably don't want to peg your requirements to specific versions, because your consumer might need something different. But if you're writing an end-user app (like Lernanta), I highly suggest pegging all versions of all the dependencies in your virtualenv. pip encourages you to peg your dependencies' versions! If you don't, you will have no guarantee that installing your app next week will still work. The reality is that things change, sometimes breaking your assumptions. Don't assume more than you have to.

Explicit is better than implicit!


Beautiful, terse code

2011-04-19T15:10:00-0400 | categories: programming

I recently read Revisiting "Tricky When You Least Expect It" and it reminded me of the last chapter in Beautiful Code ("Writing Programs For “The Book”", p539) and of my own experience trying to refine the code describing a job spreader.

In all three cases, there was a simple, intuitive model for how to calculate a thing (or how to move data around). However, I've noticed that the simplest, most straightforward answers tend not to tell a story because they get right to the point.

I think there's something to be said for telling a story, even if it makes things more verbose, but at some point you just need to express a computation. The question is: at what level do you do that? Naturally, context matters a lot. So what's our context?

For these three examples, the context is something so small that once you've written it at this lowest level, it doesn't make sense to break it apart any more. You can understand what it does from its public behaviour, and if you need to know exactly how it does that, you'll have to read the code and really understand why it's implemented that way. Going to why is a different level of abstraction from what. It's why we write comments (which are obviously missing from the examples below).

A while ago when I started reading about TDD and modern unit-testing, I thought there should be a test for every method, public AND private on a class. Testing the privates naturally involves subclass stubs or mocks (or in Python, just ignoring the underscores), and it did seem a bit of work. A co-worker pointed out that you really shouldn't need to test the internals, unless they're actually that complicated…and if they're that complicated, why aren't they part of some other class' public interface?

So I feel this tension between writing things with good obvious names and descriptions (like the high-level picture of the job spreader) directly as code, versus the example below which has so few moving parts that it doesn't make sense to unit test anything smaller than the entire method (assuming that the Python library is sufficiently tested). But you really have to sit and think about the smaller end.

And maybe this whole discussion is for naught because we should just comment these opaque gems. But at some point, someone will still want to understand the mechanics, and that will take a lot more thought.

angle_diff(Begin, End) ->
   (End - Begin + 540) rem 360 - 180.
Example smallest angle difference function from Revisiting "Tricky When You Least Expect It"
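
For comparison, a Python rendering of the same expression, with the comment the original lacks. This is a translation for illustration, not code from the referenced article. One difference worth knowing: Python's % always returns a non-negative result for a positive modulus, whereas Erlang's rem takes the sign of the dividend; the two agree whenever both angles lie in [0, 360), because the left operand is then positive.

```python
def angle_diff(begin, end):
    """Smallest signed difference in degrees from begin to end, in [-180, 180).

    Adding 540 (360 + 180) shifts the raw difference so that, after taking
    it modulo 360, subtracting 180 recentres the result around zero.
    """
    return (end - begin + 540) % 360 - 180


print(angle_diff(350, 10))
print(angle_diff(10, 350))
```

The two calls cross the 0/360 boundary in opposite directions and give 20 and -20 respectively.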
(defun area-collinear (px py qx qy rx ry)
  (= (* (- px rx) (- qy ry))
     (* (- qx rx) (- py ry))))
Example co-linearity function from Beautiful Code, p549
from itertools import chain, ifilter, izip_longest
def spreader_generator(blockpool, spread):
    sentinel = object()
    blockpool_iter = iter(blockpool)
    feeders = [chain.from_iterable(blockpool_iter) for _ in range(spread)]
    stripes = izip_longest(*feeders, fillvalue=sentinel)
    flattened_spread = chain.from_iterable(stripes)
    not_sentinel = lambda x: x is not sentinel
    return ifilter(not_sentinel, flattened_spread)
Example job spreader
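
The listing above is Python 2 (ifilter and izip_longest no longer exist in itertools). Here is a Python 3 port with comments and a tiny usage example, offered as a sketch: the logic is unchanged apart from swapping the lambda and ifilter for a generator expression, and the sample block pool of strings is made up for illustration.

```python
from itertools import chain, zip_longest


def spreader_generator(blockpool, spread):
    sentinel = object()
    blockpool_iter = iter(blockpool)
    # Every feeder pulls whole blocks from the same shared iterator, so a
    # feeder that finishes its block picks up the next unclaimed one.
    feeders = [chain.from_iterable(blockpool_iter) for _ in range(spread)]
    # Take one job from each feeder in turn; exhausted feeders are padded
    # with the sentinel until every feeder has run dry.
    stripes = zip_longest(*feeders, fillvalue=sentinel)
    flattened_spread = chain.from_iterable(stripes)
    # Drop the padding so only real jobs come out.
    return (job for job in flattened_spread if job is not sentinel)


# Three blocks of jobs ("a" jobs, "b" jobs, "c" jobs) spread across two feeders.
jobs = list(spreader_generator(["aaa", "bb", "cccc"], 2))
print("".join(jobs))
```

With a spread of 2 this prints ababacccc: the first two blocks are interleaved, the third block takes over when the "bb" feeder runs dry, and once the pool is empty the remaining feeder drains alone.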