Log Analysis
with IPython, matplotlib, etc.

Taavi Burns @jaaaarel
Defect Poacher & Futurist at FreshBooks

What kind of log analysis?

I'll show two things I've done with web application server logs.

Outline

tools
logs
analysis
pandas

What is IPython?

IPython provides a rich toolkit to help you make the most out of using Python.

from http://ipython.org/


What is IPython?

Default IPython startup and prompt.


IPython Notebook?

The IPython Notebook consists of two related components.

from http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html


IPython Notebook?

Example screenshot of IPython Notebook in Chrome.


IPython Notebook prerequisites

Not too bad.

$ brew install zeromq
$ ARCHFLAGS="-arch i386 -arch x86_64" \
    pip install pyzmq
$ pip install ipython

unless you're on Snow Leopard with Xcode 4 (and need the ARCHFLAGS)


What is matplotlib?

matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

from http://matplotlib.sourceforge.net/
and don't forget to pip install matplotlib!


What is matplotlib?

matplotlib can be used in python scripts, the python and ipython shell (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.

aka pylab


All together now

$ ipython notebook --pylab inline

'ipython notebook --pylab inline' showing a sinc graph

logs

Parsing log files


2012-07-24T00:12:30-04:00 ordapp04 evolve[9053]: 5101 INFO [evolve.service.responder:92] Request ID: 1234567890abcdef1234567890abcdef, Process Accounting: 0.053691s 4user 0sys 1102692352vsize 210985rss (net 0vsize 0rss), Credentials(12345, 1, -1, 3, '127.0.0.1', '12345123451234512345', 1234567890abcdef1234567890abcdef, datetime.datetime(2012, 7, 23, 23, 12, 30, tzinfo=), None) - invoice_InvoiceResource - search(([u'lines', u'direct_links', u'contacts'], {'per_page': u'100', 'page': u'1'}, u'invoice_id_desc', {u'statusids': [u'0', u'2', u'3', u'6', u'7']}), {}) - rows: 3 - queries: 6 - query_time: 0.0169

2012-07-24T00:12:30-04:00 ordapp04 evolve[9053]: 5101 INFO [evolve.lib.base_controller:92] Request ID: 1234567890abcdef1234567890abcdef, Process Accounting: 0.062353s 4user 0sys 1102692352vsize 210985rss (net 0vsize 0rss), Full Request


Which leads us to write

import re

full_request_text = r'\s+'.join([
        r'^(?P<year>\d{4})-(?P<month>\d\d)-(?P<day>\d\d)T'
            r'(?P<hour>\d\d):(?P<minute>\d\d):(?P<second>\d\d)'
            r'(?P<tzdir>[-+])(?P<tzhour>\d\d):(?P<tzminute>\d\d)',
        r'(?P<host>\w+)',
        r'(?P<application>\w+)(?:\[(?P<pid>\d+)\]:)?',
        r'(?P<port>\d+)',
        r'(?P<log_level>\w+)',
        r'\[(?P<module>[.\w]+):(?P<line>\d+)\]',
        r'Request ID: (?P<request_id>[0-9a-f]{32}),',
        r'Process Accounting:',
        r'(?P<elapsed>\d+(\.\d+)?)s',
        r'(?P<user>-?\d+)user',
        r'(?P<sys>-?\d+)sys',
        r'(?P<vsize>-?\d+)vsize',
        r'(?P<rss>-?\d+)rss',
        r'\(net (?P<net_vsize>\d+)vsize (?P<net_rss>\d+)rss\),',
        r'Full Request$'
    ])

full_request_re = re.compile(full_request_text)
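Applied to the second sample line above, the pattern pulls out every field by name (the pattern is repeated here so the snippet stands alone):

```python
import re

full_request_text = r'\s+'.join([
    r'^(?P<year>\d{4})-(?P<month>\d\d)-(?P<day>\d\d)T'
        r'(?P<hour>\d\d):(?P<minute>\d\d):(?P<second>\d\d)'
        r'(?P<tzdir>[-+])(?P<tzhour>\d\d):(?P<tzminute>\d\d)',
    r'(?P<host>\w+)',
    r'(?P<application>\w+)(?:\[(?P<pid>\d+)\]:)?',
    r'(?P<port>\d+)',
    r'(?P<log_level>\w+)',
    r'\[(?P<module>[.\w]+):(?P<line>\d+)\]',
    r'Request ID: (?P<request_id>[0-9a-f]{32}),',
    r'Process Accounting:',
    r'(?P<elapsed>\d+(\.\d+)?)s',
    r'(?P<user>-?\d+)user',
    r'(?P<sys>-?\d+)sys',
    r'(?P<vsize>-?\d+)vsize',
    r'(?P<rss>-?\d+)rss',
    r'\(net (?P<net_vsize>\d+)vsize (?P<net_rss>\d+)rss\),',
    r'Full Request$',
])
full_request_re = re.compile(full_request_text)

line = (
    "2012-07-24T00:12:30-04:00 ordapp04 evolve[9053]: 5101 INFO "
    "[evolve.lib.base_controller:92] Request ID: "
    "1234567890abcdef1234567890abcdef, Process Accounting: 0.062353s "
    "4user 0sys 1102692352vsize 210985rss (net 0vsize 0rss), Full Request"
)

m = full_request_re.match(line)
print(m.group('host'), m.group('pid'), m.group('elapsed'))
# ordapp04 9053 0.062353
```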

And this

import calendar
from time import struct_time

def add_posix_timestamp(row):
    t = struct_time((
        int(row['year']),
        int(row['month']),
        int(row['day']),
        int(row['hour']),
        int(row['minute']),
        int(row['second']),
        0,
        0,
        0
    ))

    # timegm() treats t as UTC, so undo the zone offset: a '+HH:MM' zone
    # means local time is ahead of UTC, and the offset must be subtracted.
    if row['tzdir'] == '+':
        tzdir = -1
    else:
        tzdir = 1

    seconds_offset = tzdir * (int(row['tzhour']) * 60 + int(row['tzminute'])) * 60

    row['posix_timestamp'] = calendar.timegm(t) + seconds_offset
    return row
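For example, the `2012-07-24T00:12:30-04:00` timestamp from the sample line converts like this (a standalone check of the same arithmetic):

```python
import calendar
from time import struct_time

# Wall-clock fields from '2012-07-24T00:12:30-04:00'
t = struct_time((2012, 7, 24, 0, 12, 30, 0, 0, 0))

# timegm() interprets t as UTC; the '-04:00' zone means local time is
# four hours *behind* UTC, so the offset is added back (tzdir == 1).
seconds_offset = 1 * (4 * 60 + 0) * 60

posix_timestamp = calendar.timegm(t) + seconds_offset
print(posix_timestamp)  # 1343103150, i.e. 2012-07-24T04:12:30Z
```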

namedtuple

When recording tabular data in Python, use collections.namedtuple. It generates fast code, and each instance consumes only the space of a tuple!
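For instance (the field names here are an illustrative subset of the parsed groups):

```python
from collections import namedtuple

# Hypothetical subset of the fields parsed from a log line
Record = namedtuple('Record', ['posix_timestamp', 'host', 'elapsed'])

r = Record(posix_timestamp=1343103150, host='ordapp04', elapsed=0.062353)
print(r.host, r.elapsed)  # attribute access, tuple-sized memory footprint
```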


Transferring files from host to host

Sometimes you have to munge files on one host, and dissect them on another.

JSON (or msgpack) is tempting, but you'd be surprised how effective Python's csv module is.
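A sketch of the round trip (an in-memory buffer stands in for real files here): namedtuples write out as plain tuples, and come back as strings to convert.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field subset; real records carry many more columns
Record = namedtuple('Record', ['posix_timestamp', 'host', 'elapsed'])
rows = [Record(1343103150, 'ordapp04', 0.062353)]

# Write on one host...
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(Record._fields)  # header row
writer.writerows(rows)           # namedtuples serialize as plain tuples

# ...then read back on the other host, converting strings to types
buf.seek(0)
reader = csv.reader(buf)
next(reader)  # skip header
restored = [Record(int(ts), h, float(el)) for ts, h, el in reader]
print(restored[0])
```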


Let's see some log munging!

"Demo" time…

See the Log Munging notebook.

analysis

Aggregations

Actually, this is really simple in terms of map/reduce…

We want to map each event into a temporal bucket, and reduce each bucket of data points into some aggregate value


Aggregations

Diagram of events being mapped (grouped) and reduced (aggregated)


Framework

from itertools import groupby

def plot_data_for(data, map_, reduce_):
    keys = []
    values = []
    # groupby() only groups *consecutive* items, so data must already be
    # sorted by the map_ key (log files usually are, by timestamp).
    for k, v in groupby(data, key=map_):
        keys.append(k)
        values.append(reduce_(v))
    return (keys, values)

Map

def by_minute(e):
    return e.posix_timestamp // 60

Reduce

# percentile() is from
# http://code.activestate.com/recipes/511478/
# and requires a sorted list as input

def percentile_of(field, p=0.5):
    def f(l):
        # sorted() returns the sorted list percentile() requires
        data = sorted(getattr(e, field) for e in l)
        return percentile(data, p)
    return f
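The recipe's `percentile()` isn't reproduced on the slide; a minimal stand-in with the same contract (sorted input, linear interpolation between the closest ranks) looks like this:

```python
import math

def percentile(sorted_list, p):
    # Assumes sorted_list is already sorted, as the recipe requires
    if not sorted_list:
        return None
    k = (len(sorted_list) - 1) * p
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return sorted_list[int(k)]
    # Interpolate linearly between the two closest ranks
    return sorted_list[f] * (c - k) + sorted_list[c] * (k - f)

print(percentile([0.05, 0.06, 0.10], 0.5))  # median -> 0.06
```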

Voilà!

plot(*plot_data_for(
    list_of_records,
    by_minute,
    percentile_of('elapsed', 0.5)
))

This makes it easy to filter data using list comprehensions, adjust bucket size and offsets, and compute different aggregate metrics.
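Putting the pieces together on synthetic data (the framework is redefined here so the snippet runs standalone, with `len` standing in for an aggregate function):

```python
from collections import namedtuple
from itertools import groupby

Event = namedtuple('Event', ['posix_timestamp', 'elapsed'])

def plot_data_for(data, map_, reduce_):
    keys = []
    values = []
    for k, v in groupby(data, key=map_):  # data must be sorted by key
        keys.append(k)
        values.append(reduce_(v))
    return (keys, values)

def by_minute(e):
    return e.posix_timestamp // 60

# Three synthetic events: two in one minute, one in the next
events = [
    Event(1343103150, 0.06),
    Event(1343103155, 0.10),
    Event(1343103215, 0.05),
]

# Requests per minute-bucket
keys, values = plot_data_for(events, by_minute, lambda v: len(list(v)))
print(keys, values)  # [22385052, 22385053] [2, 1]
```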


Prelude to a demo


More demos!

See the Visualizing notebook.


Before vs after

Back in May we started using metaclasses to make DTO instantiation faster (thanks @mpirnat!)

Fast requests stay fast, but large-data requests get faster

Showed up in the 99th percentile of all requests

But can we be more specific? Yes!


Hypothesis

invoice.list requests return the most data, using the most expensive DTOs (each invoice has lines, each line has a bunch of moneys, etc).

We expect that response time correlates with the quantity of results returned.


Demo time!

It's easier to just show it in action. :)

See the Visualizing notebook.

pandas

pandas?

I kept hearing that it's the hottest thing in data munging, so I'd be remiss not to mention it.


Yes, it works!

In fact, I managed to (mostly) rewrite all the previous examples in terms of pandas!

Again with the Snow Leopard: it ships with an ancient version of python-dateutil, so you'll have to pip install --upgrade python-dateutil before using pandas
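As a sketch of what the pandas rewrite looks like (assuming a current pandas install; `resample` replaces the hand-rolled map/reduce framework):

```python
import pandas as pd

# Three synthetic request timings, indexed by timestamp
df = pd.DataFrame(
    {'elapsed': [0.06, 0.10, 0.05]},
    index=pd.to_datetime([1343103150, 1343103155, 1343103215], unit='s'),
)

# Median elapsed time per minute: the whole framework in one line
per_minute = df['elapsed'].resample('1min').median()
print(per_minute)
```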


Caveats


Demo

Let's dive in!

See the Pandas notebook.

Presented at PyOhio 2012
with impress.js