On Counting

May 14, 2017 at 8:00PM
Caleb Doxsey

The following is based on a true story, though many details have been changed.

Auditing

It was the beginning of the month, early in the morning. As an early riser I treasured this time when most engineers were still asleep and the office was quiet, distraction-free and nearly empty.

I say nearly empty, because there are almost always business folks around at that hour. Accountants, members of the legal team, executives, even a handful of salespeople. The kind of people who had probably worked in other industries and kept schedules altogether different from the strange, pseudo-business world of startups.

I was just getting started with my day - I'd pulled down the latest code and was reviewing a bug ticket - when I noticed someone standing next to my desk. It was a member of the billing department and she was telling me there were irregularities in the audit data that they used to generate invoices.

It's not clear how my team ended up being responsible for this system. We mostly worked on APIs and the frontend. And the auditing system was an orphaned legacy system whose code hadn't been touched in years and that no one wanted to work on.

It was also the only way we made money.

Isn't it strange how that happens in software companies? How the most important projects are often the ones nobody wants to work on?

Software developers are an odd bunch. The skills needed to build software are certainly in the engineering vein, but temperamentally, software developers - especially the ones you find in fast-paced startups - are a lot more like creatives than engineers. They're often quite passionate about what they do, which can lead them to be very productive and work crazy hours for a pittance.

At least for a while. Burnout lies in wait for the passionate, hapless engineer who gets too near their limits - a lesson that usually has to be learned the hard way, especially by the young. Burnout can lead to mental collapse, social alienation, and depression. It can even kill their creative ability entirely, like a writer suffering writer's block: they'll just stare at that blank screen for hours, unable to solve the most trivial of problems anymore.

It's like the adage about overtime on an overrunning project. In the first week or two of 60-hour weeks, you'll get nearly a 50% increase in output. By the 3rd or 4th week that will drop precipitously, and you'll be lucky to break even. By the 5th week you're actually getting less done than with the original 40 hours you started with. People are still in the office, but work isn't actually getting done anymore.

The paradox is that the harder you work the harder it gets to work. Some developers never recover from burnout and grow to hate the very thing which had, at one time, brought them so much joy.

The second major downside to passion is that not everything stirs that passion. Most developers want to work on hard problems, the next big thing, a new programming language, framework or technique, or that algorithm they just read about in a paper. Fixing a bug in an ancient, decrepit code base for an application only tangentially related to the product probably isn't high on the list.

And yet, if we were honest, so much of what actually needs to get done - of what actually brings value to a company and to customers' lives - is the mundane day-to-day fixes of the middlebrow engineer.

The Problem

This was the 3rd or 4th month in a row that the auditing system failed to produce accurate data. A month was just long enough for engineers to forget when it came around again - the urgent almost always supplanting the critical - but the billing department certainly hadn't forgotten.

As soon as the previous month had finished, the auditing system ran a job - one that took hours to complete - which generated a count of every piece of social media data we had sent out the door. We billed by the tweet, the Foursquare check-in, the Facebook post, etc.

And each of these counts was of unique items. Ours was a reliable and resilient system, which meant that when there were network problems (as there all too often were over the internet) we often sent duplicate data to customers. But we didn't want to bill customers for that duplicate data.

All of our systems generated audit logs, which were stored as files in S3. The format was incredibly simple: each audit log was a newline-separated list of unique identifiers, and the filename indicated the time period it covered and the source of the social media data. An example:

2017-01-15T00-12-30--2017-01-16T00-01-24--twitter.log.gz

Billing involved counting all of these identifiers for the entire month for each customer and then charging for the number of unique items.

The original solution to this problem was built as a MapReduce job on a Hadoop cluster. All the rage at the time, the Hadoop solution proved to be quite a headache. The jobs took hours to complete and, for whatever reason, seemed to fail all the time. I can't really speak to it, because I never saw it, and only heard about it in passing.

The second solution was a simple cron job on a single node that ran a bash script. The bash script would download the files from S3, then run a one-line pipeline which looked something like this:

find *.gz | xargs -I{} echo "<(gzcat {} | sort | uniq)" | xargs echo sort -m | bash | uniq | wc -l

And that was it.

If you don't speak bash, it's best to start at the end. The oddly named wc is a command used to count things. When you pass -l to it, it counts lines. For example, a file with these three lines would return 3:

a
b
c

uniq removes adjacent duplicate lines from its input, which is why it assumes the lines are sorted. sort sorts its input, and it has a special switch, -m, which takes multiple already-sorted files and merges them together. So the script looks for any *.gz files, unzips them with gzcat, sorts and uniqs each one, then feeds those results through process substitutions (bash's <(...), essentially named pipes), which are passed as arguments to sort -m. In this way we count the number of unique lines across all the files.
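
To make that a little more concrete, here's roughly what gets built and handed to bash when there are two files (the filenames here are hypothetical):

# Roughly the generated command for two hypothetical files, a.log.gz and
# b.log.gz: each file is unzipped, sorted and de-duplicated inside a
# process substitution, and sort -m merges the already-sorted streams.
sort -m <(gzcat a.log.gz | sort | uniq) <(gzcat b.log.gz | sort | uniq)

# The merged output then flows through the tail of the original pipeline,
#   ... | uniq | wc -l
# which drops duplicates that span files and counts what's left.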

The script was elegant in its simplicity, and was apparently both more reliable and faster than the system it replaced. As the story was passed on year after year, it stood as a monument to an age when men were men and hackers were hackers - hackers who'd cut their teeth on mainframes, tape drives, and punch cards, on machines that had almost no disk space or memory and whose CPUs were slower than your average calculator. And yet the software they built stood the test of time and made a mockery of whatever latest-and-greatest, cutting-edge nonsense today's developers were churning out. It was the kind of software myth passed on to each new generation of developers as a cautionary tale about the dangers of over-engineering.

And now it didn't work anymore.

Customers had grown so large that the script couldn't really handle the load. Audit jobs started a few hours after the end of the month (to allow any stragglers to post data), and jobs might take upwards of 16 hours to complete. Invoices were due by the 10th, so ideally we'd have good data to hand off to the billing department by the 2nd or 3rd to give them enough time to prepare the invoices.

At 16 hours, that meant we only really got one retry if something went wrong. Common issues were that the machine might run out of disk space or memory, or a flurry of S3 errors might cause the job to fail. Sometimes EC2 instances just die. Once you factored in the time it took for a human operator to detect, investigate, correct, and re-run jobs, it was no surprise that we were always pressed for time.

A new system was needed - one which could give us unique counts much more quickly, on the order of minutes rather than hours, and which, ideally, could be queried on demand, so that we could identify issues before they became real problems.

The Solution(s)

So we found a whiteboard and came up with some ideas.

The first thing we decided to do was to stop saving the audit logs to S3 directly and instead write them to a Kafka queue, with another application consuming the queue and writing them to S3. This was both conceptually simpler for upstream services and allowed us to change the underlying implementation of the audit logs without having to change a dozen applications.

Using a queue like this is also more reliable, as S3 does occasionally have problems, and a Kafka queue means we'll just fall behind for a while instead of dropping the audit records. Bend, don't break.
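
Conceptually the new wiring is simple; here's a rough sketch using the stock Kafka console tools (the broker address, topic, and bucket names are all made up, and the real services obviously wouldn't shell out like this):

# Rough sketch only - broker address, topic and bucket names are made up.
# Upstream services publish one audit id per message:
echo "some-unique-id" | kafka-console-producer.sh \
  --broker-list localhost:9092 --topic audit-logs

# A single downstream consumer drains the topic and writes batches to S3,
# so the rest of the pipeline still sees the same files it always has:
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic audit-logs --timeout-ms 60000 |
  gzip | aws s3 cp - s3://audit-bucket/some-batch.log.gz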

With that in place we were free to pursue an alternative storage mechanism that could provide more efficient counting.

As it turns out, counting the number of distinct items is quite a well-studied problem in computer science. I knew of two basic solutions: linearly counting a sorted list of items (à la the bash script above) and set counting using an indexed data structure like a hash table or binary tree. (I'll come back to the more clever solutions in a bit.)

Since we wanted better than linear performance (i.e. a billion tweets shouldn't take a billion operations to count every time we asked), we explored the indexed option. We wanted to use something off the shelf, and there were a few databases we had on hand:

Redis

Redis has a set data structure for storing sets of strings. Using this, we would store one set per social media provider, per org, per month, with each unique id added via sadd. The count can then be computed with scard in constant time.
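
A minimal sketch of that with redis-cli (the key naming scheme here is made up):

# Minimal sketch; the key naming scheme is made up.
# Adding the same id twice is a no-op, so duplicates cost nothing:
redis-cli SADD audit:twitter:org42:2017-01 "id-0001"
redis-cli SADD audit:twitter:org42:2017-01 "id-0002"
redis-cli SADD audit:twitter:org42:2017-01 "id-0001"   # duplicate, ignored

# The unique count is the set's cardinality, a constant-time read:
redis-cli SCARD audit:twitter:org42:2017-01            # => 2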

The pros of Redis are simplicity and horizontal scalability. Scalability is achieved using partitioning: simply hash the set key and take it modulo the number of Redis nodes. (Alternatively, use consistent hashing or one of its cousins.)
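
A back-of-the-envelope version of that partitioning in shell (the node list and key are made up):

# Back-of-the-envelope partitioning sketch; node list and key are made up.
# Hash the set key, mod by the number of nodes, talk to the owning node.
nodes=(redis-1:6379 redis-2:6379 redis-3:6379)
key="audit:twitter:org42:2017-01"
slot=$(( $(printf '%s' "$key" | cksum | cut -d' ' -f1) % ${#nodes[@]} ))
redis-cli -h "${nodes[$slot]%%:*}" -p "${nodes[$slot]##*:}" SCARD "$key"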

The cons were cost and the lack of high availability. Everything in Redis is stored in-memory, which, especially at the time, was quite wasteful. Given that we were fine with O(minutes) to compute the counts, it seemed like a disk store would've made more sense. High availability can be achieved using something like Sentinel, though at the time I'm not sure that really existed as an option (at least we didn't know about it), and even with it, it requires a fair amount of operational setup.

Cassandra

Cassandra is an efficient, replicated, sorted key-value store. Using keys like ((social media provider, org, month), unique id), it's possible to store a very large number of items efficiently.
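
A rough sketch of that layout in CQL (the keyspace and table names are made up, and the keyspace is assumed to exist already):

# Rough CQL sketch; keyspace and table names are made up.
# (provider, org, month) is the partition key and the id is a clustering
# column, so re-inserting a duplicate id just overwrites the same row.
cqlsh -e "CREATE TABLE IF NOT EXISTS audit.unique_ids (
            provider text, org text, month text, id text,
            PRIMARY KEY ((provider, org, month), id))"

cqlsh -e "INSERT INTO audit.unique_ids (provider, org, month, id)
          VALUES ('twitter', 'org42', '2017-01', 'id-0001')"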

The pros of Cassandra were that we were already using it for other things, we knew it could handle the write load, and it has reliable replication built in. The main con was that Cassandra doesn't implement counting efficiently: it's an O(n) operation, albeit one that Cassandra can do surprisingly quickly, certainly a lot faster than the bash script.
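
The month-end count is then a scan over a single partition, something like this (same made-up names as above):

# The O(n) part: counting means scanning one (provider, org, month) partition.
cqlsh -e "SELECT COUNT(*) FROM audit.unique_ids
          WHERE provider = 'twitter' AND org = 'org42' AND month = '2017-01'"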

MySQL

We also had RDS instances available, so we considered MySQL as an option. MySQL can enforce uniqueness with an index, so it ends up being a very simple table.
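
A minimal sketch of what that table might look like (the database, table, and column names and sizes are all made up); INSERT IGNORE makes duplicate ids a no-op:

# Minimal sketch; database, table and column names/sizes are made up,
# and connection credentials are omitted.
mysql audit -e "
  CREATE TABLE IF NOT EXISTS unique_ids (
    provider VARCHAR(16) NOT NULL,
    org      VARCHAR(36) NOT NULL,
    month    CHAR(7)     NOT NULL,
    id       VARCHAR(64) NOT NULL,
    PRIMARY KEY (provider, org, month, id)
  );
  INSERT IGNORE INTO unique_ids VALUES ('twitter', 'org42', '2017-01', 'id-0001');
  SELECT COUNT(*) FROM unique_ids
   WHERE provider = 'twitter' AND org = 'org42' AND month = '2017-01';"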

The main pros here were simplicity and not having to manage our own infrastructure. The cons were that MySQL doesn't implement this kind of counting all that efficiently, and it didn't cope well with the write load.

Local Store

At the time I was familiar with a few local on-disk stores, in particular LevelDB and Kyoto Tycoon. They can implement a unique key-value store efficiently.

The pros were surprisingly good performance and simplicity. The cons were the lack of high availability and our not being altogether familiar with using them. (On-disk stores have since become far more commonplace.)

On Redundancy

I tested each of the solutions to see whether it was viable. Redis wasn't worth the cost, MySQL couldn't handle the write load, and the LevelDB solution was too scary, so we went with Cassandra, despite its not being particularly efficient at counting things. We rolled that out, and it substantially improved the performance of our end-of-month auditing jobs.

But we didn't discard the old system.

We spend a lot of time thinking about how to build reliable systems that can handle the failure of nodes, network partitions, performance degradation, and so on. But all too often we fail to account for one of the most common causes of system failure: human error. People make mistakes. Sometimes that means running the wrong command, à la the S3 outage the other day:

An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. https://aws.amazon.com/message/41926/

More often, though, it has to do with a failed change - for example, CloudFlare's particularly nasty memory leak bug:

About a year ago we decided that the Ragel-based parser had become too complex to maintain and we started to write a new parser, named cf-html, to replace it. [...] Introducing cf-html subtly changed the buffering which enabled the leakage. https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/

That particular outage is illustrative because it was indirect and the result of a latent bug in an entirely different system. Who could've seen that coming?

We may not be able to predict how systems will fail due to human error, but we can predict that they will, inevitably, fail, and design our systems accordingly.

Bearing that in mind, in this case we left the legacy system in place as a check and fallback mechanism for the new system. Since the two systems relied on entirely different mechanisms and code bases, the chances of an issue affecting both were slim. Taking those ideas further, there are other things that can be done:

  1. Not only build redundant systems, but use redundant teams. This violates DRY and seems incredibly inefficient, but it reduces risk. Eliminating redundancies in the name of efficiency is short-sighted when you factor in the reality of black swan events.
  2. We should build in continuous tests on the real system: create a fake customer, consume real data, and confirm that the audit system produces the expected result. Run this all the time so we can detect if we break it in the future. (A rough sketch follows this list.)
  3. Add in metrics to the application to analyze performance, error rates, unusual activity, etc. Collecting this data over time is crucial for root-cause analysis.
  4. Design mitigation strategies in the software itself. For example, if something happens every minute, define that through a config so that it can be changed. Are there ways to fail partially and not completely?
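
Here's a rough sketch of the continuous check from item 2, assuming the Cassandra-backed store described above; all names are made up, and the step that feeds a known number of records through the real pipeline for the fake customer is elided:

# Rough sketch of the check in item 2; all names are made up.
expected=1000   # records sent for the fake "canary" customer this month

actual=$(cqlsh -e "SELECT COUNT(*) FROM audit.unique_ids
                   WHERE provider = 'twitter' AND org = 'canary'
                     AND month = '2017-01'" |
         grep -Eo '[0-9]+' | head -n 1)   # crude parse of cqlsh's output

if [ "$actual" -ne "$expected" ]; then
  echo "audit check failed: expected $expected, got $actual" >&2
  # alert a human here
fi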

Alternatives

It turns out counting distinct items is a well-studied problem, and there are sophisticated techniques that do it far more efficiently than any of the solutions described above. For example, there's the count-min sketch:

In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

Or there's HyperLogLog, which actually exists in Redis these days:

There is a class of algorithms that use randomization in order to provide an approximation of the number of unique elements in a set using just a constant, and small, amount of memory. The best of such algorithms currently known is called HyperLogLog, and is due to Philippe Flajolet. HyperLogLog is remarkable as it provides a very good approximation of the cardinality of a set even using a very small amount of memory. In the Redis implementation it only uses 12kbytes per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 2^64 items (which seems quite unlikely). http://antirez.com/news/75
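
For what it's worth, here's what that looks like in present-day Redis (the key name is made up):

# HyperLogLog in present-day Redis; the key name is made up.
# PFADD folds ids into a ~12KB structure, PFCOUNT returns an estimate.
redis-cli PFADD hll:twitter:org42:2017-01 "id-0001" "id-0002" "id-0001"
redis-cli PFCOUNT hll:twitter:org42:2017-01    # => 2 (approximately)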

On the face of it, these would appear to be far superior options, but (1) I didn't really know about them at the time (I don't think HyperLogLog was even around), and (2) I don't think they would've been viable anyway. One could imagine counting with a 0.81% error rate and simply providing a discount to the customer for the inexactness, thereby guaranteeing that you never overbilled them. But what contract is going to contain something so esoteric? And what customer would understand why counting something was so hard?

So, to come full circle: if I were to do it again, I'd probably spend most of my time building something clever with a HyperLogLog, only to eventually cave in and resort to something inefficient, bland, and boring.