This blog post is based on a talk I gave at the Mile High Gophers Boulder Meetup on August 15, 2019. The talk is embedded below along with the notes I used. The slides are available here and the bookalyzer example application is available here.
This is not a transcription of the talk, it's the notes I used to prepare the talk.
Hi, my name is Caleb Doxsey. I’m an engineer at Bread - a startup based in NYC. But before that I worked at Datadog, a cloud-based monitoring company, where I worked on their alerting pipeline, and, crucially for tonight’s talk, Go tracing integrations for their APM product.
I should perhaps mention from the outset here that this talk is a bit slapdash, so I’m probably going to mostly read it as I prepared it… it’s also a bit quirky in places. I don’t know if it’s going to work, but hopefully there will be something in here that’s useful.
This talk will be broken into 3 parts:
First we will look at observability with logging, metrics and tracing.
Second we will dive into tracing with an example application.
And finally we will see how tracing is actually implemented in Go, and some tips and tricks I learned along the way.
Let’s start with logging. Logging is the act of printing out text to stdout or to a file. That text can then be stored or shipped to another location for analysis.
Early in Go’s development, Go received a fair amount of criticism for its lack of a debugger. I suspect most of you have used a debugger, but to clarify, debugging is the act of running a program with breakpoints and then stepping in, out, or over statements in the program as they are executed. It can be extremely helpful for diagnosing issues. So it made sense why people thought it was a major oversight.
However when it comes to maintaining large, multi-service, SaaS products, using a debugger is far less useful. You can’t usually attach yourself to a process running in production, and even when you can it’s probably not something you should do. Furthermore these services typically do many things simultaneously -- perhaps 100 requests are being handled at a time -- so the linear, sequential debugger workflow doesn’t make much sense for an application like that.
And if you’ve ever worked on applications like this, you probably know that most of your bugs aren’t all that easy to reproduce. They only happen some of the time, and you’re not quite sure what leads up to them. So using a debugger locally doesn’t tell you much.
In many ways logging is a lot like debugging. Instead of stepping through code via a user interface, we add log statements everywhere we think it would be useful to know something. Then when something goes wrong we take that log of events, and match it up with the code. With some luck we can piece together the logic and figure out what happened. And once we know hopefully we can reproduce the issue locally and fix it.
So for production systems, logging, or println debugging, actually makes a lot of sense.
Logging in Go
Go has two ways of doing logging built in to the standard library, fmt and log. You can see a couple examples here. But I would encourage you not to do logging this way, and instead use a 3rd party library.
Logging in Go: 3rd Party
Here’s a few libraries showing different approaches to logging. I’d probably recommend using the last, zerolog, which is careful to be quite efficient and avoid allocations (hence the zero).
The first, logrus, is probably the most popular, though I’ve been a bit turned off to it lately, due to the great logrus renaming fiasco of 2017… another disaster in a long line of dependency disasters, like the left-pad fiasco of 2016
Logging in Go: Logrus
What happened is the import path of the library was changed from Sirupsen/logrus (with a capital S) to sirupsen/logrus (with a lowercase S), thus breaking dozens of popular libraries.
Incidentally I love these github issues with hundreds of PR references. Someone should do a study on these kinds of changes, and the sheer amount of chaos that ensues. How many man-hours were wasted attempting to fix this? Could it have been predicted or prevented?
Anyway, back to the topic at hand, typically with logging you’re going to want to centralize these logs, so that you can find all the logs for all services in one place. A common solution to that is to use an ELK stack and you can see a screenshot of Kibana here. With this setup you can search both by text and by field and it’s a great way to correlate events across multiple systems.
Logging isn’t without its problems though. You’re likely to run into some of these:
First, you won’t have enough logs, and then after a few outages, you’ll invariably have too many, your indices will become absurdly slow and you’ll go through a few rounds of cleanup before it works again. This too much, too little, logging tennis can grow pretty tiresome, but there’s not really a great way to avoid it.
Some other common issues: logging personally identifiable information (or worse, passwords) and the need to scrub it, accidentally sending the wrong type of data to a field in elasticsearch, getting the esoteric logrotate options wrong and exhausting the disk of a node, or getting syslog settings wrong and accidentally turning low-cost log statements into expensive, blocking operations when the log-stack is down.
But despite these headaches, logging is something you should be doing.
Metrics in Go
Another approach to observability is to use metrics. Metrics are particularly useful for alerting and can track more than just your application. For example, you can measure the free disk space of servers, cloud-based metrics from services like CloudWatch, or even business metrics like sales. Metrics are extremely flexible.
For applications there are a couple ways you can go. The first is to use statsd.
Statsd has several different types of metrics: counters, gauges, histograms, etc. Here you can see how you call statsd from Go code,
and here’s a screenshot of a dashboard in Datadog.
Statsd uses an interesting UDP protocol intended to make your production code resilient to issues in your observability pipeline.
Though FWIW, in practice it’s not quite as straightforward as it may appear… Sure UDP will discard packets on the network, but the communication from your application may well result in blocking behavior, as your operating system buffers packets for sending, and once that buffer’s full, UDP calls might block.
You’ll probably never see that happen… but it could be a nasty surprise. I love it when widely deployed, fully realized technical designs intended to solve a very specific problem don’t actually solve the problem in practice, but nobody ever seems to notice.
Another approach to metrics is prometheus. Prometheus uses a pull-based model instead of a push-based model. Metrics are collected from an HTTP endpoint running in your app. This approach is well suited to Go apps, which already likely expose expvars or profile capabilities via an HTTP endpoint, so it’s just one more thing to add.
You can see an example of what that code would look like. Prometheus also has some more builtin capabilities, like the alertmanager for alerting, so it’s more of a full-featured solution than statsd. I don’t have much experience with it, but if you want an open-source solution, it’s definitely worth investigating.
Tracing in Go
And now we get to the main event: Tracing. Tracing is also sometimes called application performance monitoring. It’s sort of like logging, sort of like metrics and sort of like local profiling. It involves several concepts:
A Trace, which is a single, logical request, made up of multiple Spans A Span, which represents an operation Services which report the traces
Notably, a single trace usually involves multiple services. For example you might have a web server request which invokes an RPC method for a backend service, which in turn invokes a query against a database. That single trace might involve 3 spans coming from 3 different services.
Spans are sent to a centralized location where they can be connected via the trace ID, explored and analyzed.
Spans can also include additional metadata or tags. For example, you can record the URL that led to the request, or the SQL query that was invoked. Tracing solutions also usually have good support for flagging errors and recording stack traces (or at least error messages).
That’s all a bit abstract, so let’s take a look at an example.
I built this app called bookalyzer. It takes in Project Gutenberg URLs, downloads the books, uses a job queue to generate some toy statistics, adds data to a database, and stores the book in a local minio (that is, pseudo-s3) bucket.
The goal here was to see how a slightly complex, multi-service architecture would work with tracing, since the distributed nature of tracing is one of its most powerful capabilities.
So to recap, we took a look at logging, metrics and tracing. Why choose one over the other? Well you should probably consider using all 3. They are largely orthogonal, being geared towards different purposes…
You know that word orthogonal is kind of a weird one. I hear it a lot in the Go community. I guess it means at right angles, and I vaguely remember hearing it in geometry or linear algebra or some math class… Maybe I’m in the minority, but it’s not a word I use most of the time.
It’s actually a great example of something I see happen a lot in programming contexts. Maybe it happens in other contexts too, but we have this way of taking a difficult concept, and using an even more difficult concept to explain it.
I think my favorite example of this is the semaphore, which is used in multi-threaded programming for coordination between threads. It wasn’t until years after learning about semaphores that I discovered the term actually referred to a form of communication involving flag signaling, typically used in a maritime context…
And so we use ideas like this to explain a new idea, and the person we’re supposedly helping is just left more confused, but too scared to admit their lack of knowledge about whatever esoteric piece of trivia we offered… I will pretend to help you with an analogy, and you will pretend to understand what I’m talking about:
Oh yes, I remember, in my maritime youth, when I was but a lad, and my father taught me at his knee the ancient art of the flag semaphore. Thank you for your analogy. I totally understand multi-threaded programming now.
Anyway I digress.
You should be using all three of these tools. Logging can be vital in recreating the series of events that led up to a bug - an audit trail that can really save you in a pinch. Metrics are useful because they’re done for more than just applications and they’re particularly useful for alerting, since monitor definitions for metrics are fairly straightforward (ie, page me if we get down to 10% disk space left).
Furthermore, the combination of logging and metrics can be used for root cause analysis. You find a blip on one dashboard and it corresponds to a blip on another dashboard. The two events are almost certainly related. So much so that we sometimes called it the Law of Datadog. That is: Correlation is Causation.
And tracing provides a great, out-of-the-box, minimal setup way of seeing what your application is doing at a glance. It makes it very easy to spot potential performance problems, and what’s causing them (ie a bad sql query), as well as to see why errors are occurring.
These sorts of observability tools can be vital for maintaining production applications - the oft-overlooked secret sauce to successful SaaS products - the difference between a prototype and a real production application.
An Argument With Myself
Now wait just a second. I think maybe you’ve gone a bit too far your my praise of observability. Isn’t using tools like this just throwing in the towel? Conceding defeat that we’ll never be able to write software that works like it should… Shouldn’t we be spending more time in preventing these bugs from happening in the first place? We should be focused on tools way earlier in the process - at build time...
I can see your perspective on this, and perhaps putting some effort in earlier would be beneficial, but we need to be realistic about how software is built. These products are built by regular, journeymen engineers. They make mistakes. We need to assume that anything we launch to production is going to have problems, and these tools make it easier for us to find and fix those problems when they happen.
No, no, no… this is all wrong. I heard you say “regular, journeymen engineers”. Isn’t this precisely the problem? We need to be raising the bar here. Real engineers have standards. They take tests. They get licenses. They are certified by boards. We wouldn’t put up with a surgeon operating on us after taking a 6 week bootcamp, watching some video on youtube, or desperately searching stack overflow as he operates. Why are we putting up with this in software? Software is becoming more and more important and using tools like this just encourages the problem.
No. I’m sorry. If anything that is exactly the opposite direction from where we need to be headed. Programming is too hard. We need to make it easier. We need to lower the barrier of entry. Tools like these, and languages like Go, make successfully maintaining production systems possible for far more people.
This is ridiculous. Have we learned nothing from the history of programming? How many languages have been built in a fool-hearty attempt to make programming “easier” and “more accessible”. With a dwindling cabal of geriatric COBOL programmers, surely it’s time to take a different approach. We’d be far better off using a language like Rust, with it’s enforced memory-safety and zero-cost abstractions - a language suitable for experts; experts who will produce far better software with far less bugs.
Rust. I’m sick and tired of hearing about Rust. I see your COBOL and raise you Ada. A forgotten language that no one talks about that was just as focused on safety and correctness as Rust. And it was made decades ago and hardly ushered in an age of reliability. You’re the one who’s being incredibly naive. These are distributed systems we’re talking about. For all it’s benefits, fixing memory safety on a single machine does nothing to prevent the kind of large-scale bugs that can affect SaaS products. Thundering herds, resource contention, data races, consensus bugs. Just look at the Jepsen tests. Even the best products, built by experts in their fields, routinely fail and behave in surprising ways no one expected. Programming is hard, and distributed systems programming is even harder. All the more reason to focus on observability. It’s really our only hope…
Ok I think that’s enough of that. We’ll leave those two to themselves.
Let’s get back to looking at how tracing is done in Go.
How Tracing is Done in Go: Setup
There are two major, open-source libraries for tracing in Go: opentracing and opencensus. Opencensus is a bit more modern, but there appears to be more integrations out there for opentracing. There’s a new project to merge these divergent codebases. Hopefully the ecosystem won’t be so splintered in the future.
Tracing is turned on by first running a bit of setup code, where you can control how data is exported (ie to Jaeger as in our example), as well as options for the name of the service that’s running and sampling.
How Tracing is Done in Go: Usage
Then inside actual application code, you Start a trace and then later Finish it. This tracks how long it took and also makes it possible to create sub-spans within the trace.
You can also set metadata on the span which will be exported when the span is finished.
How Tracing is Done in Go: Context
The real key to tracing in Go though is the context package. Context provides three things:
It’s the latter capability that we need for tracing. We store the span in the context metadata. This way whenever we create a new span, we derive it from whatever is stored in the context. If there is no span in the context, we create a root span.
How Tracing is Done in Go: Context w/HTTP
The beautiful thing about context, is that if you manage to use it everywhere, it can also cross API boundaries. For example you can take the metadata stored in the context, and convert it into HTTP headers. Those HTTP headers can then be read on the other end and replicated into the context, thus allowing us to create a child span from a parent span started in an entirely different service.
How Tracing is Done in Go: Context w/GRPC
Here’s an example with gRPC. Notice how every client call and every server method has a context as the first parameter. Building APIs in this way provides an easy way to implement tracing without having to change any of the underlying code.
How Tracing is Done in Go: Integrations
One of the nice things about tracing is that a lot of the time you don’t even have to add this tracing code at all. There are a lot of existing integrations out there that allow you to add tracing early in the process, often with a single line of code, and nothing else has to change in your project.
This off-the-shelf integration capability is one of the most powerful benefits of tracing. Tracing can often be added to a quite-large application in just a few minutes, and you can immediately see the benefits without having to make large-scale changes (as you might with logging or metrics).
Let’s take a look at some of the ways we can implement integrations.
Integration Approaches: Middleware
The first way is also probably the easiest, if the library you’re working with supports it, and that’s a middleware.
In this example we’re creating a tracing library for chi, an HTTP framework that supports middlewares. A middleware takes in an http.Handler and modifies it in some way and then returns a new http.Handler. For us what that means is we start a span, run the original handler, and then finish the span. In that way every HTTP call can be traced automatically.
Integration Approaches: Interceptor
Similar to the middleware is the interceptor. In gRPC they have options you can pass to the Dial and NewServer methods.
So here’s an example of how they get used. Notice we just pass in the WithUnaryInterceptor option, and that function will be invoked on every request, which makes it easy to start a span and finish it when the RPC call completes. Furthermore because we pass a context around to all these methods, we can tie that span to the originating trace, as well as add it to the metadata headers on the gRPC request so they can make it to the server.
Both of these approaches are ideal, but they require that the library you’re attempting to build an integration for was intended to be instrumented ahead of time. Someone has to create these sorts of APIs. When they’re not available, and spoiler, they’re usually not, we have to fall back to other means.
Integration Approaches: Replacement
One rather extreme approach is to just replace the library entirely. This appears to be what the opencensus folks did for the redis libraries. They forked the repos and made the modifications they needed to add tracing.
I generally would not recommend this approach as it ends up changing all the import paths and requires a lot of maintenance to keep up with upstream changes.
Integration Approaches: Interfaces
Another approach can be done when you’re working with an interface. For example the Sarama Kafka library has a Consumer interface which has several methods. If we create a custom type that implements that interface, we can wrap the original interface and hand back our custom type, and any existing code should continue to work as it did before, because we implement all the same methods.
Integration Approaches: Embedding
Furthermore a nice trick for doing these kinds of integrations is to use embedding. In this example we can see how the original sarama Consumer has been embedded into our custom type. By doing this we can now invoke all the Consumer methods on our type without having to create new methods.
Typically we only need to change one of the methods. To do that we just attach one method to the type and that method will take precedence over the original, but all the other methods are left the same.
This is particularly important for future-proofing your integration. If they ever add new methods to the Consumer interface (a generally safe operation that doesn’t require a major version bump), you will get them automatically without having to make changes to your integration.
Integration Approaches: Version Compatibility
On that topic of future-proofing, one of the things that can bite you is when libraries change in incompatible ways. For example, the redis library changed their String method on the Cmder type from a single string return value, to two return values: a string and an error. We needed to support both versions with a single piece of code. So how do we do that?
We utilize a type assertion. We create an anonymous interface and do the check to see which version of the String method we’re dealing with.
A bit tedious, but writing code like this allows us to support more than one version of a library easily.
Integration Approaches: Embedding Craziness
And now for what may be the craziest bit of Go code I ever had the pleasure of writing.
We wanted to add an integration for the http.ResponseWriter, so that if there was a 500 status code, it would set the error tag on the metadata. So we wrapped the response writer in a similar way to how we saw earlier.
We embedded it into a new struct, and then created a new WriteHeader method.
Integration Approaches: Embedding Craziness (2)
This worked great for code that used a ResponseWriter. Unfortunately, as you may or may not know, there are other interfaces in the HTTP package which the ResponseWriter can also implement. For example there’s the Hijacker interface, which is used by websockets to convert an HTTP connection into something like a TCP connection.
You can see an example of how it works here. A type assertion is done on the response writer, and if it passes we the response is hijacked.
Unfortunately our wrapper does not implement the Hijacker interface, so that check would fail, which would mean if you used websockets with our tracer integration, suddenly it would fail. Exactly the opposite of what we’re trying to do. We want a seamless integration. So how do we fix it?
Take a look at this last bit of code. We create our wrapped response writer, but then we do a type assertion ourselves, and if it passes, we change w from our response writer to an anonymous struct which contains both a response writer and an http hijacker. This magically makes w implement both interfaces, and fixes the problem.
This code is a bit confusing, so let’s take a closer look.
Integration Approaches: Embedding Craziness (3)
Creating an anonymous struct is basically the same as create a type and then using it, as we can see here.
And remember embedded fields have their methods promoted. So if I create a struct which embeds a ResponseWriter, my struct is a ResponseWriter too, and if I invoke a ResponseWriter method, like Write, that actually calls the embedded fields method, as we can see here.
Integration Approaches: Embedding Craziness (4)
Having two embedded fields works precisely the same way. The Write method goes to the ResponseWriter Write, and the Hijack method goes to the Hijacker Hijack method.
Though what’s particularly strange in this example, is we then reassign our new multi-interface to the original variable. And because a ResponseWriterHijacker is at least a ResponseWriter, this is allowed.
Clear as mud?
Integration Approaches: Embedding Craziness (5)
But wait there’s more. There’s actually 4 additional interfaces in the http package: Hijacker, Pusher, CloseNotifier and Flusher. So how did we support those?
Well here’s the code in all it’s glory. We have to handle every possible combination of these interfaces and use a giant switch statement to create the anonymous structs. Crazy but it solves the problem.
Well I think that’s enough for tonight. Thank you for listening. Any questions?