Errors in Go

Errors in Go are a contentious topic. Everyone has an opinion on them. I’m not going to share mine here, nor am I really going to engage with the controversy around them. Instead, I’m going to accept that the tree grew the way it did, and explore how best to cut with the grain rather than fight against it.

I was discussing error handling in Go with some coworkers, and the discussion prompted me to break the problem down in ways I found useful. So I’m writing them down, both so I remember my conclusions and so I can more easily share them with people I collaborate with.

The Setting

Stop me if you’ve heard this one. You’re working on a Go service, and it’s time to bring it past a prototype. You need to surface proper HTTP responses when you encounter an error. You need to log those errors, preferably with a stack trace, in your logging service, your error reporting tool, or ideally, both. And when writing your HTTP handlers, you need to know if an error occurred, so you can stop processing. Sometimes, you need to know which error occurred, so you can recover or customize your response appropriately.

Fortunately, Go’s error type is an interface, and you can implement your own errors in your application. So you build a type that holds an HTTP status code, maybe a message to display to the end user as part of the HTTP response, and a stack trace. You start using this custom error type everywhere in your application, and set up a linter to make sure people don’t accidentally use another error. Maybe you write a helper for converting other errors into your custom error.

Suddenly, you have headaches. You want to wrap errors coming back from helper functions, but oops, some of those errors were already your custom error type, and you just overrode their HTTP status code and have the wrong stack trace and it’s a mess. Okay, you build a check into your wrapper so that errors that are already your custom error type get returned without wrapping. Except whoops, now you’re missing the added context of the message that was passed in to wrap with. Worse, you’re now seeing some mistakes that are arguably security problems. It seemed like a good idea to return a custom error with a 404 status code from the helper for looking up an account if that account couldn’t be found, but it turns out that’s a problem in auth code, where you really want a 401.
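To make those headaches concrete, here’s a minimal sketch of the kind of design I’m describing. Everything in it (appError, wrapAlways, wrapUnlessWrapped, lookupAccount) is a hypothetical name made up for illustration, and the stack trace field is left out to keep it short.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// appError is a hypothetical do-everything error: HTTP status, end-user
// message, and underlying cause in one value.
type appError struct {
	Status  int
	Message string
	Cause   error
}

func (e *appError) Error() string { return e.Message + ": " + e.Cause.Error() }
func (e *appError) Unwrap() error { return e.Cause }

// wrapAlways builds a new appError every time, so a status chosen deeper in
// the stack is hidden behind the one the caller passes in.
func wrapAlways(err error, status int, msg string) *appError {
	return &appError{Status: status, Message: msg, Cause: err}
}

// wrapUnlessWrapped returns an existing appError untouched. The inner status
// survives, but the caller's status and message are silently dropped.
func wrapUnlessWrapped(err error, status int, msg string) *appError {
	var existing *appError
	if errors.As(err, &existing) {
		return existing
	}
	return &appError{Status: status, Message: msg, Cause: err}
}

// lookupAccount guesses, deep in the stack, that a missing account is a 404.
func lookupAccount(id int) error {
	return &appError{
		Status:  http.StatusNotFound,
		Message: "account not found",
		Cause:   fmt.Errorf("no row for account %d", id),
	}
}

func main() {
	err := lookupAccount(123)

	// The helper's 404 is replaced by the caller's 500.
	a := wrapAlways(err, http.StatusInternalServerError, "loading profile")
	fmt.Println(a.Status, a)

	// Auth code wanted a 401, but gets the helper's 404 and loses its own
	// message along the way.
	b := wrapUnlessWrapped(err, http.StatusUnauthorized, "authenticating")
	fmt.Println(b.Status, b)
}
```

Neither wrapper is wrong in isolation. Each one just has to throw away something that somebody wanted.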

This isn’t working. We keep needing to be careful, and there are some mistakes it’s really easy to make that reduce the quality of our errors and make it harder to figure out what’s happening in our program.

Something’s wrong with our design.

Who Is An Error For?

One of the first things I think about when designing software is who my audiences are and what they want. Errors have two responsibilities: indicating what went wrong, and indicating the context it went wrong in. Who the error is for, and what they plan to do with it, informs what context we want to associate with the error.

Errors Are For Operators

Operators are who we generally think of when thinking about the audience for an error. They’re the people who interact with a program from outside the process boundary, generally tasked with keeping it running and working well.

Operators want errors because they want to know what broke, how it broke, and ideally, what made it break. The context an operator wants, therefore, is very zoomed in. They want to know the stack trace, the state of the program when the error occurred, and what it was trying to do.

Errors Are For End Users

End users care about errors, because end users want to know what to do next. Sometimes that means correcting the information they entered in a form. Sometimes it means checking an online archive of the site. Sometimes it means just waiting until whoever got paged can fix things.

Letting an end user know what’s going on means they can adjust their expectations and behaviors accordingly.

End users want the wider context, to put the error in perspective. They don’t care about stack traces and state. They don’t care about whatever a database constraint violation is. They just want to know the username is already taken.

End users care about what an error means, not what caused it.

Errors Are For Callers

The caller of a function has a vested interest in whether the function succeeded and, if not, the specific way in which it failed. But the callers of a function have more responsibilities than that. They’re responsible for ferrying errors to end users or operators, as well as to any callers of their own.

This is where things start to get dicey. Because an operator wants the zoomed-in “what” of an error. An end user wants the zoomed-out “so what” of an error. Callers want neither; they just want to know what state they’re in and whatever information they need to attempt an automated recovery, if possible.

A single error type can’t meet all those concerns all at once.

Error wrapping helps. You can wrap an error in the context that an operator will want, and callers can strip it away to get to the bare state. But that frustrates attempts to put the error in the wider perspective for the end user, because any callers are going to attempt to bring the error back to the zoomed-in context, for the operators. Worse, you’re attempting to put things in the wider context of, say, an HTTP status code while deep within the call stack, where that context just isn’t available to you. You can make an educated guess, at best.
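As a rough sketch of that wrapping and stripping, assuming a couple of hypothetical error types (notFoundError, queryError) and a made-up getAccount helper: the wrapping piles on context an operator would want to read, and the caller digs back through it with errors.As to reach the bare state.

```go
package main

import (
	"errors"
	"fmt"
)

// notFoundError is a hypothetical typed error describing the bare state.
type notFoundError struct {
	Kind string
	ID   int
}

func (e *notFoundError) Error() string { return fmt.Sprintf("%s %d not found", e.Kind, e.ID) }

// queryError is a hypothetical typed wrapper carrying operator context.
type queryError struct {
	Query string
	Err   error
}

func (e *queryError) Error() string { return "query " + e.Query + ": " + e.Err.Error() }
func (e *queryError) Unwrap() error { return e.Err }

// getAccount wraps the bare state twice: once in a typed error, and once more
// with fmt.Errorf and %w.
func getAccount(id int) error {
	var err error = &queryError{
		Query: "SELECT id FROM accounts WHERE id = $1",
		Err:   &notFoundError{Kind: "account", ID: id},
	}
	return fmt.Errorf("getting account %d: %w", id, err)
}

// missingID strips the wrapping away to learn whether something was not
// found and, if so, which ID was missing.
func missingID(err error) (int, bool) {
	var nf *notFoundError
	if errors.As(err, &nf) {
		return nf.ID, true
	}
	return 0, false
}

func main() {
	err := getAccount(123)
	fmt.Println(err) // the full, zoomed-in message an operator would read

	id, ok := missingID(err)
	fmt.Println(id, ok) // the bare state a caller acts on
}
```

Note that nothing in this chain has anywhere sensible to put an HTTP status code, which is the end user’s half of the problem.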

Stack traces, likewise, get messy. You’re passing around a variable with a stack trace in it, and it’s invariably getting reported from somewhere different from where the error actually occurred. You haven’t tricked Go into having exceptions, you’ve just taught your log line numbers to lie to you. God help you if you end up with multiple stack traces because of wrapping.

Why An error, Though?

At this point, I like to go back and revisit my assumptions. Is an error even the right way to communicate this information?

Why are we using an error?

errors, in Go, are just values. Why do operators and end users need values in the runtime?

They don’t.

errors Aren’t For Operators

What an operator wants is an error report, not an error value.

They’re not going to handle it at runtime. They don’t need it stored in memory. They don’t want it passed around the call stack.

What they actually want, at the root of it, is for an error condition to trigger a log line or an error report, with a bunch of context at that point in time.

That sounds like a function call to me.

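That sounds, honestly, like slog to me: a function call that reports a message, some caller-defined context, and where in the code the call was made. (Out of the box, that’s a source location when AddSource is enabled; a full stack trace is something your handler would need to capture.)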

Operators don’t want stack traces in your error variables, they want function calls at the point errors actually happened.

Everyone will hate this take because the much-reviled

if err != nil {
	return err
}

gets even more verbose, turning into

if err != nil {
	slog.ErrorContext(ctx, "failed to get account", "error", err, "account.id", 123)
	return err
}

but I call ’em like I see ’em, and the thing operators actually want here is best served by a function call at the point in the stack where it happens, not passing around a variable.

It’s probably worth being explicit here that I’m using slog.ErrorContext as an example, because it already has the function signature I want. You could have a slog.Handler that surfaces the error however your operators want it, or you could use your own function. The key point here is it’s a function, not an error.
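For what it’s worth, a slog.Handler along those lines might look something like the sketch below. The reportingHandler type and its notify hook are hypothetical, standing in for whatever error reporting tool you actually use; the only real API here is the slog.Handler interface itself.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// reportingHandler is a hypothetical slog.Handler wrapper. It passes every
// record to the next handler and also hands error-level records to notify.
// Attributes added via WithAttrs reach next, but not notify, in this sketch.
type reportingHandler struct {
	next   slog.Handler
	notify func(ctx context.Context, r slog.Record)
}

func (h reportingHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return h.next.Enabled(ctx, level)
}

func (h reportingHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level >= slog.LevelError {
		// Clone because notify may hold on to the record.
		h.notify(ctx, r.Clone())
	}
	return h.next.Handle(ctx, r)
}

func (h reportingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return reportingHandler{next: h.next.WithAttrs(attrs), notify: h.notify}
}

func (h reportingHandler) WithGroup(name string) slog.Handler {
	return reportingHandler{next: h.next.WithGroup(name), notify: h.notify}
}

// reportedMessages logs one info record and one error record through the
// wrapper and returns the messages that reached notify.
func reportedMessages() []string {
	var got []string
	h := reportingHandler{
		// AddSource makes the text handler include the file and line of the call.
		next: slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{AddSource: true}),
		notify: func(_ context.Context, r slog.Record) {
			got = append(got, r.Message)
		},
	}
	logger := slog.New(h)
	ctx := context.Background()
	logger.InfoContext(ctx, "fetched account", "account.id", 123)
	logger.ErrorContext(ctx, "failed to get account", "error", errors.New("connection refused"), "account.id", 123)
	return got
}

func main() {
	fmt.Println(reportedMessages())
}
```

The call site stays the same either way. What the operator ends up seeing is decided in one place, by the handler.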

It’s also fair to note that people need to remember to actually call these functions in error states. A counterpoint to that, though, is that people need to remember to inject the stack trace into the error when returning it. If you can have a linter for one, you can have a linter for the other.

Likewise, it’s fair to note that you’ll get multiple error reports for the same error condition, as the error travels back up the call stack. But that’s easily solvable, and without the kind of compromises that deduplicating stack traces involves:

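import (
	"context"
	"errors"
	"log/slog"
)

// reportedError marks an error that has already been reported to operators.
type reportedError struct {
	error
}

// Unwrap keeps the original error visible to errors.Is and errors.As, so
// callers can still detect the state they're in.
func (e reportedError) Unwrap() error {
	return e.error
}

// report logs err with its context, unless something in its chain has
// already been reported.
func report(ctx context.Context, msg string, err error, args ...any) {
	var alreadyReported reportedError
	if errors.As(err, &alreadyReported) {
		return
	}
	args = append([]any{"error", err}, args...)
	slog.ErrorContext(ctx, msg, args...)
}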

func getAccount(ctx context.Context, accountID int) error {
	err := queryDatabase(ctx, accountID)
	if err != nil {
		report(ctx, "failed to get account", err, "account.id", accountID)
		return reportedError{err}
	}
	return nil
}

As long as after each call to report the error is turned into a reportedError, you’ll only report it once, no matter how deep in the stack you are. Again, people need to remember to do it, but if you can lint for only constructing errors a certain way, you can lint for only returning them a certain way. If you get clever enough with the code, and do enough magic, you can probably have report do the wrapping itself.

Or, heck, you could take the easy way out:

// report logs err once and returns it marked as reported.
func report(ctx context.Context, msg string, err error, args ...any) error {
	var alreadyReported reportedError
	if errors.As(err, &alreadyReported) {
		// Already reported further down the stack; pass it along untouched.
		return err
	}
	args = append([]any{"error", err}, args...)
	slog.ErrorContext(ctx, msg, args...)
	return reportedError{err}
}

func getAccount(ctx context.Context, accountID int) error {
	err := queryDatabase(ctx, accountID)
	if err != nil {
		return report(ctx, "failed to get account", err, "account.id", accountID)
	}
	return nil
}
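To check that this holds up across more than one layer, here’s a self-contained sketch that follows an error through two calls to report. The errNoRows sentinel, the countingHandler, and the run helper are all hypothetical scaffolding; reportedError here carries an Unwrap method so callers can still match the underlying error with errors.Is.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// errNoRows is a hypothetical sentinel standing in for a database error.
var errNoRows = errors.New("no rows")

type reportedError struct{ error }

// Unwrap lets errors.Is and errors.As see through the marker.
func (e reportedError) Unwrap() error { return e.error }

func report(ctx context.Context, msg string, err error, args ...any) error {
	var alreadyReported reportedError
	if errors.As(err, &alreadyReported) {
		return err
	}
	args = append([]any{"error", err}, args...)
	slog.ErrorContext(ctx, msg, args...)
	return reportedError{err}
}

// queryDatabase reports at the deepest point the failure is known about.
func queryDatabase(ctx context.Context, accountID int) error {
	return report(ctx, "query failed", errNoRows, "account.id", accountID)
}

// getAccount adds context and calls report again; the second call is a no-op.
func getAccount(ctx context.Context, accountID int) error {
	if err := queryDatabase(ctx, accountID); err != nil {
		wrapped := fmt.Errorf("getting account %d: %w", accountID, err)
		return report(ctx, "failed to get account", wrapped, "account.id", accountID)
	}
	return nil
}

// countingHandler is a minimal slog.Handler that only counts records.
type countingHandler struct{ n *int }

func (h countingHandler) Enabled(context.Context, slog.Level) bool  { return true }
func (h countingHandler) Handle(context.Context, slog.Record) error { *h.n++; return nil }
func (h countingHandler) WithAttrs([]slog.Attr) slog.Handler        { return h }
func (h countingHandler) WithGroup(string) slog.Handler             { return h }

// run returns how many reports were logged and whether the caller can still
// detect errNoRows in the returned error.
func run() (int, bool) {
	n := 0
	slog.SetDefault(slog.New(countingHandler{n: &n}))
	err := getAccount(context.Background(), 123)
	return n, errors.Is(err, errNoRows)
}

func main() {
	fmt.Println(run())
}
```

The operator gets one report, from the deepest frame that called report, and the caller still gets an error it can inspect.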

errors Aren’t For End Users

An end user doesn’t actually want an error value either. Like an operator, they also want an error report—just the opposite kind of error report.

An end user doesn’t want the error report authored at the point in the call stack where the error occurred; the end user wants the error report authored at the point in the call stack that understands their request. They want the handler to write the error report, because the handler has the relevant information to guide them to next steps and to indicate precisely what about their request went wrong.

The end user wants the handler to detect specific error states, and translate them into something they can understand.

My preferred way to do this is with a small package I wrote called apidiags, which is really just a struct definition and some helpers. It’s based incredibly heavily off of the zcl.Diagnostic type that Martin Atkins wrote for ZCL. But honestly, whatever your scheme for communicating an error to the end user is, it doesn’t belong wrapped in an error type. It’s probably a function call that, e.g., writes a set of bytes over a wire, or renders text to standard error, or whatever. Or it’s setting a value on some response state that will be written to the user later. But importantly, that value on the response state is not an error, and nobody expects to be able to interact with it like one.
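A minimal sketch of that handler-side translation, under some loud assumptions: the diagnostic struct below is a hypothetical stand-in and not the apidiags or zcl.Diagnostic API, and errUsernameTaken, createUser, and the JSON response shape are invented for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errUsernameTaken is a hypothetical error state a caller can detect.
var errUsernameTaken = errors.New("username taken")

// diagnostic is a hypothetical end-user error report. It is deliberately not
// an error, and is not the apidiags type.
type diagnostic struct {
	Summary string `json:"summary"`
	Field   string `json:"field,omitempty"`
}

// createUser stands in for application code deeper in the stack.
func createUser(username string) error {
	if username == "admin" {
		return fmt.Errorf("inserting user %q: %w", username, errUsernameTaken)
	}
	return nil
}

// writeDiagnostics is the function call that reports to the end user.
func writeDiagnostics(w http.ResponseWriter, status int, diags ...diagnostic) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(diags)
}

// handleCreateUser understands the request, so it decides what each error
// state means for the end user.
func handleCreateUser(w http.ResponseWriter, r *http.Request) {
	err := createUser(r.URL.Query().Get("username"))
	switch {
	case err == nil:
		w.WriteHeader(http.StatusCreated)
	case errors.Is(err, errUsernameTaken):
		writeDiagnostics(w, http.StatusConflict, diagnostic{
			Summary: "That username is already taken.",
			Field:   "username",
		})
	default:
		writeDiagnostics(w, http.StatusInternalServerError, diagnostic{
			Summary: "Something went wrong on our end. Please try again later.",
		})
	}
}

// statusFor runs the handler in-process and returns the response status.
func statusFor(username string) int {
	req := httptest.NewRequest(http.MethodPost, "/users?username="+username, nil)
	rec := httptest.NewRecorder()
	handleCreateUser(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("admin"), statusFor("someone-new"))
}
```

The status code and the wording live in the handler, so a different handler (say, one in auth code) is free to translate the same error state differently.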

What Is error For?

errors do one thing really, really well. And that’s indicating an error state of a program, and communicating with callers around it.

errors are for callers, and nobody else.

What’s up with wrapping, then? Why can we add context to an error, if the caller doesn’t care about or want the context?

A weaselly answer is that sometimes callers do want more context. The ability to wrap one strongly typed error inside another strongly typed error and let callers pick the level that they want to detect is pretty neat. But that’s weaselly, because most people just fmt.Errorf their way into a new error, adding context a caller can’t reliably read. If the error is for the caller, why is that so popular?

Because third party libraries exist.

Suppose you have a third party library for some SaaS API. It needs to communicate to you that the endpoint you requested wasn’t found.

The caller wants to know that the endpoint wasn’t found (and a retry is unlikely to fix it). The operator wants to know what the endpoint was, to aid in debugging. The third party library knows what the endpoint was, the caller does not. The caller knows how to communicate to the operator, the third party library does not.

In this case, the third party library wraps the “endpoint not found” error in some context that isn’t for the caller, but for the operator: the endpoint that couldn’t be found. It returns that to the caller, to surface to the operator. The caller can unwrap to get what they want, which is the error state they’re in. The caller can report the error to the operator, giving them what they want: the context around that error state, and where in the call stack the third party client was called, which is the deepest point within the application at which the error state occurred.
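Sketched out, that exchange might look like the following. The “library” half is simulated in the same file, and every name in it (ErrEndpointNotFound, fetch, syncWidgets, the /v2/widgets path) is hypothetical rather than any real client’s API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// ErrEndpointNotFound and fetch stand in for a hypothetical third party
// client. The library knows the endpoint but not how to reach the operator.
var ErrEndpointNotFound = errors.New("endpoint not found")

func fetch(endpoint string) error {
	// The endpoint is context for the operator, wrapped around the state
	// the caller cares about.
	return fmt.Errorf("GET %s: %w", endpoint, ErrEndpointNotFound)
}

// syncWidgets is application code. It reports the wrapped error to the
// operator, then unwraps to decide whether a retry is worthwhile.
func syncWidgets(ctx context.Context) (bool, error) {
	err := fetch("/v2/widgets")
	if err == nil {
		return false, nil
	}
	// For the operator: the full message, logged from the deepest point in
	// the application that saw the error.
	slog.ErrorContext(ctx, "failed to sync widgets", "error", err)
	// For the caller: only the state matters. A missing endpoint won't be
	// fixed by retrying.
	return !errors.Is(err, ErrEndpointNotFound), err
}

// run calls syncWidgets and returns the retry decision and error message.
func run() (bool, string) {
	retry, err := syncWidgets(context.Background())
	if err != nil {
		return retry, err.Error()
	}
	return retry, ""
}

func main() {
	fmt.Println(run())
}
```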


These are my beliefs on errors in general, and on Go’s error type in particular. I think my conclusion is that the error type is only for information that you think the caller needs, or that you need to give the caller so they can report it to the operator because you don’t know how to report it yourself.

Everything else doesn’t really benefit from being an error, but it does suffer for it.