Error Handling: Expecting the Unexpected

While designing any system, one must prepare an error handling mechanism. This is even more critical when you are designing a data pipeline. Data is one of the most abundant assets out there, and it can come with many issues. Data pipelines are also usually complicated systems that work with many other systems.

Of course, depending on the tool you are using to build the pipeline, you might have pre-existing options for handling errors and exceptions. You should use them. What we call for is that you know what can go wrong and how to handle it. Use what is provided out of the box and add to it if it doesn’t cover all your needs.

As a general classification, we can divide errors into two kinds: errors that stop the pipeline completely, and errors that indicate a corrupt message, table, bucket or time interval while the pipeline continues to operate normally. The first kind by nature stops the pipeline from running, but it is the second kind that we need to shed some light upon. When faced with an error of the second kind, you need to ask yourself: do you need to stop the pipeline, or is it OK for it to keep running? As for the corrupted data, will it be ignored? Will it be re-processed? If yes, then when? All these questions need answering before building the pipeline.
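To make the two kinds concrete, here is a minimal sketch in Python. It assumes messages arrive as JSON strings and that corrupt ones are set aside for later review. The names FatalPipelineError, process_batch and quarantine are made up for this illustration and do not come from any particular tool.

```python
import json


class FatalPipelineError(Exception):
    """Hypothetical error type: the pipeline cannot continue."""


def process_batch(raw_messages, handle, quarantine):
    """Process messages one by one, setting corrupt ones aside.

    raw_messages: iterable of JSON strings
    handle: callable applied to each parsed message
    quarantine: list that collects (message, reason) pairs for later review
    """
    processed = 0
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Second kind of error: one corrupt message, the pipeline keeps going.
            quarantine.append((raw, str(exc)))
            continue
        # Anything raised by handle (for example FatalPipelineError) is treated
        # as the first kind: it propagates and stops the run.
        handle(record)
        processed += 1
    return processed
```

Whether the quarantined messages are ignored or re-processed later is exactly the decision you need to make up front. The sketch only keeps them around so that the choice stays open.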

Furthermore, sometimes you are faced with issues that are not typical errors but are considered errors in the specific case you are handling. What are some of these issues that can happen in a pipeline? Only you can answer that question. Remember that even the most common issue might not be an issue in a special case. For example, a message that doesn’t comply with a certain schema might not be an issue if you have no schema registry and your sources have the liberty of introducing new schemas according to evolving business needs. Another example: the absence of messages in a time interval, or the absence of data from a certain source, is not an issue if it can be explained by the business logic.
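An issue like missing data never raises an exception, so it needs an explicit check. The sketch below is one possible way to do it, assuming you track the time of the last message per source. The function name, the parameters and the source names in the usage example are all hypothetical.

```python
from datetime import datetime, timedelta


def find_silent_sources(last_seen, now, max_silence, expected_quiet=()):
    """Return the sources that have been silent longer than max_silence.

    last_seen: dict mapping source name to the datetime of its last message
    expected_quiet: sources whose silence is explained by the business logic
    """
    silent = []
    for source, seen_at in last_seen.items():
        if source in expected_quiet:
            continue  # silence here is not an issue in our case
        if now - seen_at > max_silence:
            silent.append(source)
    return sorted(silent)
```

For example, a store that is closed overnight can be listed in expected_quiet so its silence is not reported:

```python
now = datetime(2024, 1, 1, 12, 0)
last_seen = {
    "orders": now - timedelta(minutes=5),
    "payments": now - timedelta(hours=3),
    "store_pos": now - timedelta(hours=10),
}
find_silent_sources(last_seen, now, timedelta(hours=1), expected_quiet={"store_pos"})
# returns ["payments"]
```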

So, in the design phase of your pipeline, you need to list all the problems that could be considered issues in your case. But can you actually list all the problems that you could face? Probably not. Always assume that you will face some errors that you have not expected. For those, you need a more general error catching mechanism. Think of something like catch(Exception e). In some cases that is possible; in others, even that general catch is not available. Faced with these limitations, your only hope is to set alerts and checks on the most critical parts of the pipeline. These are the parts that, if hit by errors, would compromise the whole pipeline. The belly of the beast, if you will. While this doesn’t catch all the errors, it limits their effect greatly.
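In Python, the counterpart of catch(Exception e) is except Exception. A minimal sketch of a catch-all wrapper around one pipeline step could look like the following. The names run_step and on_unexpected are invented here, and on_unexpected stands for whatever alerting hook you have.

```python
import logging

logger = logging.getLogger("pipeline")


def run_step(name, step, payload, on_unexpected):
    """Run one pipeline step with a catch-all around it.

    Returns (True, result) on success and (False, None) on any failure.
    """
    try:
        return True, step(payload)
    except Exception as exc:  # the Python counterpart of catch(Exception e)
        # Log the full traceback, then hand the error to the alerting hook.
        logger.exception("unexpected error in step %s", name)
        on_unexpected(name, exc)
        return False, None
```

A wrapper like this is best reserved for the critical steps. Wrapping everything tends to hide errors that you would rather see fail loudly.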

That last point introduces two very important concepts regarding error handling: alerting and weighting errors by effect. Let’s discuss the latter first.

Not all errors have the same urgency, effect or importance. Some errors are so trivial that it would be a waste of time and effort to even set up catches and alerts for them. Others directly affect the core business and are considered show stoppers. While listing the expected errors, divide them into levels based on both urgency and effect. And when you discuss how to prevent, detect or catch an error, have a look at its rank and ask yourself if it is worth the effort you are about to put in.

The other concept is alerting. When your pipeline detects an error or an issue, what should it do? You need to provide an alerting mechanism to notify you of the errors. It is advisable to have multiple channels for errors of different urgency. Not every error should wake you up at 2 am, nor should every error be something you casually stumble upon when you happen to check the logs every two weeks or so.
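Error levels and alert channels fit together naturally: each level maps to one channel. The sketch below assumes three levels and three channels. The level names, the channel names and the send callable are all placeholders for whatever your team actually uses.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1       # trivial, review occasionally
    MEDIUM = 2    # needs attention during working hours
    CRITICAL = 3  # affects the core business


# Hypothetical channel names; replace them with your own.
CHANNELS = {
    Severity.LOW: "weekly-report",
    Severity.MEDIUM: "team-chat",
    Severity.CRITICAL: "on-call-pager",
}


def route_alert(severity, message, send):
    """Pick a channel by severity and hand the message to send(channel, message)."""
    channel = CHANNELS[severity]
    send(channel, message)
    return channel
```

Keeping the mapping in one place makes it easy to review which errors are allowed to page someone at night.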

After your pipeline successfully catches an error or an issue, what is the next step? Sometimes alerting is all that we can do, but other times we can automate the recovery process, relying on an external tool or on the pipeline itself to receive the alert and take some action based on it. That action can be backing up some important data, running an additional step or simply retrying the step that caused the error.
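Retrying is the simplest of these recovery actions to automate. Here is a minimal sketch of a retry with exponential backoff. The function name and its defaults are assumptions, and the sleep parameter exists only so the waiting can be replaced in tests.

```python
import time


def retry(step, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step() until it succeeds or the attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    The last exception is re-raised so the caller can alert on it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the error surface
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying only makes sense for steps that are safe to run more than once, and for errors that are likely to be temporary, such as a network timeout.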

Perhaps the most important factor in error handling is the pipeline design itself. A design that provides things like message replaying and regular checkpoints can greatly reduce the effect of any error and at the same time increase the ability to recover from it as fast as possible.
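As a rough illustration of checkpoints and replay, the sketch below stores the offset of the next message in a small JSON file and resumes from it after a failure. It assumes the messages sit in an indexable list and that a local file is an acceptable checkpoint store. All function names are invented for this example.

```python
import json
import os


def load_checkpoint(path):
    """Return the next offset to process, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["offset"]


def save_checkpoint(path, offset):
    """Write the checkpoint to a temp file, then rename it over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)


def run_from_checkpoint(messages, handle, path):
    """Process messages starting at the last saved offset."""
    start = load_checkpoint(path)
    for offset in range(start, len(messages)):
        handle(messages[offset])
        # Saved only after handle succeeds, so a failed message is replayed.
        save_checkpoint(path, offset + 1)
```

Note the trade-off in this design: if the process dies after handle finishes but before the checkpoint is saved, that message is handled again on the next run. The handler should therefore tolerate seeing the same message twice.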

A summary of what we discussed can be written as a group of questions you should ask yourself during the design phase of your pipeline:

What are the errors that might occur?
What are the issues that might happen but at the same time will not raise an error?
What are the urgency and effect of these errors and issues?
Which of these errors could stop the pipeline completely?
Which of these issues can be ignored safely?
How can I automate the recovery of the pipeline from some of these errors and issues?
What are the channels that the alerts will be propagated through?
How can I modify the pipeline’s design to improve the ability to recover from these errors?