My Data Engineering Principles

Data engineering as a field is still relatively new. With so many tools and so many people coming from different backgrounds, it has started to resemble the wild west. Data engineers yearn for a concrete set of principles and design patterns just like the ones their software friends have. I believe the best way to get there is through constant discussion within the community. So, I thought I would throw my version of the data engineering principles into that discussion and guide any newcomers who might feel a bit lost.

Each Pipeline Has A Single Purpose

I like to design pipelines that can be maintained easily, and who wouldn’t, right? If your pipeline deals with more than one source/destination, you need to think carefully about your situation. I am not saying that a pipeline can’t have more than one source/destination. What I am asking of you is to revisit your design. Look at your pipeline and check whether it does just one thing or more. What will happen to it in the future if you want to remove a piece of functionality? Can that be done easily, or will you have to rebuild the whole pipeline? What if you want to modify a functionality? Upscale it?

Single-purpose pipelines are easy to maintain, upscale and destroy. Their only downside is that if you divide a pipeline into two pipelines that share 60% of the logic, you end up with a lot of duplicated components. That’s when the next principle comes in handy.

Create Modules For Common Patterns

If you usually read data from databases, and your tool/framework does this in more than one step, you might want to create a module out of these steps and reuse it every time you read from a database, as in the sketch below. Almost all tools/frameworks provide at least some way to package a couple of components together and use them in different pipelines, and if yours doesn’t provide such a way, feel free to hack something out, even if it is the most spaghetti-coded script the world has ever seen. The modules you create don’t have to be 100% general, as in never needing any modification for any use case. Modules that need to be tweaked for each use case but still save a lot of effort work well too.
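
To make that concrete, here is a minimal sketch of such a module, assuming a Python-based pipeline that uses SQLAlchemy and pandas; the function name, connection URL and chunk size are hypothetical placeholders for your own setup.

```python
# A reusable "read from database" module: package the usual steps
# (connect, query, fetch in chunks) into one call.
import pandas as pd
from sqlalchemy import create_engine


def read_table(connection_url: str, query: str, chunksize: int = 10_000):
    """Yield the query result in chunks so large tables never have to fit in memory at once."""
    engine = create_engine(connection_url)
    with engine.connect() as conn:
        for chunk in pd.read_sql_query(query, conn, chunksize=chunksize):
            yield chunk


# Reused across pipelines instead of repeating the same steps every time:
# for batch in read_table("postgresql://user:pass@host/db", "SELECT * FROM orders"):
#     process(batch)
```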

Modules save time and effort, and at the same time reduce the number of errors since they are well tested, which brings us to the next principle.

Test Your Pipeline

I don’t doubt your self-proclaimed genius status, but I am quite sure that testing has saved humanity a couple of times at least. Testing in data engineering is a bit tough and has almost no standards and guides, but you have to do it nonetheless. How to do it is your choice. Perhaps you want to test each component separately, or perhaps you want to test the whole pipeline with a couple of datasets. Two good general rules will help whatever your approach is: cover as many corner test cases as possible, meaning cases that don’t happen regularly and sit near the limits of what you accept, and aim for code coverage as close to 100% as possible. Try to test all possible routes/branches of your pipeline.
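
As an illustration, here is a tiny sketch of the component-level approach using pytest, where clean_amount stands in for one hypothetical transformation step of your pipeline, and the corner cases are just examples of values near the limits.

```python
# Testing one pipeline component in isolation with pytest.
import pytest


def clean_amount(raw: str) -> float:
    """Example transformation step: parse an amount field into a float."""
    value = float(raw.strip().replace(",", ""))
    if value < 0:
        raise ValueError("negative amounts are not accepted")
    return value


def test_regular_value():
    assert clean_amount("1,234.50") == 1234.5


@pytest.mark.parametrize("raw", ["0", " 0.0 ", "999999999"])
def test_corner_cases_near_the_limits(raw):
    # Boundary values that rarely show up in production data.
    assert clean_amount(raw) >= 0


def test_invalid_value_is_rejected():
    with pytest.raises(ValueError):
        clean_amount("-5")
```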

Log Each Step Of Your Processing

Log files are the developer’s best friend. If your tool/framework doesn’t produce logs for your pipeline and how the data passes through it, create those logs yourself. If it does produce them, have another look at how to enrich them as much as you can. Besides errors, warning and info statements are very helpful for knowing what was happening when an issue started.
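
A minimal sketch of what those enriched logs could look like in a Python pipeline using the standard logging module; the pipeline name, step and field names are hypothetical.

```python
# Info and warning statements at each step of the processing.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("orders_pipeline")


def load_batch(records: list[dict]) -> None:
    log.info("Received batch with %d records", len(records))
    valid = [r for r in records if "order_id" in r]
    if len(valid) < len(records):
        # Warnings like this tell you when an issue started, not just that it happened.
        log.warning("Dropped %d records without an order_id", len(records) - len(valid))
    log.info("Loading %d valid records to the destination", len(valid))
```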

Create Data Lineage And Provenance

Speaking of issues, data lineage can help you with them as well. You want to know the steps each piece of data went through in your pipeline. This helps you when a piece of data goes through a route it shouldn’t go through. Having timestamps attached to this information also gives you an indicator of your pipeline’s performance.
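
One lightweight way to do this, sketched below in Python, is to attach a lineage trail of step names and timestamps to each record as it moves through the pipeline; the field names here are hypothetical, not any standard.

```python
# Attach lineage and provenance metadata to each record.
from datetime import datetime, timezone


def stamp(record: dict, step: str) -> dict:
    """Append the step name and a timestamp to the record's lineage trail."""
    trail = record.setdefault("_lineage", [])
    trail.append({"step": step, "at": datetime.now(timezone.utc).isoformat()})
    return record


record = {"order_id": 42, "amount": "19.99"}
record = stamp(record, "extracted_from_orders_db")
record = stamp(record, "amount_parsed")
# record["_lineage"] now shows which route the data took and when,
# which also gives a rough indicator of how long each step takes.
```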

Create Alerts That Notify And Take Actions

Believe me, you don’t want your data consumers to be the first to notice an issue with the pipeline. Trust is a key component in dealing with other teams in the data field since the amount of data is so huge that no one can validate all of it. Consumers of your data must be able to trust that your pipeline is delivering valid data.

For that, you need to set up alerts on your pipeline. These alerts should not only notify you in case of errors but also in case of abnormal behavior. Is it normal that your pipeline doesn’t receive data for more than 3 hours? Can the volume of your data jump from kilobytes to gigabytes? You need to be careful here not to create false alarms; too many alerts have the same effect as no alerts.

If you can, I encourage you to include actions in your alerts. If your target database is down, then perhaps your alert can start a backup one? Or run another pipeline to load the data to an S3 bucket? Or at least open a ticket with the database team?
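
Here is a rough sketch of such an alert in Python, combining a freshness check with an automated action; notify and start_fallback_load are hypothetical stand-ins for whatever alerting and orchestration tools you use.

```python
# An alert that both notifies and takes an action on the spot.
from datetime import datetime, timedelta, timezone


def notify(message: str) -> None:
    print(f"ALERT: {message}")  # replace with Slack, PagerDuty, email, ...


def start_fallback_load() -> None:
    # e.g. trigger a backup pipeline that lands the data in an S3 bucket,
    # or open a ticket with the database team.
    print("Starting fallback load to the backup destination")


def check_freshness(last_event_time: datetime, max_silence: timedelta) -> None:
    silence = datetime.now(timezone.utc) - last_event_time
    if silence > max_silence:
        notify(f"No data received for {silence}; expected at most {max_silence}")
        start_fallback_load()  # the automated action taken on the spot
```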

Alerts are the first responders for any issue; you need to arm them with the important metrics to monitor and the automated actions to take on the spot.

Enable Your Pipeline To Replay Data

Being able to replay data that was dropped or corrupted because of an issue is a nice feature that increases your pipeline’s value. Perhaps your pipeline deals with a lot of streaming data where some dropped events won’t impact your business and can be safely ignored. But if that’s not the case, you want to think about your replay strategy. Is your data from a database that you can read again anytime? Can you save the streamed data into an ODS or a staging area before processing it?

You can have a separate pipeline for replaying the data, although having the functionality in the same pipeline is preferred from my point of view, to avoid cluttering your environment with too many rarely used pipelines.
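
As a sketch of the staging-area idea, assuming a Python pipeline that writes streamed events to local files before processing them; in practice the staging location would be an ODS, an S3 prefix, or similar.

```python
# Stage incoming events before processing so they can be replayed later.
import json
from pathlib import Path

STAGING_DIR = Path("staging")  # hypothetical; an ODS or S3 prefix in practice


def stage(event: dict, batch_id: str) -> None:
    """Append the raw event to its batch file before any processing happens."""
    STAGING_DIR.mkdir(exist_ok=True)
    with open(STAGING_DIR / f"{batch_id}.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")


def replay(batch_id: str):
    """Re-read a staged batch so the pipeline can process it again."""
    with open(STAGING_DIR / f"{batch_id}.jsonl") as f:
        for line in f:
            yield json.loads(line)
```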

Save Data At Intermediate Steps Of Your Processing

Even better than replaying the data from the start is replaying it from a certain step. This can save you quite a bit of time and resources when replaying your data, but it also costs time and resources, because you write your data to an external storage location at several points in your pipeline instead of writing the result just once at the destination. Again, this depends on the value of your data and the cost of the steps in your pipeline. If all your pipeline does is some string manipulation on small pieces of data, then I think no intermediate saves are needed, but if it calls an external API or does some serious aggregations, then it would be a waste to repeat those expensive steps when they were performed without any issue.

Divide your pipeline into stages and save the output of each stage into an external storage location from which your pipeline can easily retrieve the data in case of a replay. Define how long the data should be stored there before marking it as expired and deleting it, to avoid filling your external storage.
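
A rough sketch of such stage-level checkpoints in Python, with a retention window after which they are deleted; the paths and the one-week retention are hypothetical choices.

```python
# Save intermediate stage outputs so a replay can resume from the last good stage.
import time
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")
RETENTION_SECONDS = 7 * 24 * 3600  # keep intermediate outputs for a week


def save_stage_output(stage: str, run_id: str, data: bytes) -> None:
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    (CHECKPOINT_DIR / f"{run_id}_{stage}").write_bytes(data)


def load_stage_output(stage: str, run_id: str) -> bytes | None:
    path = CHECKPOINT_DIR / f"{run_id}_{stage}"
    return path.read_bytes() if path.exists() else None


def expire_old_checkpoints() -> None:
    cutoff = time.time() - RETENTION_SECONDS
    for path in CHECKPOINT_DIR.glob("*"):
        if path.stat().st_mtime < cutoff:
            path.unlink()  # delete expired outputs to avoid filling the storage
```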