Data Artifacts

An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive [1]. These anti-patterns can be identified in management approaches, software design, programming, and probably wherever people come together to achieve anything.

Data Science has its own set of anti-patterns that data scientists and their project managers should be aware of. “Proliferating Data Artifacts” refers to the behaviour of generating datasets that represent various stages of processing. Data scientists often have to wrangle data into the right shape for analysis. This might include basic operations like joining, filtering, mapping, and aggregating, as well as any number of complex feature-engineering techniques. There is a strong incentive to save the processed data after each processing step to avoid unnecessary re-processing when changes to the code affect only later steps.
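In a hypothetical pandas workflow (file names and columns are invented for illustration), the anti-pattern looks something like this:

```python
import pandas as pd

# Step 1: join raw tables and save, so the join never has to run again.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
joined = orders.merge(customers, on="customer_id")
joined.to_parquet("joined_v1.parquet")

# Step 2: filter and save again -- a second copy of (most of) the data.
filtered = joined[joined["order_value"] > 0]
filtered.to_parquet("joined_filtered_v1.parquet")

# Step 3: aggregate and save yet again. After a few code changes there
# will also be *_v2.parquet, *_v3.parquet, *_final.parquet, ...
features = filtered.groupby("customer_id")["order_value"].agg(["mean", "sum"])
features.to_parquet("features_v1.parquet")
```

Each saved file duplicates most of the previous one, and every tweak to an early step spawns a new generation of versions.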

[Figure: A screenshot of Apple Finder showing many versions of the same document.]

The choice to save intermediate data artifacts can, however, become very costly when multiple versions of the same data exist. We all know this from trying to save Word documents as _v1.doc, _v2.doc, etc. A colleague might email back some changes but rename the document to _v2_John.doc, destroying the whole idea of a well-defined genealogy of documents. God forbid anyone ever tries to save a document as _final.doc. Future versions with an ever-growing _final_final_... suffix are all but inevitable.

Unlike text documents, datasets can easily take up gigabytes of space. Multiplied by the number of distinct processing steps and their variations, this can grow rapidly to the point where the dreaded ‘out-of-space’ warning hits the poor data scientist. According to Murphy’s law, this will always happen just before a critical deadline. In any case, the data scientist often finds it hard to decide which dataset to delete, being unsure which one is the latest or whether it is still needed. As a consequence, a disproportionate amount of time is spent on managing intermediate data artifacts and storage space.

The “Proliferating Data Artifacts” anti-pattern leads to the unnecessary management of intermediate data artifacts and storage space.

Pipelines to the Rescue

But as with every anti-pattern, there is a solution: pipelines. Pipelines manage intermediate datasets under the hood. A pipeline is basically a directed acyclic graph that describes the step-wise processing of the data.

[Figure: A graph illustrating a data pipeline.]

It is easy to see that any data artifact can be recreated by applying to its predecessors all the processing steps (arrows) that lead to it. All that is needed is the raw input data and a well-defined pipeline.
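As a minimal sketch of this idea, assuming nothing beyond plain Python and pandas (step names and files are made up, and a real framework would do much more), the graph can be expressed as functions that declare their upstream steps:

```python
import pandas as pd

# Registry mapping each step name to (function, names of its upstream steps).
STEPS = {}
CACHE = {}

def step(*upstream):
    """Register a function as a pipeline step with the given dependencies."""
    def decorator(func):
        STEPS[func.__name__] = (func, upstream)
        return func
    return decorator

def run(name):
    """Recreate any data artifact by first running the steps that lead to it."""
    if name not in CACHE:
        func, upstream = STEPS[name]
        CACHE[name] = func(*(run(dep) for dep in upstream))
    return CACHE[name]

@step()
def raw_orders():
    # Hypothetical input file; the raw data is the only thing that must exist.
    return pd.read_csv("orders.csv")

@step("raw_orders")
def valid_orders(orders):
    return orders[orders["order_value"] > 0]

@step("valid_orders")
def customer_features(orders):
    return orders.groupby("customer_id")["order_value"].agg(["mean", "sum"])

features = run("customer_features")  # runs exactly the steps leading to it
```

Here the intermediate results live only in an in-memory cache; a real implementation would persist them keyed by each step’s code and inputs, which is what makes the advantages below possible.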

The pipeline approach has several advantages:

  • The user does not need to track, save, or pass on (large) data artifacts.
  • Unnecessary re-processing of unchanged/unaffected data artifacts is avoided (one way to achieve this is sketched after this list).
  • A pipeline is defined in code and can be versioned.
  • Data artifacts from any pipeline version can easily and reproducibly be re-created.
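To make the second and fourth points concrete, here is a hedged sketch of one way the caching could work (it assumes deterministic steps and reuses the hypothetical STEPS registry from the sketch above): each artifact is stored under a hash of the step’s source code and the hashes of its inputs, so editing one step invalidates exactly the artifacts downstream of it.

```python
import hashlib
import inspect
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def artifact_key(func, upstream_keys):
    """Hash a step's source code together with the keys of its inputs."""
    payload = inspect.getsource(func).encode() + "".join(upstream_keys).encode()
    return hashlib.sha256(payload).hexdigest()

def run_cached(name):
    """Return (key, result), loading from disk when the artifact already exists."""
    func, upstream = STEPS[name]           # STEPS registry from the sketch above
    keys, inputs = [], []
    for dep in upstream:
        key, value = run_cached(dep)
        keys.append(key)
        inputs.append(value)
    key = artifact_key(func, keys)
    path = CACHE_DIR / f"{name}-{key}.pkl"
    if path.exists():                      # unchanged step: reuse the artifact
        return key, pickle.loads(path.read_bytes())
    result = func(*inputs)                 # changed or new step: recompute
    path.write_bytes(pickle.dumps(result))
    return key, result
```

With a scheme like this, any pile of cache files can be deleted at will, since every artifact is reproducible from the raw input and the versioned pipeline definition.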

There are many more advantages that a good pipeline implementation can offer, such as parallel processing of independent parts of the pipeline, being self-contained, restart after failure, and efficient propagation of data changes.
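For illustration, here is roughly what the same pipeline might look like in Luigi, one of many such frameworks (file paths and column names are again hypothetical). Each task declares its dependencies and an output target, and the scheduler skips tasks whose targets already exist, which also gives restart-after-failure essentially for free:

```python
import luigi
import pandas as pd

class ValidOrders(luigi.Task):
    def output(self):
        return luigi.LocalTarget("cache/valid_orders.csv")

    def run(self):
        orders = pd.read_csv("orders.csv")  # hypothetical raw input
        valid = orders[orders["order_value"] > 0]
        with self.output().open("w") as f:
            valid.to_csv(f, index=False)

class CustomerFeatures(luigi.Task):
    def requires(self):
        return ValidOrders()

    def output(self):
        return luigi.LocalTarget("cache/customer_features.csv")

    def run(self):
        orders = pd.read_csv(self.input().path)
        features = orders.groupby("customer_id")["order_value"].agg(["mean", "sum"])
        with self.output().open("w") as f:
            features.to_csv(f)

if __name__ == "__main__":
    # Completed targets are skipped, so a failed run can simply be restarted.
    luigi.build([CustomerFeatures()], local_scheduler=True)
```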

So there’s really no excuse not to use existing pipeline frameworks (or implement them yourself) and delete those proliferating data artifacts! [2]


  1. Wikipedia: Anti-Pattern

  2. The original version of this text was published 2018-05-09.