Skip to content

Elite Data Engineering 

I think it’s time to freshen up the world’s perception of Data Engineering.

This article has the dual purpose of raising awareness and acting as a manifesto for all the people at Agile Lab along with our past, present, and future customers.

In the collective imagination, the Data Engineer is someone who processes and makes data available to potential consumers. It’s usually considered as a role focused on achieving a very practical result: get data there — where and when it’s needed.

Unfortunately, this rather short-sighted view led to big disasters in Enterprises and beyond and you know why?

Data Engineering is a practice, not (just) a role.

In the past, many professionals approached the “extract value from data” problem as a technological challenge, architectural at most (DWH, Data Lake, now Cloud-native SaaS based pipelines, etc …), but what has always been missing there is the awareness that Data Engineering is a practice, that as such needs to be studied, nurtured, built and disseminated.

At Agile Lab, we have always dissociated ourselves from this too simplistic vision of Data Engineering, conscious of possible misunderstanding by the market when expecting from us such kind of “basic service”. We’d like to state this once again: Data Engineering is not about getting data from one part to another or executing a set of queries quickly (thanks to some technological help), rather than “supporting” data scientists. Data Engineering, as the word itself says, means engineering the data management practice.

When I explain our vision, I like to use parallelism with Software Engineering: after so many years, it has entered into a new era where writing code has moved way beyond design patterns, embracing governance techniques and practices such as:

  • source version control
  • testing and integration
  • monitoring and tracing
  • documentation
  • review
  • deployment and delivery
  • …and more

This evolution wasn’t driven by naïve engineers with time to waste; instead, it just happened because the industry realized that software has a massive impact, over time, on everything and everyone. A malfunctioning software with poor automation, not secure by design, hardly maintainable or evolvable has an intrinsic cost (let’s call it Technical Debt) that is devastating for the organization that owns it (and, implicitly, for the ones using and relying on it).

To mitigate these risks, we all tried to engineer software production to make it sustainable in the long run, aware of the raise of up-front costs.

When we talk about Engineering, in a broader viewpoint, we aim precisely to this: designing detailed and challenged architectures, adopting methodologies, processes, best practices, and tools to craft reliable, maintainable, evolvable, reusable products and services that meet the needs of our society.

Would you walk over a bridge made of some wooden planks (over and over again, maybe to bring your kids to school)?

Now, I don’t understand why some people talk about Data Engineering when, instead, they refer to just creating some data or implementing ETL/ELT jobs. It’s like talking about Bridge Engineering while referring to putting some wooden planks between two pillars: here the wooden planks are the ETL/ELT processes connecting the data flows between two systems. Let me ask you: is this really how would you support an allegedly data-driven company?

I see a significant disconnection between business strategies and technical implementations: tons of tactical planning and technological solutions but with no strategic drivers.

It’s like if you want to restructure a country’s viability system and you start making some roads or wooden bridges here and there. No, I’m sorry, it doesn’t work. You need a strategic plan, you need investments and, above all, you need to know what you are doing. You need to involve the best designers and builders to ensure that the bridges don’t collapse after ten years (we should have learned something recently). It is not something that can be improvised. You cannot throw workers to do that without proper skills, methodology, and a detailed project while hoping to deliver something of strategic importance.

Fortunately, there has been an excellent evolution around data practice in the last few years, but real-world data-driven initiatives are still a mess. That doesn’t surprise me because, in 2022, there are still a lot of companies that produce poor-quality software (no unit-integration-performance testing, no CI/CD, no monitoring, quality, observability, etc…). This is due to a lack of engineering culture. People are trying to optimize and save the present without caring about the future.

Since the DevOps wave arrived, the situation has moved a lot. Awareness increased, but we are still far from having a satisfactory average quality standard.

Now let’s talk about data

Data is becoming the nervous system of every company, large or small, a fundamental element to differentiate on the market in terms of services offer and automated support to business decisions.

If we genuinely believe that they are such a backbone, we must treat them “professionally”; if we really consider them a good, a service, or a product, we must engineer them. There is no such thing as a successful product that has not been properly engineered. The factory producing such products has undoubtedly been super-engineered to ensure low production costs along with no manufacturing defects and high sustainability.

Eventually, you must figure out if data for your business is that critical and behave accordingly, as usually done with everything else (successful) around us.

I can proudly say that at Agile Lab we believe data is crucial, and we assume that this is just as true for our customers.

That’s why we’ve been trying to engineer the data production and delivery process for years from a design, operation, methodology, and tooling perspective. Engineering the factory but not the product, or vice versa, definitely leads to failure in the real world. It is important to engineer the data production lifecycle and the data itself as a digital good, product, and service. Data does not have only one connotation, instead it is naturally fickle. I’d define it as a system that must stay balanced because, exactly as in quantum systems, the reality of data is not objective but depends on the observer.

Our approach has always been to consider the data as the software (and in part, the data is software), but without forgetting the greater gravity (Data Gravity — in the Clouds — Data Gravitas).

Here I see a world where top-notch Data Engineers make the difference for the success of a data-centric initiative. Still, all this is not recognized and well-identified like software development because just a few people know what a good data practice looks like.

But that’s about to change because, with the rise of Data Mesh, data teams will have to shift gears, they won’t have anyone else to blame or take responsibility in their places, they’ll have to work to the best of their ability to make sure their products are maintainable and evolvable over time. The various data products and their quality will be measurable, objectively. Since they will be developed by different teams and not by a single one, it will be inevitable to make comparisons: it will be straightforward to differentiate teams working with high-quality standards from those performing poorly, and we sincerely look forward to this.

The revolutionary aspect of Data Mesh is that, finally, we don’t talk about technology in Data Engineering! Too many data engineers focus on tools and frameworks, but practices, concepts, and patterns matter also if not more: in fact, eventually technologies are all due to become obsolete, instead of good organizational, methodological, and engineering practices that can long last.

Time to show the cards

So let’s come to what we think are the characteristics and practices that should be part of a good Data Engineering process.

Software Quality:

The software producing and managing data must be high-quality, maintainable, evolvable. There’s no big difference between software implementing a data pipeline and a microservice in terms of quality standards! In terms of workflow, ours is public and open source. You can find it here

Software design patterns and principles must also be applied to data pipelines development. Please stop scripting data pipelines (notebooks didn’t help here).

Data Observability:

Testing and monitoring must be applied to data, as well as to software, because static test cases on artifact data are just not enough. We do this by embracing the principles of DODD (Kensu | DODD). Data testing means considering the environment and the context where data lives and is observed. Remember that data offers different realities depending on when and where it’s evaluated, so it’s important to apply principles of continuous monitoring to all the lifecycle stages, from development through deployment, not only to the production environment.

When we need to debug or troubleshoot software we always try to have the deepest possible context. IDEs or various monitoring and debugging tools help by automatically providing the execution context ( stack trace, logs, variables status, etc.) letting me understand the system’s state at that point in time. 

How often did you see “wrong data” in a table having no idea on how it got there? How exausting was to: hand the piece of code allegedly generating such data, hoping it’s versioned not hardcoded somewhere, tried to mentally re-create the corner case that might have caused the error, searched for unit tests supposedly covering such corner case, scraping the documentation hoping to find references of that scenario… It’s frustrating, because data seems just plunged from the sky, with no traces left. I think 99% of data pipelines implemented out there have no links between the input (data), the function (software) and the output (data) thus making troubleshooting or, even worse, the discovery of an error just a nightmare.

In Agile Lab, we apply the same principles seen for software: we always try to have as much context as possible to understand better what’s happening to a specific data row or column. We try to set up lineage information (stacktrace), data profiling (general context) and data quality before and after the transformations we apply (execution context). All these information need to be usable right when needed and be well organized.

“OK you convinced me: which tool can make all of this possible ?”

Well, the reality is that practice is about people — professionals — not tools. Of course, many tools help in implementing the practice, but without the right mindsets and skills they are just useless.

“But I have analysts creating periodic reports and they can spot unexpected data values and distributions”

True, but ex-post checks don’t solve the problem of consumers having probably already ingested faulty data (remember the data gravity thing?) and they don’t provide any clues about where and why the errors came up. This to say that it makes a huge difference to proactively anticipate checks and monitoring so to make them synchronous with the data generation (and part of applications-generating-data lifecycle).

Decoupling:

It has become a mantra in software systems to have well-decoupled components that are independently testable and reusable.

We use the same measures with different facets in data platforms and data pipelines.

A data pipeline must:

  • Have a single responsibility and a single owner of its business logic.
  • Be self-consistent in terms of scheduling and execution environment.
  • Be decoupled from the underlying storage system.
  • I/O actions must be decoupled from business (data transformation) logic, so to facilitate testability and architectural evolutions.

Also, it is essential to have well-decoupled layers at the platform level, especially storage and computing.

Reproducibility:

We need to be able to re-run data pipelines in case of crashes, bugs, and other critical situations, without generating side effects. In the software world, corner case management is fundamental, just like (or even more) in the data one (because of gravity).

We design all data pipelines based on the principles of immutability and idempotence. Despite the technology, you should always be in the position to:

  • Re-execute a pipeline gone wrong.
  • Re-execute an hotfixed/patched pipeline to overwrite the last run, even if it was successful.
  • Restore a previous version of the data.
  • Run a pipeline in recovery mode if you need to clean up an inconsistent situation. The pipeline itself holds the cleaning logic, not another job.

So far, we have talked about data itself, but even more critical aspect is the underlying environments and platforms status reproducibility. Being able to (re)create an execution environment with no or minimum “manual” effort could really make the difference for on-the-fly tests, no-regression tests on large amounts of data, environment migrations, platform upgrades, switch on/simulate disaster recovery. It’s an aspect to be taken care of and can be addressed through IaC and architectural Blueprint methodologies.

Data Versioning:

The observability and reproducibility issues are rooted in the ability to move nimbly on data from different environments as well as from different versions in the same environment (e.g., rollback a terminated pipeline with inconsistent data).

Data, just like software, must be versioned and portable with minimum effort from one environment to another.

Many tools can help (such as LakeFS), but when they are not available it’s still possible to set up a minimum set of features allowing you to efficiently:

  • · Run a pipeline over an old version of data (e.g., for repairing or testing purposes)
  • Make an old version of the data available in another environment.
  • Restore an old version of the data to replace the current version.

It is critical to design these mechanisms from day zero because they can also impact other pillars.

Vulnerability Assessment:

It is part of an engineer’s duties to take care of security, in general. In this case, data security is becoming a more and more important topic and must be approached with a by-design logic.

It’s essential to look for vulnerabilities in your data and do it in an automated and structured way so to discover sensitive data before they go accidentally live.

This topic is very important at Agile Lab, so much that we have even developed proprietary tools to promptly find out if sensitive information is present in the data we are managing. These checks are continuously active, just like Observability checks.

Data Security at rest:

Fine-grained authentication and authorization mechanisms must exist to prevent unwanted access to (even just portion of) data (see RBAC, ABAC, Column and Row filters, etc…).

Data must be protected, and processing pipelines are just like any other consumer/producer so policies must be applied to storage and (application)users, that’s also why decoupling is so important.

Sometimes, due to strict regulation and compliance policies, it’s not possible to access fine-grained datasets, although aggregates or statistical calculations might still be needed. In this scenario, Privacy-Preserving Data Mining (PPDM) comes in handy. Any well-engineered data pipeline should compute its KPIs with the lowest possible privileges to avoid leaks and privacy violations (auditing arrives suddenly and when you don’t expect it). Have fun the moment you need to ex-post troubleshoot a pipeline that embeds PPDM, which basically gives you no data context to start from!

Data Issues:

Data Observability chapter provides capabilities to perform root cause analysis, but how do we document the issues? How do we behave when a data incident occurs? The last thing you wanna do is to “patch the data”; what you need to do is fixing the pipelines (applications), which will then fix the data.

First, the data issues should be reported just like any other software bug. They must be documented accurately, associating them with all the contextual information we extracted from the data observability, to clarify what would have been expected, what happened instead, what tests have been prepared and why they did not intercept the problem.

It is vital to understand the corrective actions at the application level (they could drive automatic remediation) as well as the additional tests that need to be implemented to cover the issue.

Another aspect to consider is the ability to associate data issues to the specific version of both software and data (data versioning) responsible of the issue because, sometimes, it’s necessary to put in place workarounds to restore business functionality first. At the same time, while data and software evolve independently, we must reproduce the problem and fix it.

Finally, to avoid repeating the same mistakes, it’s crucial to conduct a post-mortem with the team and ensure these are saved in a simil-wiki or somehow associated with the issue, searchable and discoverable.

Data Review:

Just as code reviews are a good practice, a good data team should perform and document data reviews. These are carried out at different development stages and cover multiple aspects such as:

  • Documentation completeness
  • Semantic and syntactic correctness
  • Security and compliance
  • Data completeness
  • Performance / Scalability
  • Reproducibility and related principles
  • · Test cases analysis and monitoring (data observability)

We encourage adopting an integrated approach across code base, data versioning, and data issues.

Documentation and metadata:

Documentation and metadata should follow the same lifecycle of code and data. Computational governance policies of the data platform should prevent new data initiatives to be brought into production if not well paired with the proper level of documentation and metadata. In fact, this aspect in the Data Mesh will become a foundation for having good data products that are genuinely self-describing and understandable.

Having ex-post metadata creation and publishing can introduce a built-in technical debt due to the potential misalignment with the actual data already available to consumers. It’s then good practice to make sure that metadata resides within the same repositories of pipelines’ code generating the data, which should imply synchronization between code and metadata and ownership by the very same team. For this reason, metadata authoring shouldn’t take place on Data Catalogs, which are supposed to help consumers find out the information they need. Embedding a metadata IDE to cover malpractices in metadata management won’t help in cultivating this culture.

Performance:

Performance and scalability are two engineering factors that are critical in every data-driven environment. In complex architectures, there can be hundreds of aspects with an impact on scalability and good data engineers should possess them as foundational design principles.

Once again, I want to highlight the importance of a methodological approach. There are several best practices to encourage:

  • Always isolate the various read, compute, and write phases by running dedicated tests that will point out specific bottlenecks thus allowing better-targeted optimization: when you’re asked to reduce the execution time of a pipeline by 10%, it must be clear which part of the process is worth working on. A warning here: isolating the various phases brings up the theme of decoupling and single responsibility. If your software design doesn’t respect these principles, it won’t be modular enough to allow clear steps isolation. For example, Spark has a lazy execution model where it is very hard to discern read times from transformation and write times if you are not prepared to do so.
  • Always run several tests on each phase with different (magnitudes of) data volumes and computing resources to extract scaling curves that will allow you to make reasonable projections when the reference scenario will change: it’s always unpleasant to find out that input data is doubling in a week and we have got no idea of how many resources will be needed to scale up. Worst case scenario to performance-test pipelines against should be designed with long-term data strategy in mind.
  • Always run several tests on each phase with different (magnitudes of) data volumes and computing resources to extract scaling curves that will allow you to make reasonable projections when the reference scenario will change: it’s always unpleasant to find out that input data is doubling in a week and we have got no idea of how many resources will be needed to scale up. Worst case scenario to performance-test pipelines against should be designed with long-term data strategy in mind.

Wrapping up

If you’ve come this far, you’ve understood what Data Engineering means to us.

Elite Data Engineering is not focusing only on data. Still, it guarantees that the data will be of high quality, that it will be easy to find and correct errors, that it will be simple and inexpensive to maintain and evolve, and that it will be safe and straightforward to consume and to understand.

It’s about practice, data culture, design patterns, broader sensitivity across many more aspects we might be used to when “just dealing with software”. At Agile Lab, we invest every single day in continuous internal training and mentoring on this, because we believe that Elite Data Engineering will be the only success factor of every data-driven initiative.

If you’re interested in other articles on the Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

 

Subscribe to the knowledge base

Receive all our latest articles in your inbox