
Decentralized Data Engineering

Centralized Data Engineering

The centralized approach is the most common when a company jumps into the world of data. Hiring a Head of Data ("been there, done that") is the first step. She will begin to structure processes and choose a technology to build the data platform. The platform contains all the business data, organized and processed according to the well-known requirements of data management.
In this scenario, data processing for analytical purposes occurs within a single technology platform governed by a single team, divided into different functions (ingestion, modeling, governance, etc.). Adopting a single technology platform produces a well-integrated ecosystem of functionalities and an optimal user experience. The platform typically also covers most of the compliance and governance requirements.
The Data Team manages the platform as well as the implementation of use cases. For those defining principles and practices, it is straightforward to align the team, because they have direct control over the people and the delivery process. This methodology ensures a high level (at least in theory) of compliance and governance. However, it typically doesn't meet the time-to-market requirements of business stakeholders, because the entire practice becomes a bottleneck when adopted in a large-scale organization. Business stakeholders (typically those paying for data initiatives) have no alternative but to adapt to the data team's timing and delivery paradigm.

Centralized Data Engineering

Shadow Data Engineering

Shadow Data Engineering

Multi-Tech Silos Data Engineering

Once Shadow Data Engineering has been running for a couple of years, it is no longer possible to return to Centralized Data Engineering. The technologies have penetrated processes, culture, and users, who have grown attached to the features rather than to their impact.
Centralizing again on a single platform would cost too much and slow down the business. While the limits of the shadow approach are now clear, the business benefits in terms of agility and speed have been valuable, and no one is willing to lose them.
Usually, companies tend to safeguard the technologies that have expanded the most, adding a layer of management and governance on top and creating skill and process hubs for each. Typically, they try to build, ex post, an integration layer between the various platforms to centralize data cataloging or cross-cutting practices such as data quality and reconciliation. This process is complicated because the platforms/technologies were not selected with interoperability in mind.
The various practices end up customized for each specific technology, resulting in a low level of automation.
This approach can still reduce the data debt, but building these governance and human-centric processes inevitably leads to an overall slowdown and increased effort without any direct impact on the business.
Although it will be possible to monitor and track the movement of data from one silo to another through audit and lineage processes, the information isolation of the various silos will persist.

Decentralized Data Engineering

There is a massive difference between adapting a practice to a set of technologies and their features, and selecting and adapting technologies to fit into a practice.

A clear set of principles must be a foundation when defining a practice.

Policies can be applied to compliance, security, information completeness, data quality, governance, and many other relevant aspects of data management, creating a heterogeneous but trustworthy ecosystem.
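To make this concrete, here is a minimal sketch of a policy expressed as code: it checks a hypothetical dataset descriptor for ownership, classification, and declared quality checks before the dataset is accepted into the ecosystem. The descriptor fields and allowed values are assumptions made for the example, not a standard.

```python
# Minimal "policy as code" sketch: a policy is just a function that inspects a
# dataset descriptor (a plain dict here) and returns the list of violations.
# Field names and allowed values are illustrative assumptions, not a standard.
from typing import List


def completeness_policy(descriptor: dict) -> List[str]:
    """Return a list of violations; an empty list means the policy passes."""
    violations = []
    if not descriptor.get("owner"):
        violations.append("every dataset must declare an accountable owner")
    if descriptor.get("classification") not in {"public", "internal", "confidential"}:
        violations.append("classification must be public, internal, or confidential")
    if not descriptor.get("quality_checks"):
        violations.append("at least one data quality check must be declared")
    return violations


if __name__ == "__main__":
    candidate = {
        "name": "orders_curated",
        "owner": "sales-data-team",
        "classification": "internal",
        "quality_checks": ["order_id is unique"],
    }
    issues = completeness_policy(candidate)
    print("policy passed" if not issues else f"policy failed: {issues}")
```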

When introducing a new technology, the first driver to evaluate is interoperability (at the data, system, and process level).

It is crucial to select composable services and open formats/protocols.
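To give a sense of what interoperability through open formats looks like in practice, the sketch below writes a toy dataset as Parquet with pyarrow; any modern engine (Spark, Trino, DuckDB, pandas, ...) can then read it back without going through the platform that produced it. The table content is invented for the example, and pyarrow is assumed to be installed.

```python
# Open formats keep data readable by any engine, independently of the producer.
# Assumes: pip install pyarrow
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table, invented for the example
orders = pa.table({
    "order_id": [1, 2, 3],
    "amount": [120.0, 35.5, 78.9],
    "currency": ["EUR", "EUR", "USD"],
})

# Write once in an open columnar format...
pq.write_table(orders, "orders.parquet")

# ...and any other consumer/technology can read it back without custom connectors.
roundtrip = pq.read_table("orders.parquet")
print(roundtrip.schema)
```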

Data Contract first means that the contract is defined during the development lifecycle. The contract then drives the automated creation of resources and standardizes the metadata exchange mechanisms.
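As an illustration, the following sketch declares a hypothetical contract as code inside the development lifecycle and derives a physical resource (a CREATE TABLE statement) from it. The contract fields and the DDL generation are assumptions for the example, not a specific product's API.

```python
# "Contract first" sketch: the contract lives in the codebase and the platform
# derives resources and metadata from it. The structure is illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataContract:
    name: str
    owner: str
    schema: Dict[str, str]                     # column name -> SQL type
    sla_freshness_hours: int
    quality_checks: List[str] = field(default_factory=list)

    def to_create_table(self) -> str:
        """Automated resource creation: derive a CREATE TABLE statement from the contract."""
        cols = ",\n  ".join(f"{col} {dtype}" for col, dtype in self.schema.items())
        return f"CREATE TABLE IF NOT EXISTS {self.name} (\n  {cols}\n)"


orders_contract = DataContract(
    name="orders_curated",
    owner="sales-data-team",
    schema={"order_id": "BIGINT", "amount": "DOUBLE", "currency": "VARCHAR(3)"},
    sla_freshness_hours=24,
    quality_checks=["order_id is unique", "amount >= 0"],
)

print(orders_contract.to_create_table())
```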

This is a direct consequence of the Data Contract first approach. With data contracts, we want to shift the focus to the result we want to achieve. But data is always the result of a transformation, and that transformation should be strongly connected with the data contract. The only way to do this is by adopting declarative semantics integrated with the data contract; otherwise, you could have pipelines generating information that is not aligned with the declared data contract. Connecting data contracts and transformations guarantees the consistency of schema metadata and lineage information out of the box. There is no time to keep a layer of documentation aligned when multiple teams work with high agility and speed, so documentation should be generated directly from the artifacts and the related inferred metadata.
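A minimal sketch of this idea, assuming a transformation is declared as data (inputs, output, and column expressions) rather than as opaque pipeline code: lineage can be inferred from the declaration, and the output can be validated against the contract schema before anything runs. All names and fields are illustrative.

```python
# Sketch of a declarative transformation bound to a contract schema. Because the
# transformation is declared as data, lineage is inferred and drift from the
# contract is detected before execution. Names and fields are illustrative.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DeclarativeTransformation:
    inputs: List[str]                  # upstream datasets
    output: str                        # dataset governed by the contract
    columns: Dict[str, str]            # output column -> expression over the inputs

    def lineage(self) -> Dict[str, List[str]]:
        """Dataset-level lineage comes for free from the declaration."""
        return {self.output: self.inputs}

    def validate_against(self, contract_schema: Dict[str, str]) -> List[str]:
        """Fail fast if the transformation does not produce what the contract declares."""
        declared, produced = set(contract_schema), set(self.columns)
        issues = []
        if declared - produced:
            issues.append(f"contract columns not produced: {sorted(declared - produced)}")
        if produced - declared:
            issues.append(f"columns produced but not in the contract: {sorted(produced - declared)}")
        return issues


t = DeclarativeTransformation(
    inputs=["raw.orders", "raw.currencies"],
    output="orders_curated",
    columns={
        "order_id": "raw.orders.id",
        "amount": "raw.orders.amount",
        "currency": "raw.currencies.iso_code",
    },
)
print(t.lineage())
print(t.validate_against({"order_id": "BIGINT", "amount": "DOUBLE", "currency": "VARCHAR(3)"}))
```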

So, in the decentralized data engineering practice, the platform team must build and maintain an Internal Data Engineering Platform, where it publishes a set of pre-built templates (scaffolds) and creates policies (as code) to enforce the process.
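The sketch below shows one possible shape for this: the platform team publishes a registry of scaffolds per use-case type and gates every deployment through the registered policies. The template URLs, descriptor fields, and the policy gate are assumptions made for the example; in practice this would typically be wired into CI/CD.

```python
# Sketch of an Internal Data Engineering Platform surface: published scaffolds plus
# a deployment gate that runs the registered policies as code. All names are
# illustrative assumptions, not a specific product's API.
from typing import Callable, Dict, List

# Pre-built templates (scaffolds) published by the platform team, one per use-case type
SCAFFOLDS: Dict[str, str] = {
    "batch-ingestion": "git@internal:templates/batch-ingestion.git",
    "streaming-pipeline": "git@internal:templates/streaming-pipeline.git",
}


def ownership_policy(descriptor: dict) -> List[str]:
    """Example policy as code: every deployable component must declare an owner."""
    return [] if descriptor.get("owner") else ["missing owner"]


POLICIES: List[Callable[[dict], List[str]]] = [ownership_policy]


def deploy(descriptor: dict) -> None:
    """The platform gates every deployment through the registered policies."""
    violations = [v for policy in POLICIES for v in policy(descriptor)]
    if violations:
        raise RuntimeError(f"deployment blocked by policy: {violations}")
    print(f"deploying {descriptor['name']} from scaffold {SCAFFOLDS[descriptor['type']]}")


deploy({"name": "orders_curated", "type": "batch-ingestion", "owner": "sales-data-team"})
```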

I realize that building the Internal Data Engineering Platform requires cost and effort, but it should be supported by a long-term strategic vision. Furthermore, there are no free lunches in the data world.

Decentralized Data Engineering

Thanks to Roberto Coluccio.

 

If you’re interested in other articles on Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Thank you for reading.

Author: Paolo Platter, Agile Lab CTO & Co-Founder
