From SAP to Data Mesh

As the data ecosystem gained momentum over the last ten years, more and more companies invested in becoming “data-driven” after the early adopters demonstrated how much value data could generate.

This “gold rush for data” focused on opening silos and centralizing data into platforms and lakes, pushing faster and faster towards distributed architectures and hybrid clouds, even transitioning from IaaS to PaaS to SaaS. But it lacked attention to aspects like data ownership, data integration, data quality assurance, scalable governance, usability, trust, availability, and discoverability of data, which are the key factors that allow consumers to find, understand, and safely consume data to provide business value.

During the webinar, we discussed with Guido Pezzin (Qlik QDI TSM Italy), Roberto Coluccio (Agile Lab Big Data Architect) and Robert Zenkert (Qlik Analytics Data Architect) how a Change Data Capture strategy could be useful to replicate events from an existing SAP environment inside a data product, which is the basic quantum in a Data Mesh architecture.
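In spirit, replicating CDC events into a data product’s store boils down to an upsert/delete loop. The sketch below is a deliberately minimal illustration; the event shape and field names are invented for this example, not the actual formats emitted by Qlik or SAP tooling:

```python
# Minimal sketch of applying change-data-capture (CDC) events to a local
# replica. The event shape ("op", "key", "data") is an invented example;
# real CDC tools emit their own, richer formats.
def apply_cdc_event(replica, event):
    """Apply a single CDC event (insert/update/delete) to a dict replica."""
    if event["op"] == "delete":
        replica.pop(event["key"], None)
    else:  # "insert" and "update" both upsert the latest state
        replica[event["key"]] = event["data"]
    return replica

replica = {}
events = [
    {"op": "insert", "key": 1, "data": {"status": "new"}},
    {"op": "update", "key": 1, "data": {"status": "shipped"}},
    {"op": "insert", "key": 2, "data": {"status": "new"}},
    {"op": "delete", "key": 2, "data": None},
]
for e in events:
    apply_cdc_event(replica, e)
# replica now holds only key 1, with its latest state
```

The value of CDC for a data product is exactly this: consumers see the source’s latest state (or its full history) without the source system having to serve their queries directly.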

If you missed the event, fill in the form and watch the video!


Agile Lab’s operational platform for Vodafone Automotive

Scienza e Tecnologia 

The solution built by Agile Lab is based on the Wasp (Wide Analytics Streaming Platform) framework, a data streaming platform able to collect and analyze data through analytical or machine learning logic, returning results…

Tech Business

Agile Lab’s platform chosen to manage Vodafone Automotive’s data

The solution is based on the WASP framework

For Vodafone Automotive, a Vodafone Group company, Agile Lab built an application platform able to collect, analyze, process, and store large amounts of data in an optimized way, thereby improving customer service and increasing the company’s competitive advantage…


Agile Lab and Vodafone Automotive: two Italian companies for car fleet big data

The two Italian companies collaborate on the optimal management of the big data generated by customers in the automotive sector, processing large amounts of information quickly…

Channel Tech

Agile Lab: leading multinationals towards a data-driven world

Creating value from data, through applications developed in the Big Data and Machine Learning space.

Agile Lab is an Italian company founded in 2014 as a private initiative of Alberto Firpo and Paolo Platter, CEO and CTO respectively. It serves the banking, insurance, manufacturing, and utilities sectors and counts over 50 people across its offices in Torino, Milano, Treviso, Bologna, Catania and Bari…

Industria Italiana

Digitalization: Intesa Sanpaolo Smart Care chooses Agile Lab to optimize its processes

The goal of the collaboration is to offer the Group’s customers a better experience of the range of services dedicated to health, mobility, and the home.

“We are honored to have contributed, with our technology, to improving the Intesa Sanpaolo Smart Care service offering, speeding up processes and supporting the Group…

Data Mesh explanation

How and why successful data-driven companies are adopting Data Mesh

Paradigm shift

Every once in a while, a new way of doing things comes along and changes everything. Sometimes this takes the form of new technologies, infrastructures, or services. Other times, the need arises urgently from the market itself. While the former requires engineering teams to push the change, the latter is very likely an “ask for help” straight from the business, and that is the most powerful engine an industry can have.

As the data-* (management, processing, governance, …) ecosystem gained momentum over the last ten years, more and more companies invested in becoming “data-driven” after the early adopters demonstrated how much value data could generate. This wave enabled the development of all the Big Data and Cloud technologies, standards, and services we know and use so well today.

This “gold rush for data” focused on opening silos and centralizing data into platforms and lakes, pushing faster and faster towards distributed architectures and hybrid clouds, even transitioning from IaaS to PaaS to SaaS. But it lacked attention to aspects like data ownership, data quality assurance, scalable governance, usability, trust, availability, and discoverability of data, which are the key factors that allow consumers to find, understand, and safely consume data to provide business value.

Data Mesh is a paradigm shift that arose as a need from the field, from the real world of monolithic data lakes and platforms. It can be considered revolutionary for the results it promises, and evolutionary because it leverages existing technologies and is not bound to any specific underlying technology.

It is an organizational and architectural pattern leveraging domain-driven design, that is, the capability of designing data domains that are business-oriented rather than technology-oriented. We can see this paradigm shift in data as analogous to the transition from monolithic web services to domain-driven micro-services.

When Data Lake becomes Data Mess

To better understand the main advantages of Data Mesh and its architectural principles, we need to take a step back and look at what was (and in most cases still is) the state of the art in data management before this new paradigm.

In recent years, the main trend in data management has been to create a single, centralized Data Lake (often built on-premises) to achieve both centralized data governance and a centralized processing platform. While the former proved successful, despite significant technology investments the latter became counterproductive, from both organizational and technical points of view, for several reasons.

When creating Data Lakes, the first mantra was to open the silos: set up ingestion pipelines as soon as possible to bring data from external systems into the data lake. The data lake’s internal data engineering team was usually accountable for designing these processes. The integration effort was undertaken from a systems point of view, i.e., “let’s understand how we can take data from external systems and bring it into the data lake.” This happened via the broadest variety of special-purpose or generalized ETL (Extract, Transform, Load) jobs or CDC (Change Data Capture) tools. Once the integration was set up, data ownership fell automatically into the hands of the data engineering team, which usually did not put much effort into first agreeing with the source systems on data documentation, data quality, and so on, resulting in extra effort to implement checks, metrics, and data quality measurements on “not-so-well-known” data.
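To make that “extra effort” concrete, here is a hedged sketch of the kind of after-the-fact quality check a central data engineering team ends up writing on data it does not own. The field names, records, and rules are invented for illustration:

```python
# Sketch of a retrofitted data quality check on ingested records whose
# upstream contract was never agreed. "customer_id"/"country" and the
# completeness rule are illustrative assumptions, not a real schema.
def check_record(record, required_fields):
    """Return the quality issues found in a single ingested record."""
    issues = []
    for name in required_fields:
        if name not in record or record[name] in (None, ""):
            issues.append(f"missing value for '{name}'")
    return issues

def quality_report(records, required_fields):
    """Summarize how many ingested records pass all completeness checks."""
    failed = [r for r in records if check_record(r, required_fields)]
    total = len(records)
    return {
        "total": total,
        "failed": len(failed),
        "pass_rate": (total - len(failed)) / total if total else 1.0,
    }

# Records landed by an ingestion pipeline, with no upstream guarantee.
records = [
    {"customer_id": "c1", "country": "IT"},
    {"customer_id": "c2", "country": ""},  # empty value slipped through
    {"country": "DE"},                     # key missing entirely
]
report = quality_report(records, ["customer_id", "country"])
```

The point is organizational rather than technical: these checks can only react to problems downstream, whereas an upstream agreement on documentation and quality would prevent them at the source.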

This integration-based approach leads to even worse scenarios when something about the source system changes: schema changes, evolving source domain specifications, the introduction of GDPR, you name it… It is a model that cannot scale, especially for multinational corporations centralizing data from different branches/countries with their related laws and regulations: source systems are not aware of the data warehousing process, they don’t know about data consumers’ needs, and they are not focused on providing quality data because it is not their business purpose. This usually sets the scene for disengagement from creating added value for the overall organization.

Another classic problem of the Data Lake is its layered structure, where layers are typically technical (cleansing, standardization, harmonization). You can look at these layers as a fixed amount of overhead between the data and the business needs, continuously slowing down the process of value creation.

Data Mesh overview

Data Mesh is now defined by four principles (according to Zhamak Dehghani):

  • Domain-oriented decentralized data ownership and architecture 
  • Data as a product 
  • Self-serve data infrastructure as a platform 
  • Federated computational governance. 

To understand what is changing compared with the past, it is useful to start by changing the vocabulary. In Data Mesh, we talk more about serving than ingesting, as it is more important to discover and use data than to extract and load it.

Every movement or copy of the data has an intrinsic cost: 

  • Development: the ETL must be developed, tested, and deployed 
  • Maintenance: this is the worst one. You need to monitor such processes, adapt them when the sources change, and take care of data deletion and the dispersion of data ownership. 

Often the data movement or copy is needed for the following reasons: 

  • Technical layers 
  • Technology needs: you have your data on S3, but SAP requires the data in an internal table to process it. Or you have a massive dataset on Redshift, and your ML training tool requires data on S3. 
  • No time travel and history capabilities: need to snapshot a data source 

Keep in mind that data movement/copy is not data denormalization. Denormalization is quite normal when you have multiple consumers with different needs, but it does not imply a transfer of ownership.

When you move data from one system/team to another, you transfer ownership and create dependencies with no added value from a business perspective. Data Mesh transfers data ownership only when data assumes a new functional/business meaning.

The Data Mesh paradigm is also instrumental in “future-proofing” the company when new technologies emerge. Each source system can adopt them and create new connectors to the scaffolding template (we will go deeper into this in upcoming articles), thus maintaining coherence in providing access to its data for the rest of the company through Mesh Services.

Data Mesh adoption requires a very high level of automation in infrastructure provisioning, realizing the so-called self-service infrastructure: every Data Product team should be able to autonomously provision what it needs. Even though teams are autonomous in technology choices and provisioning, they cannot develop their products with unconstrained access to the full range of technologies the landscape offers. A key point that makes a Data Mesh platform successful is federated computational governance, which enables interoperability through global standardization. Federated computational governance is a federation of data product owners with the challenging task of creating rules and automating (or at least simplifying) adherence to them. What the federation agrees upon should, as much as possible, follow DevOps and Infrastructure as Code practices.
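As a toy illustration of what “computational” governance can mean in practice, the sketch below runs automated policy checks against a data product descriptor, as a deployment pipeline might. The specific rules (kebab-case names, a mandatory owner, an allowed set of output technologies) are invented examples, not rules any real platform mandates:

```python
# Hedged sketch: governance rules expressed as code and checked automatically.
# The descriptor shape and every rule below are illustrative assumptions.
import re

ALLOWED_OUTPUT_TECH = {"kafka", "s3", "jdbc"}  # assumed global standard

def check_descriptor(descriptor):
    """Return a list of governance violations; an empty list means compliant."""
    violations = []
    if not re.fullmatch(r"[a-z][a-z0-9-]*", descriptor.get("name", "")):
        violations.append("name must be lowercase kebab-case")
    if "owner" not in descriptor:
        violations.append("every data product must declare an owner")
    for port in descriptor.get("output_ports", []):
        if port["technology"] not in ALLOWED_OUTPUT_TECH:
            violations.append(
                f"output technology '{port['technology']}' is not standardized")
    return violations

descriptor = {
    "name": "Vehicle_Telemetry",              # breaks the naming rule
    "output_ports": [{"technology": "ftp"}],  # not in the allowed set
}                                             # also missing an "owner"
violations = check_descriptor(descriptor)
```

Running such checks in CI/CD is one concrete way the federation can automate adherence to its rules instead of enforcing them through manual review.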

Each Data Product exposes its capabilities through a catalog by defining its input and output ports. A Data Mesh platform should nonetheless provide scaffolding to implement such input and output ports, choosing technology-agnostic standards wherever possible; this includes setting standards for analytical as well as event-based access to data. Keep in mind that it should ease and promote the internally agreed standards, but never lock product teams into technology cages. The federated computational governance should also be open to change, letting the platform evolve with its users (the product teams).
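A minimal sketch of what such a catalog entry might look like, assuming an invented descriptor shape (real Data Mesh platforms define their own specifications):

```python
# Illustrative sketch of a data product with input/output ports registered
# in an in-memory catalog. All names and fields here are assumptions.
from dataclasses import dataclass, field

@dataclass
class Port:
    name: str
    technology: str  # standardized per governance rules, e.g. "kafka", "s3"
    schema_ref: str  # pointer to the published schema/contract

@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str
    input_ports: list = field(default_factory=list)
    output_ports: list = field(default_factory=list)

class Catalog:
    """Minimal catalog: consumers discover data products by domain."""
    def __init__(self):
        self._products = {}

    def register(self, product):
        self._products[product.name] = product

    def discover(self, domain):
        return [p for p in self._products.values() if p.domain == domain]

catalog = Catalog()
catalog.register(DataProduct(
    name="vehicle-telemetry",
    domain="automotive",
    owner="telemetry-team",
    output_ports=[
        Port("events", "kafka", "schemas/telemetry-v1.json"),
        Port("daily-snapshot", "s3", "schemas/telemetry-v1.json"),
    ],
))
found = catalog.discover("automotive")
```

Consumers then discover products by domain and read the published ports and contracts, instead of asking the producing team how to reach its data.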

Data Product standardization is the foundation that allows effortless integration between data consumers and data producers. When you buy something on Amazon, you don’t need to interact with the seller to know how to purchase the product or which characteristics it has. Product standardization and centralized governance are what a marketplace does to smooth and protect the consumer experience.


On this topic, you might be interested in the 10 practical tips to reduce Data Mesh adoption roadblocks we designed at Agile Lab, or you can learn more about how Data Mesh Boost can get your Data Mesh implementation started quickly.


If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

Managed Services for Mission Critical Big Data environment

A custom solution of managed data services adopting the discipline of Site Reliability Engineering (SRE), which incorporates aspects of software engineering and applies them to infrastructure and operations problems, with the main goal of creating scalable and highly reliable software systems.

To understand how, watch the video recorded during a past webinar (in Italian).

Bright Ideas Gone Agile

Reforming our Brand Identity: 

The Story Behind Agile Lab’s Logo

Today, the launch of our new website marks the beginning of a new phase for Agile Lab. In addition to the site, we have redesigned our logo to reflect the positive changes our company has experienced over the last few years. Founded in 2014, Agile Lab was born as a tech start-up with a forward-looking vision and ambitious plans, and has come a long way since. As the company continues to grow and evolve, our brand identity is growing and evolving along with it. This is why we couldn’t be more proud to introduce our brand-new logo, reflecting the innovation and foresight that Agile Lab stands for.  


Since the creation of our former logo, things have progressed fast, ideas have been put into action, and the company’s reach has expanded at rocket speed. The iconic blue cloud with a lightbulb at its core has been transformed in line with our ongoing progress. We have rethought our design into an image that incorporates the company’s past, present, and future. The restyling we’ve worked on represents the constant evolution of our company, as well as its remarkable rise. Our new logo values Agile Lab’s legacy while looking towards its future accomplishments and breakthroughs. The still cloud has turned into a more abstract and agile symbol.


To us, the geometry of two intersecting circles is open to multiple interpretations: the ability to capture current technology trends and foresee future ones; the ability to build new partnerships while maintaining and growing previously established relations; and most importantly, it represents the dynamic system behind the processing, extraction and transformation of Big Data. Essentially, through our new branding, we aim to communicate the commitment and hard work of a visionary brand that is projected into the future of innovation and data science. Needless to say, our achievements and aspirations would never be possible if it weren’t for the synergy and efforts of a great team that strives each and every day to turn bright intuitions into software and provide concrete services and solutions to our clients. Nor would we have made it thus far without the amazing support of our clients, who believed in our skills and committed to our shared projects.  


Agile Lab is moving forward by the day. This month, we celebrated the milestone of appearing on the Financial Times FT 1000 list of Europe’s fastest-growing companies, and we are looking forward to seeing what the future has in store for us. Please follow us on our channels (Instagram, YouTube, Twitter and LinkedIn), and we will keep you posted on any further progress of our team. And just a heads-up: our newsletter is coming soon, so please stay tuned for more content from Agile Lab!