How and why Data Mesh is shaping the evolution of data management
What is Data Mesh?
Data Mesh is not a technology but a practice, and a really complex one. It is a new way to look at data, data engineering, and all the processes involved, i.e. Data Governance, Data Quality, auditing, Data Warehousing, and Machine Learning. It completely changes the perspective on how we look at data inside a company. We are used to looking at data from a technical standpoint: how to manage it, how to process it, which ETL tool to use, which storage to choose, and so on. It is time to look at data from a business perspective and treat it as a real asset, let’s say as a product. Therefore, in the Data Mesh practice we need to reverse our way of thinking about data: we typically see data as a means to extract value for our company, but here we need to understand that data IS THE VALUE.
Why is it relevant? What problem is it going to solve?
If you have invested a lot in Data Lake technologies in past years, I have a piece of good news and a piece of bad news.
The good news: you are in a position to understand what the problems are and why Data Mesh is going to solve them.
The bad news: you will probably need to review your data management strategy if you don’t want to fall behind your competitors.
The main problem is related to how the Data Lake practice was defined. In the first stage (I’m talking about the period from 2015 to 2018, at least in the Italian market) it was not a practice, it was a technology: specifically, Hadoop was the data lake.
It was very hard to manage: environments were mainly on-premise and skills in such technologies were lacking. And… there was no defined practice. Think about the person who coined the term “data lake”, James Dixon, who was the CTO of Pentaho: Pentaho is an ETL tool, so it’s easy to understand that the focus was on how to store and transform huge quantities of data, not on the process of value extraction or value discovery.
From an organizational perspective, it was a good option to set up a data lake team in charge of centrally managing the technical complexity of the data lake. When things became more complicated, the same team was in charge of managing the different layers (cleansed layer, standardized layer, harmonization layer, serving layer), all technical ones, obviously.
In most cases this highly specialized team then ended up handling all the requests coming from the business, which was in charge of gaining insights from data. Yet the team had no knowledge of the data domains, because it had always focused on technicalities, and it still relies on the operational system teams for functional and domain information about the data it manages. This is a technical and organizational bottleneck, and it is why a lot of enterprises struggle to extract value from data: implementing a single KPI generates a change management storm across all the technical layers of the data lake, and it also requires involving people from operational systems who are neither engaged nor interested in the process, since it is not their duty. This is a film that is not going to work, because the purposes of the actors are not aligned (also because the organization itself is not oriented to the business purpose).
Why is it not super popular then? How is it going to solve the problems you mentioned before?
Data Mesh puts data at the center. As I said, data is the value, data is the product, so the greatest effort is spent on letting people CONSUME data instead of processing and managing it.
The core idea is that we have only two stakeholders: those who produce data, let’s say sell it, and those who consume data, that is, buy it.
And we all know that in the real world the customer rules the market, so let’s think about the needs of data consumers:
1. They want to be able to consume all the company’s data in a self-service way. It is Amazon-like: you browse and you buy, so basically you consume products from a marketplace.
2. They don’t want to deal with technical integration every time they need a new data product.
3. They don’t want to spend time trying to understand the characteristics of the data. If you want to buy something on Amazon, do you call the seller to better understand what the product looks like? NO! The product is self-describing: you must be able to get all the information you need to buy it without any interaction with the seller. That’s the key, because any interaction creates tight coupling: you need to find a slot in the agenda, wait for documentation, and so on. This takes time and destroys the data consumer’s time to market!
4. They want to be able to trust the data they consume. Checking the quality of the product is a problem for whoever sells it, not for whoever buys it. If you buy a car, you don’t need to check all the components: there is a quality assurance process on the seller’s side. Instead, I have seen many times quality assurance performed in the Data Lake after ingesting data from the operational system, then again in the DWH after ingesting data from the lake, and maybe once more in the BI tool after ingestion from the DWH.
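To make needs 3 and 4 more concrete, here is a minimal sketch of what a self-describing data product with producer-side quality checks could look like. It is not a reference implementation: the class names (`Column`, `DataProduct`), the `describe` and `publish` methods, and the `orders` example are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class Column:
    name: str
    dtype: str
    description: str


@dataclass
class DataProduct:
    """Hypothetical descriptor: metadata and quality checks travel with the data."""
    name: str
    domain: str
    owner: str
    description: str
    columns: List[Column]
    # Quality checks belong to the producer: they run before data is exposed.
    quality_checks: Dict[str, Callable[[Rows], bool]] = field(default_factory=dict)

    def describe(self) -> dict:
        """Return what a consumer needs to evaluate the product without calling the owner."""
        return {
            "name": self.name,
            "domain": self.domain,
            "owner": self.owner,
            "description": self.description,
            "columns": {c.name: (c.dtype, c.description) for c in self.columns},
            "quality_checks": sorted(self.quality_checks),
        }

    def publish(self, rows: Rows) -> Rows:
        """Expose rows only if every producer-side quality check passes."""
        failed = [name for name, check in self.quality_checks.items() if not check(rows)]
        if failed:
            raise ValueError(f"quality checks failed: {failed}")
        return rows


# Illustrative product; names and checks are invented for this sketch.
orders = DataProduct(
    name="orders",
    domain="sales",
    owner="sales-data-team",
    description="One row per confirmed customer order.",
    columns=[
        Column("order_id", "string", "Unique order identifier"),
        Column("amount", "decimal", "Order total in EUR"),
    ],
    quality_checks={
        "order_id_not_null": lambda rows: all(r.get("order_id") for r in rows),
        "amount_non_negative": lambda rows: all(r["amount"] >= 0 for r in rows),
    },
)
```

A consumer would call `orders.describe()` to browse the product, while `orders.publish(rows)` refuses to expose data that fails a check, so quality assurance happens once, on the seller’s side.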
Data Mesh simply enables this kind of vision around data in a company. It is a simple concept but really hard to build in practice: that’s why it’s still not mainstream (but it is coming). The term was coined in 2019 and we are still in an experimental phase, with people exchanging opinions and trying it out. What is truly game-changing for me is that we are not talking about technology, we are talking about practice!
What is the maturity grade of this practice?
Thoughtworks did an excellent job of explaining the concepts and evangelizing around them but, at the moment, there is no reference physical architecture.
I think that we, as Agile Lab, are doing an excellent job in going from theory to practice, and we are already in production with our first data mesh architecture at a large enterprise.
Anyway, the level of awareness around these concepts is still low. In the community we are discussing data product standardization and other improvements needed to make this pattern easier to implement. At the moment, it’s still too abstract, and it requires extensive experience with data platforms and data processes to turn the concept into a real “Data Mesh platform”.
As happened with microservices in operational systems (Data Mesh has a lot in common with microservices), the new practice so far seems elegant from a conceptual standpoint but hard to implement. It will require time and effort, but it will be worth it.
What is the first step to implement it?
My suggestion is to start by identifying one business domain and to apply the concepts only within its boundaries. To choose the right one, look at the ratio between business impact and complexity, and pick the domain where it is highest. Reduce the number of technologies needed for the use case as much as you can, because providing them in self-service mode is mandatory.
In addition, try to build your first real cross-functional team, because the mesh revolution also goes through the organization and you need to be ready for that.
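The impact-to-complexity heuristic can be sketched in a few lines. The domains and scores below are invented purely for illustration, and how you actually score impact and complexity is up to your organization.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    domain: str
    business_impact: float  # hypothetical score, higher is better
    complexity: float       # hypothetical score, must be greater than zero


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidate domains by impact/complexity ratio, highest first."""
    return sorted(
        candidates,
        key=lambda c: c.business_impact / c.complexity,
        reverse=True,
    )


# Made-up scores for three example domains.
candidates = [
    Candidate("sales", 8, 4),       # ratio 2.0
    Candidate("logistics", 6, 2),   # ratio 3.0
    Candidate("finance", 9, 9),     # ratio 1.0
]
ranked = rank(candidates)
print([c.domain for c in ranked])  # ['logistics', 'sales', 'finance']
```

Note that the domain with the highest absolute impact (finance in this made-up example) ends up last, because its complexity cancels out the benefit for a first project.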
How much time is required for a large organization to fully embrace data mesh?
I think that a large enterprise with a heavy legacy should plan for a journey of at least 3 years to really comprehend it and see its benefits.
What is the hardest challenge in implementing a data mesh?
The biggest one is the paradigm shift that is needed, which is not easy for people who have been working with data for a long time.
The second one is the application of evolutionary architecture principles to data architectures, an unexplored territory. It is not immediately obvious, but if you think about data management systems, their architecture doesn’t change much until it is obsolete and you need a complete rewrite.
In the Data Mesh, instead, we focus on the producer/consumer interfaces and not on the data life cycle. This means that, as long as we keep this contract intact, we can evolve our architecture under the hood, one piece at a time and with no constraints.
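One way to picture “keeping the contract safe” is an automated check that compares the schema a producer is about to expose against the schema consumers rely on. The sketch below assumes a deliberately simple contract (column name to type name) and a hypothetical `breaking_changes` helper; real contracts usually cover more than schema, such as semantics and service levels.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name (simplified contract)


def breaking_changes(contract: Schema, candidate: Schema) -> List[str]:
    """List changes that would break consumers relying on `contract`.

    Adding columns is treated as safe; removing a column or changing its
    type is treated as breaking. This rule is an assumption of the sketch.
    """
    problems = []
    for column, dtype in contract.items():
        if column not in candidate:
            problems.append(f"removed column: {column}")
        elif candidate[column] != dtype:
            problems.append(f"type change on {column}: {dtype} -> {candidate[column]}")
    return problems


contract = {"order_id": "string", "amount": "decimal"}

# The internals changed and a column was added: the contract still holds.
additive = {"order_id": "string", "amount": "decimal", "currency": "string"}

# A column was dropped and another changed type: consumers would break.
breaking = {"order_id": "int", "currency": "string"}

print(breaking_changes(contract, additive))  # []
print(breaking_changes(contract, breaking))
```

With a check like this in the deployment path, the producer is free to replace storage, pipelines, or tools behind the interface, and is stopped only when a change would actually reach the consumers.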
Another challenge is the cultural change and how to drive it in huge organizations.
Is there any way to speed up the process?
From a technical standpoint, we think the key is to reach a community-driven standardization and interoperability of the data product as an architectural quantum. Once that is in place, it will be possible to build an ecosystem of products that will speed up the process.
Anyway, the key components to create a data mesh platform are:
- Data Product Templates: to spin up all the components of a Data Product, with built-in best practices
- Data Product Provisioning Service: all Data Product deployments should go through a single provisioning service in charge of applying computational policies, creating the underlying infrastructure, and feeding the marketplace
- Data Product Marketplace: a place where it is easy to search and discover Data Products, not just data.
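How the three components fit together can be sketched as follows. This is a toy, in-memory illustration and not the architecture of any real platform: `DataProductSpec`, `from_template`, `ProvisioningService`, `Marketplace`, and the two policies are hypothetical names, and infrastructure creation is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class DataProductSpec:
    name: str
    domain: str
    owner: str
    description: str
    output_ports: List[str]


# A policy returns None when satisfied, or a message describing the violation.
Policy = Callable[[DataProductSpec], Optional[str]]


def from_template(name: str, domain: str, owner: str, description: str) -> DataProductSpec:
    """Template with one built-in default (assumed here): a single SQL output port."""
    return DataProductSpec(name, domain, owner, description, output_ports=["sql"])


class Marketplace:
    """In-memory catalog where consumers search for data products."""

    def __init__(self) -> None:
        self._catalog: Dict[str, DataProductSpec] = {}

    def register(self, spec: DataProductSpec) -> None:
        self._catalog[f"{spec.domain}.{spec.name}"] = spec

    def search(self, term: str) -> List[str]:
        term = term.lower()
        return sorted(
            key for key, spec in self._catalog.items()
            if term in key.lower() or term in spec.description.lower()
        )


class ProvisioningService:
    """Single entry point: apply policies, then feed the marketplace."""

    def __init__(self, policies: List[Policy], marketplace: Marketplace) -> None:
        self.policies = policies
        self.marketplace = marketplace

    def deploy(self, spec: DataProductSpec) -> None:
        violations = [msg for msg in (policy(spec) for policy in self.policies) if msg]
        if violations:
            raise ValueError("; ".join(violations))
        # Creating the underlying infrastructure would happen here (omitted).
        self.marketplace.register(spec)


def must_have_owner(spec: DataProductSpec) -> Optional[str]:
    return None if spec.owner else "owner is required"


def must_have_description(spec: DataProductSpec) -> Optional[str]:
    return None if spec.description else "description is required"


marketplace = Marketplace()
service = ProvisioningService([must_have_owner, must_have_description], marketplace)
service.deploy(from_template("orders", "sales", "sales-data-team", "Confirmed customer orders"))
print(marketplace.search("orders"))  # ['sales.orders']
```

The design point is that nothing reaches the marketplace without passing through `deploy`, so every product a consumer can discover has already satisfied the policies.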
Author: Paolo Platter, Agile Lab CTO & Co-Founder