My name is Data Mesh. I solve problems.
In spite of the innumerable advantages it provides, we often feel that technology has made our work harder and more cumbersome instead of helping our productivity. This happens when technology is badly implemented, too rigid in its rules, built on a logic far removed from real-world needs, or even trying to help us when we don’t need it (yes, autocorrect, I’m talking to you: it is never “duck”!). But we would never go back to driving with a paper map unfolded on our lap, as in the “pre-GPS” days, or to browsing the yellow pages looking for a hotel, a restaurant, or a business phone number. We need technology that works and that we are happy to work with. Data Mesh, like Mr. Wolf in Quentin Tarantino’s “Pulp Fiction”, solves problems: fast, effectively and without fuss. Let’s take a look at what that means in practice.
What’s more frustrating than needing something you know you have, when you don’t remember where it is, or whether it still works? Imagine that this weekend you want to go to your lake house, where you haven’t been in a long while. There is no doubt you have the key to the house: you remember locking the door on your way out and then thinking long and hard about a safe place to store it. It may be at home, or even more likely in the bank’s safe deposit box. Checking there first, while it has a high probability of success, means having to drive to the bank during business hours and risking half a working day wasted for nothing. You can look for the key at home in the evening, and that should be faster, but it could also take forever if you don’t find it and keep searching, and then if it’s not at home you’d have to go to the bank anyway. You even had a back-up plan: you gave your uncle Bob a copy of the key. The problem is that you can’t remember if you also gave him a new one after you had to change the lock in 2018… If you did, Bob’s your uncle, also metaphorically, but are you willing to take the chance, drive all the way to the lake and maybe find out that the old key no longer works? You wanted to go there to relax, but it looks like you are only getting more and more stressed out with the preparation, and now you are wondering whether it’s really worth it…
That's a real problem
The above scenario seems to only relate to our private lives.
In reality, this is one of the biggest hurdles in a corporate environment as well. The only difference is that at home we look for physical, “tangible” objects, while the challenge in a modern company is finding reliable data. It is not just a matter of knowing where the data is, but also of knowing whether we can really trust it. In our spare time this is annoying, but we can afford this kind of uncertainty. More frequently than we care to admit, we either spend a disproportionate amount of time looking for what we have misplaced, making sure that it still works or fixing it, or we simply give up doing something pleasant and rewarding because we can’t be bothered to search for what we need, not knowing how long it will take. In the ever more competitive business world (whatever your business is, the competition is always tough!), avoiding extra expenses and never passing up an opportunity are absolute musts. And you can’t afford these missteps “just” because you can’t find or trust your data.
Today it is highly unlikely that a company doesn’t have the data it needs for a report, a different KPI, a new Business Intelligence or Business Analytics initiative, for analysis to validate a new business proposition and so on. We seem, on the contrary, to be submerged by data and to always be struggling to manage it.
New questions arise
The real questions we face are therefore about the quality of data. Is it reliable? Who created that data? Does it come from within the company, or has it been bought, merged after an acquisition, or derived from a different source dataset? Has it been kept up to date? Does it contain all the information I need in a single place? Is it in a format compatible with my requirements? What is the complexity (and hence the cost) of extracting the information I need from that data? Does that piece of information mean what I think it means? Is there anybody else in the company who has already leveraged that dataset and, if so, how was their experience?
Ok, so there might be problems that need solving before you can use a certain data asset. That, by itself, is par for the course. You know that every new business analysis, every new BI initiative comes with its own set of hurdles. The real challenge lies in being able to estimate these hurdles beforehand, so as not to incur budget overruns or other costly delays. It seems that every time you try to analyze the situation you can never get a definitive answer from data engineers regarding data quality, or even the time needed to determine if the data is adequate. There is no way you can set a budget, in terms of both time and resources, to resolve the issues when you can’t know in advance which issues the data might or might not have, and you can’t even get an answer regarding how long it will take and how expensive it will be to find out. It is downright impossible to estimate a Time-To-Market if you know neither what challenges you will face nor when you will become aware of those challenges. How, then, can you determine if a product or service will be relevant in the marketplace, when there is no way to know how long it will take to launch it? Should you take the risk, or should the whole project be canned? This is the kind of conundrum that poor data observability and reusability, combined with ineffective data governance policies, put you into. And that is a place where you really don’t want to find yourself.
What companies have only recently come to realize is that the “data integration” problem is, in general, better tackled as a social problem than as a technical one. Business units are (or at least should be) responsible for data assets, taking ownership in both the technical and the functional sense. In the last decade, on the contrary, Data Warehouse and Data Lake architectures (in all their variants) took the technical burden away from the data owners, while the knowledge of such data stayed in the hands of the originating Business Units. Unfortunately, a direct consequence has been that the central IT (or data engineering) team, once it put the first ingestion process in place, “gained” ownership of such data, thus forcing a Centralized Ownership. Here the integration breaks down: potential consumers who could create value out of data must now go through the data engineering team, which doesn’t have any real business knowledge of the data it provides as ETL outcomes. The end result is that potential consumers don’t trust the data assets or can’t actually leverage them, and are thus unable to produce value in the chain.
A brand new solution
All the above implies that the time has come not just for another architecture or technology, but rather for a completely new paradigm in the data world that addresses and solves all these data integration (and organization) issues. That’s exactly what Data Mesh is and where it comes into play. It is, first and foremost, a new organizational and architectural pattern based on the principle of domain-driven design, a principle proven to be very successful in the field of micro-services, that now gets applied to data assets in order to manage them with business-oriented strategies and domains in mind, rather than being just application/technology-oriented.
In layman’s terms it means data working for you, instead of you working to solve technical complexities (or around them). Moreover, it is both revolutionary, for the results it provides, and evolutionary, as it leverages existing technologies and is not bound to a specific underlying one. Let’s try to understand how it works, from a problem-solving perspective.
Data Mesh is now defined by 4 principles:
- Domain-oriented decentralized data ownership and architecture
- Data as a product
- Self-serve data infrastructure as a platform
- Federated computational governance
We already mentioned that one of the most frequent reasons for the failure of data strategies is the centralized ownership model, because of its intrinsic “bottleneck shape” and inability to scale up. The adoption of Data Mesh, first of all, breaks down this model, transforming it into a decentralized (domain-driven) one. Domains must be the owners of the data they know and that they provide to the company. This ownership must be both business-functional and technical/technological, so as to allow domains to move at their own speed, with the technology they are most comfortable with, while still providing valuable and accessible outcomes to any potential data consumer.
Trust and budget
“Data as a product” is apparently a simple, almost trivial, concept. Data is presented as if it were a product: easily discoverable, described in detail (what it is, what it contains, and so on), with public quality metrics and an availability guarantee (in lieu of a product warranty). Products are more likely to be sold (reused, if we talk about data) if a trust relationship can be built with the potential consumers, for instance by allowing users to write reviews/FAQs so that the community can share its experience with that asset. The success of a data asset is driven by its accessibility, so data products must provide many and varied access opportunities to meet consumer needs (technical and functional): the more flexibility consumers find, the more likely they are to leverage such data. In the current era, data must have time travel capabilities, and must be accessible via different dynamics such as streams of events (if reasonable considering data characteristics), not just as tables of a database. But let’s not focus too much on the technical side. The truly revolutionary aspect is that now data, as a product, has a predetermined and well-defined price. This has huge implications for the ability to budget for a project and report on it. In a traditional system, even a modern one like a Data Lake, all data operations (from ingestion, if the data is not yet available, to retrieval and preparation so you can then have it in the format you need) are delegated to the data engineers. If they have time on their hands and can get to your request right away, that’s great for you, but not so much for the company, as it implies that there is overcapacity. This is inefficient and very, very expensive.
If they are at capacity dealing with everyday operational needs and the projects already in the pipeline, any new activity has to wait indefinitely (also because of the uncertainties we discussed before), in a scenario where IT is a fixed cost for the company and capacity therefore cannot be expanded on demand. Even if it can be expanded, it is nightmarish to determine how much of the added capacity is directly linked to your single new project, how much of the work can or could be reused in the future, how much would eventually have had to be done anyway, and so on. When multiple new projects start at the same time, the entire IT overhead can easily become an indirect cost, and those have a tendency to run out of control in each and every company. With Data Mesh, on the other hand, you have immediate visibility of what data is available, how much (and how well) it’s being used and how much it costs. Time and money are no longer unknown (or worse, unknowable in advance).
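To make the budgeting point concrete, here is a minimal sketch in Python of what a data product “label” with a published price could look like. Everything in it is an assumption made for illustration: the `DataProduct` fields, the `estimate_data_budget` helper, the product names and the figures are hypothetical and do not come from any specific Data Mesh platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor a domain publishes alongside its data."""
    name: str
    owner_domain: str
    description: str
    monthly_cost: int          # published price, in whole currency units
    availability_slo: float    # guaranteed uptime, e.g. 0.999
    quality_metrics: dict = field(default_factory=dict)


def estimate_data_budget(products, months):
    """Sum the published monthly prices over the planned project duration."""
    return sum(p.monthly_cost for p in products) * months


# Illustrative products; names, prices and metrics are made up.
orders = DataProduct(
    name="orders",
    owner_domain="sales",
    description="One row per confirmed customer order",
    monthly_cost=1200,
    availability_slo=0.999,
    quality_metrics={"completeness": 0.98},
)
customers = DataProduct(
    name="customers",
    owner_domain="crm",
    description="Deduplicated customer master data",
    monthly_cost=800,
    availability_slo=0.995,
)

# A three-month project that consumes both products.
project_budget = estimate_data_budget([orders, customers], months=3)
print(project_budget)
```

Because the price and the guarantees are part of the descriptor, the data cost of a new initiative becomes a simple sum that can be computed before the project starts, rather than an open-ended request to a central team.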
A game changer
To understand the meaning of the next two points, “Self-serve data infrastructure as a platform” and “Federated computational governance”, let’s draw a parallel with a platform type that “changed the game” in the past. It is an over-simplification, but bear with us. Imagine you work in Logistics and you need to book a hotel for your company’s sales meeting. You have a few requirements (enough rooms available, price within budget, conference room on the premises, easy parking) and a list of “nice-to-haves” (half board or a restaurant so you don’t have to book catering for lunch, a shuttle to and from the airport for those who fly in, not too far from the airport or from the city center). Your company Data Lake contains all the data you need: it contains all the hotels in the city where the meeting will take place. But so did the “Yellow Pages” of yesteryear. How long does it take to find a suitable hotel using such a directory? It’s basically unknowable: if the Abbey Hotel satisfies all requirements, not too long, but good luck getting to the Zephir Hotel near the end of the list. Not only are they sorted in a pre-determined way (alphabetically), but you have to check each one in sequence by calling on the phone if you want to know availability, price and so on. Moreover, the quantity of information available for each hotel directly in the Yellow Pages is wildly inconsistent. It would be nice to be able to rule out a number of them, so you don’t have to call, but some hotels bought big spaces where they state whether they have a conference room or a restaurant, while others just list a phone number. If you also need to check the quality of an establishment, to avoid sending the Director of Sales to a squalid dump, the complexity grows exponentially, as does the uncertainty about how long it will take to figure it out. Maybe you are lucky and find a colleague who’s already been there and can vouch for it; otherwise you’d have to drive to the place yourself.
When you budget the company’s sales meeting, how do you figure out how much it will cost, in terms of time and money, just to find the right hotel? And as if not having an answer weren’t bad enough, the Sales Director, who’s paying for the convention, is fuming mad because this problem repeats itself every single year and you can never give an estimate, because the time it took you to find a hotel last time is not at all indicative of how long it will take next time.
If you now replace the word “hotel” with “data” in the above example, you’ll see that it might be a tad extreme, but it is not so far-fetched. A Data Mesh is, in this sense, like a booking platform (think hotels.com or booking.com) where every data producer becomes a data owner and, just like a hotel owner, wants to be found by those who use the platform. In order to be listed, owners must follow some rules imposed by the federated governance. The hotel owner must list a price (cost) and availability (uptime), as well as a structured description of the hotel (dataset) that includes the address, category and so forth, and all the required metadata (is there parking? a restaurant? a pool? is breakfast included? and so on). Each of these features becomes easily visible (it is always present and shown in the same place), is searchable and can act as a filter. People who stay at the hotel (use the dataset) also leave reviews, which help both other customers in choosing and the hotel owner in improving the quality of its offer, or at least the accuracy of the description.
The “self-serve” aspect is two-fold. From a user perspective it means that with such a platform the Sales department can choose and book the hotel directly, without needing (and paying for) the help of Logistics (Data Lake engineers). From an owner perspective (hotel or data owner) it means that they can independently choose and advertise what services to offer (rooms with air conditioning, Jacuzzis, butler service and so on) in order to meet and even exceed the customers’ wishes and demands. In the data world this second aspect relates to the freedom of data producers to autonomously choose their technology path, in accordance with the standards approved by the federated governance.
Last, but definitely not least, the Data Mesh architecture brings to the table ease of scalability (once you have all the hotels/datasets in one city, the system can grow to accommodate those of other cities as well) and reuse. Reuse means that the effort you spent in creating a solution can, at least in part, be employed (reused) to create another. Let’s stick to the hotel analogy. If you created the hotel platform last year and now want to do something similar for B&Bs, there is a lot you don’t have to redo from scratch. Of course, the “metadata” will be different (B&Bs don’t have conference rooms), but you can still use the same system of user feedback and the same technology to gather information on prices and availability, which once again will be up to the B&B owner to keep up to date, and so on.
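The booking-platform idea can be sketched in a few lines of Python: a catalog that refuses a listing unless the governance rules are met, then lets consumers filter on the mandatory metadata and leave reviews. This is only an illustration under stated assumptions: the `Catalog` class, the `GovernanceError` exception, the `REQUIRED_METADATA` set and the sample datasets are all hypothetical names invented for this example, not the API of a real product.

```python
# Hypothetical governance rule: every listing must declare these fields.
REQUIRED_METADATA = {"owner_domain", "monthly_cost", "availability_slo", "description"}


class GovernanceError(ValueError):
    """Raised when a listing does not satisfy the federated rules."""


class Catalog:
    """Toy data product catalog, in the spirit of a hotel booking platform."""

    def __init__(self):
        self._products = {}   # product name -> metadata dict
        self._reviews = {}    # product name -> list of review texts

    def register(self, name, metadata):
        """List a data product, but only if all mandatory metadata is present."""
        missing = REQUIRED_METADATA - set(metadata)
        if missing:
            raise GovernanceError(f"{name}: missing required metadata {sorted(missing)}")
        self._products[name] = dict(metadata)
        self._reviews.setdefault(name, [])

    def search(self, **filters):
        """Return the names of products whose metadata matches every filter."""
        return sorted(
            name
            for name, meta in self._products.items()
            if all(meta.get(key) == value for key, value in filters.items())
        )

    def add_review(self, name, text):
        """Let a consumer share their experience with a listed product."""
        self._reviews[name].append(text)

    def reviews(self, name):
        return list(self._reviews[name])


catalog = Catalog()
catalog.register("orders", {
    "owner_domain": "sales",
    "monthly_cost": 1200,
    "availability_slo": 0.999,
    "description": "One row per confirmed customer order",
    "has_event_stream": True,   # optional, owner-chosen extra
})
catalog.register("customers", {
    "owner_domain": "crm",
    "monthly_cost": 800,
    "availability_slo": 0.995,
    "description": "Deduplicated customer master data",
})
catalog.add_review("orders", "Fresh every hour, matched our finance totals.")

print(catalog.search(owner_domain="sales"))
```

Since every listing carries the same mandatory fields, any of them can serve as a filter, exactly as “parking” or “breakfast included” would on a hotel site, while owners remain free to advertise extras such as the event stream flag above.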
A no brainer?
Put like that, it seems that “going with the Data Mesh” is a no-brainer. And that can be true for large corporations, but do keep in mind that building a Data Mesh is a mammoth task. If you only have three or four hotels, it goes without saying, it doesn’t make sense to build a booking platform. What’s important to keep in mind, though, is that a Data Mesh architecture, to express its full potential, requires a deep organizational change in the company. To cite the most obvious aspect, the data engineers need to “migrate” from the center (the Data Lake) to the data producers, to guide them in the process of properly preparing the data, conforming to the federated governance rules and exposing it correctly so that it can be found and utilized (thus also generating revenue through internal sales for the data owner). It also requires a change of mentality, so that the whole company can start to visualize data as a product, data producers as data owners and break free from the limitations and bottlenecks of a Data Lake, reaping the benefits of a truly distributed architecture and the new paradigm.
Luca Maestri, Chief Financial Officer of Apple, famously said that people tend to attribute the success of huge companies, the likes of Apple, Amazon, Google, or Facebook, to their being creative laboratories, where a great number of innovative ideas can emerge, but his experience taught him that this is not the case. These companies succeed because they are “execution machines”. In other words, a great idea has no value if you cannot execute upon it effectively and quickly: on time and on budget. But, first, you need the right tools to be able to determine time and budget constraints. Creating a Data Mesh is a huge undertaking, but it means building the solid foundations that will support the evolution of your data-driven business. You can have all the data in the world in your Data Lake, but if you can’t leverage it, effectively and sustainably, you won’t move forward. Because in today’s world standing still means going backwards, the only way to stay competitive is to create new products, services, and solutions for your customers. In order to be an “execution machine”, you need to be able to spend your time looking for opportunities instead of searching for your data, and analyzing the marketplace and chasing new clients and upsell propositions instead of rummaging through your Data Lake.
If you can do all of that, you definitely deserve a relaxing, rewarding weekend, to look at the placid lake from your house and remember the time when your Data Lake was equally still, unmoving and hard to see under the surface. Come Monday, it will be time for a new week of effective, efficient, data-driven new business.
Posted by Paolo Platter
CTO & Co-Founder. Paolo explores emerging technologies, evaluates new concepts, and technological solutions, leading Operations and Architectures. He has been involved in very challenging Big Data projects with top enterprise companies. He's also a software mentor at the European Innovation Academy.
On this topic, you might be interested in “How and why Data Mesh is shaping the data management’s evolution”, or you can learn more about how Witboost can get your Data Mesh implementation started quickly.