Data mesh and monoliths integration patterns


With the advent of data mesh as a sociotechnical paradigm that decentralizes data ownership and helps data-driven companies scale out fast, many businesses face organizational and technical hurdles in adopting it.

This article focuses on the technical issues of a company that is breaking up its data monoliths to enable inbound and outbound integration with the data mesh. I will discuss integration patterns that address several organizational issues and help achieve the necessary tradeoffs among the stakeholders of a data mesh initiative.

A data mesh makes sense most of all for huge enterprises. The bigger a company is, the clearer it becomes where the data value chain comes under too much tension to fit the current organization.

Adopting data mesh is far from being a mere technological problem; most of the issues occur in the organization of the company. That said, current technologies remain real concerns and impediments to the adoption of the data mesh paradigm.

Looking positively at the future of a data mesh initiative, after removing the major communication roadblocks and technical silos in the principal folds of the enterprise, and having built the foundation of a data mesh platform, we can expect to work in a bimodal fashion for a long while. The need for innovation in fast-paced enterprises will push vendors of technological monoliths closer to the new paradigm, and this may change the landscape.

 

A new data mesh will live side by side with data monoliths for a long period.

Nowadays, on one side, we see new data products growing in the data mesh ecosystem, leveraging the self-serve capabilities of the platform. On the other side, we see data mesh adopters striving to integrate data from the several silos (operational and analytical) that realistically still exist in any big enterprise.

Closing the gap between operational and analytical sides

Domain-Driven Design has been around for a while, shaping microservice architectures in many companies. Data mesh inherits those concepts and extends them to the analytical plane, embedding best practices from DevOps, product thinking, and systems thinking. Here lies an opportunity to close the gap between operational and analytical needs.

Teams should be encouraged to see a source data product holistically, as a direct extension of their domain-driven entities. The same team working on microservices may also take on analytical scenarios, opening up to data engineers, data scientists, and ML engineers.

The data mesh platform team should enable easy integration between the analytical and operational planes by providing as many out-of-the-box data integration services as possible, which the domain can autonomously instantiate as part of its data product workloads.

A domain-driven microservice architecture can close the gap with the analytical plane.

From operational silos to data mesh

Big enterprises are not homogeneous: there are many silos of both operational and analytical nature, and many of the operational ones will realistically never move to a microservice architecture because of the complexity of removing the technological silos that some key business capabilities rely on. This is common for legacy CRM and ERP operational systems. They are typically closed systems that enable inbound integrations but restrict outbound ones, and analytical access to those monoliths is often difficult in terms of data integration capabilities. Extracting datasets from the silos is nevertheless required to enable analytical scenarios within the data mesh.

The data mesh platform should either enable native integration with the data silos, embedding extraction functionalities within the data product as internal workloads, or be open to customization, moving the burden of the integration to the domain/data product team.

Legacy systems preserving domain entities

Internal workloads extract data from operational silos. This is the case of a monolith that can expose a dataset representative of a domain-driven aggregate.

Enabling native integration with an external operational silo is the recommended approach, since those legacy systems usually impact many domains and data products. This is particularly beneficial if the monolith can export a dataset representative of the data product's data model.

Silos enabling direct integration

Such legacy systems are often generic and offer data modeling customization on top of a hyper-normalized legacy data model. Bulk extractions may or may not be able to address the customer's data model, so data products may need to deal with the underlying normalized legacy data model of the monolith. Moreover, there is no guarantee that the customer's data model fits the domain-driven design adopted by the data mesh.

In this case, every data product has to offload potentially many normalized datasets and combine them to achieve consistency with its bounded context. This work might be repeated for every data product of each domain impacted by the monolith.
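As an illustration, here is a minimal PySpark sketch, assuming hypothetical normalized tables (customer, address, contract) offloaded from the monolith and an invented "customer" bounded context; all paths, table, and column names are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-bounded-context").getOrCreate()

# Hypothetical normalized tables offloaded from the legacy monolith.
customers = spark.read.parquet("/landing/crm/customer")
addresses = spark.read.parquet("/landing/crm/address")
contracts = spark.read.parquet("/landing/crm/contract")

# Denormalize into a dataset aligned with the "customer" bounded context:
# one row per customer, with the primary address and the active contract count.
primary_addr = addresses.filter(F.col("is_primary") == True)

customer_view = (
    customers.alias("c")
    .join(primary_addr.alias("a"), "customer_id", "left")
    .join(
        contracts.filter(F.col("status") == "ACTIVE")
        .groupBy("customer_id")
        .agg(F.count("*").alias("active_contracts")),
        "customer_id",
        "left",
    )
    .select("customer_id", "c.full_name", "a.city", "a.country", "active_contracts")
)

customer_view.write.mode("overwrite").parquet("/data-product/customer/output/customer_view")
```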

Silos not enabling direct integration

The previous approach may not be the best compromise when there is no easy direct integration with the external operational monolith; in that case, an intermediate exchange layer (a sort of L0-only data lake) can help. This exchange layer might be operated by the same team that runs the legacy system, keeping responsibility for the quality of this data on the source side. The purpose of the exchange layer is to minimize the effort of offloading data from the operational monolith in a shape that enables integration with the data mesh platform. The extraction of that data is governed by the data silo team and can be operated through traditional tools in their classical ecosystem. Later, we will discuss in more detail how to deal with analytical data monoliths (including such an exchange layer).

This connection is a weak point in the data mesh, since the exchange layer tends to become no man's land; it should therefore be considered a last resort. As usual, data ownership and organization can make a difference.

Strangler pattern

A final option I would consider for integrating operational silos into the data mesh platform is possible when the data mesh platform and the operational platform converge, as in the first case described above.

The operational platform is the set of computational spaces that runs online workloads and is operated through best-of-breed DevOps practices and tools.

We may reflect the bounded context from an operational point of view into the operational platform by applying the strangler pattern. This may be particularly useful if the enterprise wants to get rid of the legacy system and applies to the case of internally developed application monoliths.

Depending on the case, a company may decompose the monolith by creating the interfaces first, through the implementation of facades. Once domain-driven views of the monolith exist, the data mesh platform gains the same integration opportunities it already has with the operational platform. Facades can then be converted into microservices that replace parts of the application monolith.

The strangler pattern enables domain-driven decomposition of application monoliths.
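As a sketch of the facade-first step described above, assuming a hypothetical legacy CRM client and an invented domain model (all names are illustrative), a facade can expose a domain-driven view of the monolith without yet touching its internals:

```python
from dataclasses import dataclass

# Hypothetical domain entity defined by the bounded context, not by the legacy schema.
@dataclass
class Customer:
    customer_id: str
    full_name: str
    segment: str

class LegacyCrmClient:
    """Thin wrapper around the monolith's existing API or database (placeholder)."""
    def fetch_customer_record(self, customer_id: str) -> dict:
        raise NotImplementedError  # call the legacy system here

class CustomerFacade:
    """Domain-driven facade: callers depend on the Customer entity, not on the legacy model.
    Later, the facade can be reimplemented as a microservice that replaces part of the monolith."""
    def __init__(self, legacy: LegacyCrmClient):
        self._legacy = legacy

    def get_customer(self, customer_id: str) -> Customer:
        record = self._legacy.fetch_customer_record(customer_id)
        # Map the hyper-normalized legacy fields to the domain entity.
        return Customer(
            customer_id=record["CUST_ID"],
            full_name=f'{record["FIRST_NM"]} {record["LAST_NM"]}',
            segment=record.get("SEGMENT_CD", "UNKNOWN"),
        )
```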

From analytical monoliths to data mesh

In the previous discussion, I analyzed some of the common cases where operational silos require integration with the data mesh platform and data products. In this section, I want to address the integration of existing analytical data monoliths with the data mesh. This may be needed for several reasons: to remove old monoliths or to make the data mesh coexist with them.

Data monoliths can have several shapes. They may be custom data lakes or lakehouses, open-source-based data platforms, legacy data platforms, DWH systems, etc. The recurring issue with those monoliths is that it is easy to bring data in and hard to get data out.

Building purely technical data products does not make sense. The purpose of this exercise is therefore to let the data mesh leverage existing monoliths and enable analytical scenarios.

In this scenario, we will build either source or consumer data products, depending on whether the data from the silos reflects data sources (operational systems) not yet accessible from the data mesh platform, or analytical data assets that must be shaped into data products until they can be produced directly within the data mesh.

Strangler pattern for data monoliths

The data mesh can be fed through extraction or serving. In the extraction case, the monolith produces a dataset or events corresponding to the bounded context of the respective data product. In the serving case, the monolith prepares a data view or events that serve data through a proper connector. Serving is generally preferred to extraction since it reduces the need to copy data. In any case, the right choice requires a compromise among performance, maintenance, organization, and technological opportunities.

Extraction and serving are preparation steps located within the original boundaries of the monolith.

The ownership of those extraction and serving steps should belong to the team that manages the data silo.

Extraction and serving apply to both streaming and batch scenarios, covering bulk and incremental operations, scheduled or near real time. In streaming scenarios, a data silo serves events whenever the event store resides on its side: for instance, the traditional data platform provides CDC, stream processing, and an event store to build facades compliant with data products, which can subscribe to and consume those events. Conversely, extraction occurs when the event store is part of the data product and the traditional data platform uses CDC to produce ad hoc events for that data product.
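For instance, a minimal sketch of the serving side from the data product's point of view, assuming a Kafka-based event store, the kafka-python package, and an invented CDC payload layout (topic, broker, and field names are made up):

```python
import json
from kafka import KafkaConsumer  # assumes the kafka-python package

# The data product subscribes to the facade topic served by the legacy platform.
consumer = KafkaConsumer(
    "crm.customer.changes",                # hypothetical CDC facade topic
    bootstrap_servers="broker:9092",
    group_id="customer-data-product",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def to_domain_event(cdc_record: dict) -> dict:
    """Map a raw CDC change record to the event shape agreed in the data contract."""
    after = cdc_record["after"]            # hypothetical CDC payload layout
    return {
        "event_type": f'customer_{cdc_record["op"]}',  # e.g. c/u/d operations
        "customer_id": after["CUST_ID"],
        "full_name": f'{after["FIRST_NM"]} {after["LAST_NM"]}',
        "occurred_at": cdc_record["ts_ms"],
    }

# Runs indefinitely: each change becomes a domain event for the data product's workloads.
for message in consumer:
    event = to_domain_event(message.value)
    print(event)  # hand over to storage, output ports, etc.
```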

Datasets and data views can be seen as facades that enable the strangler pattern for the data monolith, until the same data product can be rebuilt within the data mesh platform itself.

This is essential to limit the impact of the data integration on the domain teams. Of course, managing such boundaries requires governance and should guarantee a certain level of backward compatibility and maintenance.

Data contracts

To regulate the exchange from data monolith to data products, the two parties (domain and data silos organization) should land on a data contract including:

  • Formats, naming conventions, data stores, etc.
  • Immutable schemas
  • Schema evolution mechanisms

In fact, data schemas must be immutable to guarantee backward compatibility. Also, making the team that manages the data silos responsible for the data views and datasets dedicated to the data mesh integration should make it easier to identify data incidents.

A schema evolution mechanism is necessary to let the data mesh evolve without impacting the consumer data products that depend on the data product integrated with the external silos.
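A minimal sketch of what "immutable schemas plus controlled evolution" can mean in practice (the contract layout and rules are illustrative): a new contract version may only add optional fields, never remove or retype existing ones.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Accept a new contract version only if it keeps every existing field
    with the same type and adds, at most, optional fields."""
    for name, spec in old_schema["fields"].items():
        new_spec = new_schema["fields"].get(name)
        if new_spec is None or new_spec["type"] != spec["type"]:
            return False
    for name, spec in new_schema["fields"].items():
        if name not in old_schema["fields"] and spec.get("required", False):
            return False
    return True

v1 = {"fields": {"customer_id": {"type": "string", "required": True},
                 "full_name":   {"type": "string", "required": True}}}
v2 = {"fields": {"customer_id": {"type": "string", "required": True},
                 "full_name":   {"type": "string", "required": True},
                 "segment":     {"type": "string", "required": False}}}  # additive, optional

assert is_backward_compatible(v1, v2)       # allowed evolution
assert not is_backward_compatible(v2, v1)   # dropping a field would break consumers
```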

Input ports

A further improvement can be the standardization of those ingestion mechanisms. If the need to integrate external data silos with data products is recurrent, the federated governance may spot the opportunity for standardization and require the implementation of standard interfaces to get data from those data silos. This would make those mechanisms available to all data products as self-serve internal workloads that leverage the integration.

Input ports can work in push or pull mode depending on the case. It is important for the data product to protect its own data from side effects regardless of the way this data integration happens. That is, pushing data from outside should not imply losing control of the quality provided by the output ports. 

Finally, an input port should work both internally and externally to the data mesh platform; it represents the means by which a data product receives the data it needs to be built.
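A minimal sketch (all interfaces invented) of an input port that works in either push or pull mode while protecting the data product from bad incoming data, so that the quality promised on the output ports is never compromised:

```python
from typing import Callable, Iterable

Record = dict

class InputPort:
    """Input port of a data product: data can be pushed from outside or pulled
    from a source, but everything passes the same validation before landing."""
    def __init__(self, validate: Callable[[Record], bool], sink: Callable[[Record], None]):
        self._validate = validate
        self._sink = sink
        self.rejected: list[Record] = []

    def push(self, records: Iterable[Record]) -> None:
        """Push mode: an external system delivers records to the port."""
        for r in records:
            if self._validate(r):
                self._sink(r)
            else:
                self.rejected.append(r)  # quarantined, never exposed on output ports

    def pull(self, fetch: Callable[[], Iterable[Record]]) -> None:
        """Pull mode: the data product fetches records from the external silo itself."""
        self.push(fetch())

# Example wiring with invented validation rules and an in-memory sink.
landed: list[Record] = []
port = InputPort(
    validate=lambda r: r.get("customer_id") is not None and r.get("country") is not None,
    sink=landed.append,
)
port.push([{"customer_id": "42", "country": "IT"}, {"customer_id": None}])
```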

From the data mesh to monoliths

A data mesh may need to coexist with other data monoliths for a long time, or forever. This means that the data mesh is not an isolated system and data must be able to flow outbound as well. This is obvious in a transition phase where, for instance, data products built within the data mesh platform must feed preexisting ML systems that still live outside the data mesh.

Boundary ports

In this case, we should preserve output ports for the data mesh without diluting their semantics for technical reasons. Instead, we can introduce boundary ports that are specifically used to enable this data exchange with external monoliths. A boundary port should not be an input for other data products, and the data mesh platform should prevent data products from consuming from boundary ports.

Boundary ports can serve as input for external data monoliths.

As in the case of analytical monoliths feeding the data mesh, boundary ports may be subject to standardization by the federated computational governance. The data mesh platform can facilitate those data integration patterns in the cases where the data monolith looks long-standing or cannot be removed at all.

The design of boundary ports can be driven by the technological needs of outside systems. Being technical plugs, boundary ports may leverage the technical capabilities of the data mesh platform to adapt to external data consumption interfaces. The ability to adapt to external systems is key for several reasons. For instance, the agility of data mesh platform development can produce boundary ports that accelerate the decomposition of those monoliths by facilitating that integration; it would be much harder to chase legacy system capabilities to speed this process up. This relieves slow-changing organizational units of effort, bringing the burden where it can be resolved faster. Finally, keeping this effort close to the data product retains responsibility for data quality within the data product.
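A minimal sketch (file layout and names are invented) of a boundary port that adapts a data product's curated output to the format an external monolith expects, for example a flat CSV extract for a legacy DWH:

```python
import csv
from pathlib import Path
from typing import Iterable

def boundary_port_csv_extract(rows: Iterable[dict], target_dir: str, batch_id: str) -> Path:
    """Boundary port: adapts the data product's curated data to the flat-file layout
    consumed by an external legacy DWH. It is not consumable by other data products."""
    target = Path(target_dir) / f"customer_extract_{batch_id}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = ["customer_id", "full_name", "city", "active_contracts"]
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return target

# The same curated data served on the output ports feeds the boundary port,
# so data quality responsibility stays within the data product.
boundary_port_csv_extract(
    [{"customer_id": "42", "full_name": "Ada Rossi", "city": "Turin", "active_contracts": 2}],
    target_dir="exports/legacy-dwh", batch_id="2024-01-31",
)
```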

Agility

The integration patterns presented here are meant to enable agility. Breaking an external monolith may be very cumbersome and costly, so an incremental and iterative approach is needed to onboard source and consumer data products:

Data mesh—monolith integration patterns can bring agility in the management of the adoption phase.

  • Source data products should have precedence since they can be used to build consumer data products. Thus, integration patterns that work on operational systems can help to ease the adoption of source data products.
  • Sometimes operational systems are not immediately available due to technical challenges. This may shift the attention first to data monoliths that contain copies of the original operational data. Depending on where the integration effort is placed, serving or extraction mechanisms can be implemented. For a monolith consuming from the data mesh platform, there are two ways to be fed: output ports and boundary ports. Output ports require no extra effort from the data mesh platform, since they are natively available, moving the burden towards external systems. Boundary ports add some backlog to the platform product in order to facilitate the integration where external legacy systems are weaker. The choice may depend on many factors, including organization, culture, and technologies.
  • Consumer data products could be necessary even before source data products are available, as they may enable business capabilities essential to the company. In this case, a domain can anticipate the onboarding of such data products by relying on external data monoliths during the transition: the consumer data products open input ports to get data from those data silos. Later, once the related source data products are created within the data mesh, the input ports can be pointed to those source data products.
  • In the case of legacy systems that are not going to be replaced, permanent data paths must be established relying on output and boundary ports for outbound integration and input ports for inbound integrations.

Use cases

Integration patterns between a data mesh and the surrounding application and data monoliths turn out to be necessary in many cases, which we explore here. Beyond those patterns, it is important to focus on organizational issues, above all data ownership in the gray areas of the company. Setting the ownership of data and processes during a transition, or across blurred organizational boundaries, is essential to the successful evolution and maintenance of any initiative. The data mesh is no different in that respect.

Data mesh adoption 

During a data mesh initiative, the cooperation between existing legacy systems and operational and analytical silos is crucial. Working bimodally throughout the adoption period is essential to migrate to the data mesh with agility.

Change management is key to success and having solid integration patterns that can assist the evolution can reduce risks and frictions.

A data mesh initiative is realistically rolled out over years, from inception to industrialization. The adoption must be agile: it is neither possible nor convenient to invest years in developing a data mesh platform before the company uses it. This means that a disruptive adoption plan requires all stakeholders to be onboarded at the right pace. Whether you are in food and beverage, banking and insurance, or utilities and energy, a fast-growing data-driven company must pass through a coarse organization of the adoption and then refine it along the way.

This means first identifying a champion team that can embrace the change at a rapid pace. In parallel, it is necessary to start early on onboarding the parts of the company that rely on application and data monoliths: they require much more time to be absorbed, and sooner or later the adoption will face the need to resolve the data integration problem.

On the one hand, the adoption requires identifying all existing monoliths that will remain forever and determining inbound and outbound permanent data paths to enable analytical scenarios depending on those silos.

On the other hand, the adoption would determine which monoliths must be broken. In this case, the strangler pattern can help us divide and conquer the acquisition of those data assets transforming them into data products.

Merger & acquisition deals

Another common case in the data mesh path is dealing with M&As of other companies. The data mesh should reduce acquisition risks and accelerate the value chain coming from acquired data assets. This is the promise of the data mesh to ease the business of data-driven companies. 

In this scenario, opening data paths between the involved parties can be facilitated through the adoption of integration patterns that assist IT teams from all sides during the transition. 

Of course, M&As can reach another degree of complexity given that, even with a mature data mesh, some of the parties could have no familiarity with DDD, data mesh, or any of those concepts, posing many organizational and cultural issues to consider.

In this sense, I would argue that in the case of an M&A there are major issues that need a reconciliation effort between the parties. Data assets and the organizational units in charge of operating data will sooner or later face the need to integrate with the existing data mesh; this will also occur for online systems. Having tested integration patterns to rely on can relieve some design effort at bootstrap. The precedence between online systems and analytical ones will be driven by factors beyond data mesh doctrine alone. In any case, at a certain point data assets will start moving towards the data mesh platform, and solid integration patterns can help break monoliths on all sides.

Conclusions

This article presented integration patterns between operational and analytical monoliths and a data mesh. It reviewed the strangler pattern to break data monoliths through extraction (datasets) and serving (data views) mechanisms governed by data contracts. Boundary ports represent the opportunity to formalize temporary or permanent data paths from the data mesh platform to external data silos. Finally, I presented two primary use cases for applying these integration patterns: data mesh adoption and M&A events, to establish temporary and permanent data paths. The advantage of this approach is to keep data ownership as close as possible to the data source, to enable integration patterns that facilitate the construction of data products depending on external legacy systems, and to have a sustainable path to iteratively and incrementally break the data silos that must be absorbed into the data mesh.

If you’re interested in other articles on the Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Thank you for reading.

Author: Ugo Ciracì, Agile Lab Engineering Leader

Decentralized Data Engineering


Centralized Data Engineering

The centralized approach is the most common when a company jumps into the world of data. Hiring a Head of Data ("been there, done that") is the first step. She will begin to structure processes and choose a technology to create the data platform. The platform contains all the business data, which is organized and processed according to all the well-known requirements we have in data management.
In this scenario, data processing for analytical purposes occurs within a single technology platform governed by a single team, divided into different functions (ingestion, modeling, governance, etc.). Adopting a single technology platform generates a well-integrated ecosystem of functionalities and an optimal user experience. The platform typically also covers most of the compliance and governance requirements.
The Data Team manages the platform as well as the implementation of use cases. For those defining principles and practices, it is straightforward to align the team because they have direct control over the people and the delivery process. This methodology ensures a high level (at least in theory) of compliance and governance. However, it typically doesn't meet the time-to-market requirements of business stakeholders, because the entire practice turns out to be a bottleneck when adopted in a large-scale organization. Business stakeholders (typically those paying for data initiatives) have no alternative but to adapt to the data team's timing and delivery paradigm.


Shadow Data Engineering


Multi-Tech Silos Data Engineering

By the time Shadow Data Engineering has run for a couple of years, it is no longer possible to return to Centralized Data Engineering. Technologies have penetrated the processes, the culture, and the users, who become fond of the features rather than their impact.
Centralizing again on a single platform would cost too much and slow down the business. While the limits of the shadow approach are now clear, the business benefits in terms of agility and speed have been valuable, and no one is willing to lose them.
Usually, companies tend to safeguard the technologies that have expanded the most, adding a layer of management and governance on top and creating skill and process hubs for each. Typically, they try to build ex post an integration layer between the various platforms to centralize data cataloging or cross-cutting practices such as data quality and reconciliation. This process is complicated because the platforms/technologies have not been selected with interoperability in mind.
The various practices will be customized for each specific technology, ending up with a low level of automation.
It will still be possible to reduce the data debt, but the construction of these governance and human-centric processes will inevitably lead to an overall slowdown and increased effort without any direct impact on the business.
Although it will be possible to monitor and track the movement of data from one silo to another, supported by audit and lineage processes, the information isolation of the various silos will persist.

Decentralized Data Engineering

There is a massive difference between adapting a practice to a set of technologies and related features and adapting and selecting technologies to fit into a practice.

A clear set of principles must be a foundation when defining a practice.

Policies can be applied to compliance, security, information completeness, data quality, governance, and many other relevant aspects of data management, creating a heterogeneous but trustable ecosystem.

When introducing a new technology, the first driver to evaluate is interoperability (at the data, system, and process level).

It is crucial to select composable services and open formats/protocols.

Data Contract first means that the contract is defined during the development lifecycle; it then drives the automated creation of resources and standardizes the metadata exchange mechanisms.

This is a direct consequence of a Data Contract first approach. With data contracts, we want to shift the focus to the result we want to achieve. But data is always the result of a transformation, which should be strongly connected with the data contract. The only way to do this is by adopting declarative semantics integrated with the data contract; otherwise, you could have pipelines generating information not aligned with the declared data contract. Connecting data contracts and transformations guarantees the consistency of schema metadata and lineage information out of the box. There is no time to keep a layer of documentation aligned when multiple teams work with high agility and speed; documentation should be generated directly from the artifacts and the related inferred metadata.
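A minimal sketch of the idea (contract layout and helper names are invented): the transformation is declared alongside the contract, the output is checked against the declared schema, and documentation/lineage metadata is generated from the same artifact instead of being maintained by hand.

```python
import json

# Data contract declared first, alongside the transformation that produces the data.
contract = {
    "name": "customer_profile",
    "version": "1.2.0",
    "schema": {"customer_id": "string", "full_name": "string", "active_contracts": "int"},
    "inputs": ["crm.customer", "crm.contract"],   # lineage comes from the declaration
}

def transform(customers: list[dict], contracts: list[dict]) -> list[dict]:
    counts: dict[str, int] = {}
    for c in contracts:
        if c["status"] == "ACTIVE":
            counts[c["customer_id"]] = counts.get(c["customer_id"], 0) + 1
    return [
        {"customer_id": c["customer_id"],
         "full_name": c["full_name"],
         "active_contracts": counts.get(c["customer_id"], 0)}
        for c in customers
    ]

def check_against_contract(rows: list[dict], contract: dict) -> None:
    """Fail fast if the produced data drifts from the declared contract."""
    expected = set(contract["schema"])
    for row in rows:
        if set(row) != expected:
            raise ValueError(f"schema drift: {set(row) ^ expected}")

def generate_docs(contract: dict) -> str:
    """Documentation generated from the artifact, not maintained separately."""
    return json.dumps(contract, indent=2)

rows = transform([{"customer_id": "42", "full_name": "Ada Rossi"}],
                 [{"customer_id": "42", "status": "ACTIVE"}])
check_against_contract(rows, contract)
print(generate_docs(contract))
```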

So, in the decentralized data engineering practice, the platform team must build and maintain an Internal Data Engineering Platform where it should publish a set of pre-built templates (scaffolds) and create policies (as code) to enforce the process.

I realize that building the Internal Data Engineering Platform requires cost and effort, but it should be supported by a long-term strategic vision. Furthermore, there are no free lunches in the data world.


Thanks to Roberto Coluccio.

 

If you’re interested in other articles on Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Thank you for reading.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

Data Product Specification: power up your metadata layer and automate your Data Mesh with this practical reference


Just like every informative content in the Data Mesh area should do, I’d like to start quoting the reference article on the topic, by Zhamak Dehghani:

[…] For data to be usable there is an associated set of metadata including data computational documentation, semantic and syntax declaration, quality metrics, etc; metadata that is intrinsic to the data e.g. its semantic definition, and metadata that communicates the traits used by computational governance to implement the expected behavior e.g. access control policies.

A “set of metadata” is hereby to be associated with our Data Products. I’ve read a lot of articles (I’m gonna report some references along the way in the article) about why it’s important and what theoretically can be achieved by leveraging them… they all make a lot of sense. But — as frequently happens when someone talks about Data Mesh related aspects — there’s often a lack of “SO WHAT”.

For this reason, I’d like to share, after a brief introduction, a practical approach we are proposing at Agile Lab. It’s open-source, it’s absolutely not perfect for every use case, but it’s evolving along with our real-world experience of driving enterprise customers’ Data Mesh journeys. Most importantly, it’s currently used — in production, not just on articles — at some important customers of ours to enable the automation needed by their platform supporting their Data Mesh.

Why metadata are so important

I believe that automation is the only way to survive in a jungle of domains willing to create Data Products. Automation with respect to:

  1. easy and fast creation of Data Products leveraging pre-built reusable templates that guarantee out-of-the-box compliance to the Data-as-a-Product principle;
  2. implementation and execution of the Federated Governance policies (which therefore become “computational”);
  3. enablement of Self-Service Infrastructure, as a fundamental pillar to provide Decentralized Domains the autonomy and ownership they need.

These pillars are all mandatory to reach scale. They are complex to set up and spread, both as cultural and technical requirements, but they are foundational if we want to build a Data Mesh and harvest that so widely promised value from our data.

But how can we get to this level of automation?
By leveraging a metadata model representing/associated with a Data Product.

Such a model must be general, technology agnostic, and imagined as the key enabler for the automation of a Data Mesh platform (to be intended as an ecosystem of components and tools taking actions based on specific sections of this metadata specification, which must be standardized to allow interoperability across the platform’s components and services that need to interact with each other).

State of the art


In order to provide a broader view on the topic, I think it’s important to report some references to understand what brought us to do what we did (OK, now I’m creating hype on purpose 😁).

  • Feed the Control Ports: Arne Roßmann in this article describes how control ports of Data Products should serve foundational APIs by leveraging metadata collected at different levels.
  • Active Metadata: Prukalpa Sankar in this article mentions a commercial product tackling the active metadata space as a way to autonomously collect precious pieces of information about several things from SQL plans to understand data lineage, to monitor Data Products update rate, and other very cool stuff.
  • Metadata types: Zeena in this post introduces their Data Catalog solution after reporting a very interesting differentiation of metadata into DESCRIPTIVE, STRUCTURAL, ADMINISTRATIVE.
  • Standards for Metadata in general: OpenMetadata is doing excellent work in creating standards and a framework for metadata management, including specific vocabulary and syntax (even if in some cases it becomes quite complex IMHO).
  • General Data Product definition: what gets closer to our vision is OpenDataProducts, a good work to factor out some important information around the general Data Product concept (e.g. Billing and SLA). However, generalizing too much is risky and, in this case, I believe they got a little too far from the way Data Mesh requires a Data Product to be described and built. For example, there’s no mention of Domains, Schema, Output Ports, or Observability.

According to these references, there’s still no clear metadata-based representation of a Data Product addressing — specifically — the Data Mesh and its automation requirements.

How we conceive a data product specification

We believe in a declarative, idempotent, and versioned Infrastructure-as-Code approach. The goal is to have a standardized yet customizable specification addressing the holistic representation of the Data Mesh's architectural quantum: the Data Product.

Principles involved:

  • Data Product components (input ports, output ports, workloads, staging areas)
  • self-documenting Data Product (from schema to dependencies, to description and access patterns)
  • clear ownership (from Domain to Data Product team)
  • semantic linking with other Data Products (like polysemes to help improve interoperability and references to build a semantic layer or Knowledge Graph on top of the Data Mesh)
  • data observability
  • SLA/SLO and Data Contracts (expectations, agreements, objectives)
  • Access and usage control

The Data Product Specification aims to gather together crucial pieces of information related to all these aspects, under the strict ownership of the Data Product Owner.

What we did so far

In order to fill up this empty space, we tried to create a Data Product Specification by starting this open-source initiative:

https://github.com/agile-lab-dev/Data-Product-Specification

The repo contains detailed field-by-field documentation; however, I'd like to point out here some features I believe to be important:

  • One point of extension and, at the same time, standardization is the expectation that some of the fields (like tags) are expressed in the OpenMetadata specification.
  • Depending on the context, several of those fields can be considered optional: whether they are populated is supposed to be checked by specific federated governance policies.
  • Although there's an aim for standardization, the model is expandable for future add-ons: your Data Mesh (and platform) will probably keep evolving for a long time, and the Data Product Specification is supposed to do the same along the way. You can start simple, then add pieces (and the related governance policies) piece by piece.
  • Being in YAML format, it's easy to parse and validate, e.g. via a JSON parsing library like circe (a minimal parsing sketch follows this list).
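For example, a minimal Python sketch, assuming the PyYAML package and invented descriptor field names (the article mentions circe for the Scala/JSON route): a platform component loads the descriptor and runs a few basic validations before any provisioning step.

```python
import yaml  # assumes the PyYAML package

REQUIRED_TOP_LEVEL = ["name", "domain", "dataProductOwner", "outputPorts"]  # illustrative

descriptor_text = """
name: customer-profile
domain: sales
dataProductOwner: jane.doe@example.com
outputPorts:
  - name: customer_view
    format: parquet
"""

def validate_descriptor(descriptor: dict) -> list[str]:
    """Return the list of violations; an empty list means the basic checks pass."""
    errors = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL if f not in descriptor]
    if not descriptor.get("outputPorts"):
        errors.append("at least one output port is required")
    return errors

descriptor = yaml.safe_load(descriptor_text)
problems = validate_descriptor(descriptor)
if problems:
    raise ValueError("; ".join(problems))
```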

The Data Product Specification itself covers the main components of a Data Product:

  • general bounded context and domain metadata the Data Product belongs to (to support descriptive API on control ports)
  • clear ownership references
  • input ports, workloads, internal storage/staging areas, output ports, and logical dependencies between them
  • data observability
  • data documentation, semantic linking info, schema, usage guidelines (from serialization format and access patterns to billing models)

NOTE: what is presented here as a "conceptual" quantum can be (and is, in some of our real-world implementations) split into its main component parts, which are then git version-controlled in their own repositories (belonging to a common group, which is the Data Product one).

Key benefits

The Data Product Specification is intended to be heavily exploited by a central Data Mesh-enabling platform (as a set of components and tools) for:

  • Data Products creation boost, since templates can be created out of this specification (for the single components and for the whole Data Products). They can then be evolved, also hiding some complexity from end-users.
  • Federated Governance policies execution (e.g., a policy statement could be "the DP description must be at least 300 chars long, the owner must be a valid ActiveDirectory user/group, there must be at least 1 endpoint for Data Observability, the region for the S3 bucket must be US-East-1", etc.), thus making "approved templates" compliant by design with the Data Mesh principles and policies (a minimal policy-as-code sketch follows this list).
  • Change Management improvement, since specifications are version controlled and well structured, thus allowing to detect breaking changes between different releases.
  • Self-service Infrastructure enablement, since everything a DP needs to be provisioned and deployed for is included in a specific and well-known version of the DP specification, thus allowing a platform to break it down and trigger the automated provisioning and deployment of the necessary storage/compute/other resources.
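A minimal sketch of how such policies could be expressed and executed as code; the rules mirror the example above, and the descriptor field names are illustrative, not part of the actual specification.

```python
def policy_description_length(dp: dict) -> bool:
    return len(dp.get("description", "")) >= 300

def policy_has_observability_endpoint(dp: dict) -> bool:
    return len(dp.get("observability", {}).get("endpoints", [])) >= 1

def policy_s3_region(dp: dict) -> bool:
    return all(s.get("region") == "us-east-1" for s in dp.get("storage", []))

POLICIES = {
    "description >= 300 chars": policy_description_length,
    "at least 1 observability endpoint": policy_has_observability_endpoint,
    "S3 buckets in us-east-1": policy_s3_region,
}

def evaluate(dp: dict) -> dict:
    """Run every computational governance policy against a Data Product descriptor."""
    return {name: check(dp) for name, check in POLICIES.items()}

# A deployment can be gated on the result, e.g. in a CI/CD step:
report = evaluate({"description": "short", "observability": {"endpoints": ["https://obs"]},
                   "storage": [{"region": "us-east-1"}]})
failed = [name for name, ok in report.items() if not ok]
print(failed)  # non-empty list -> block the deployment
```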

As I said, this specification is evolving and can be surely improved. Being open-source, every contribution is welcome.

Wrapping up

How can a Data Mesh journey reach scale? With automation that guarantees out-of-the-box compliance of Data Products with the Data Mesh principles.

If you’re interested in other articles on the Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Thank you for reading.

Author: Roberto Coluccio, Agile Lab Data Architect

Elite Data Engineering


I think it’s time to freshen up the world’s perception of Data Engineering.

This article has the dual purpose of raising awareness and acting as a manifesto for all the people at Agile Lab along with our past, present, and future customers.

In the collective imagination, the Data Engineer is someone who processes and makes data available to potential consumers. It's usually considered a role focused on achieving a very practical result: getting data there, where and when it's needed.

Unfortunately, this rather short-sighted view has led to big disasters in enterprises and beyond. Do you know why?

Data Engineering is a practice, not (just) a role.

In the past, many professionals approached the "extract value from data" problem as a technological challenge, architectural at most (DWH, Data Lake, now cloud-native SaaS-based pipelines, etc.), but what has always been missing is the awareness that Data Engineering is a practice that, as such, needs to be studied, nurtured, built, and disseminated.

At Agile Lab, we have always dissociated ourselves from this overly simplistic vision of Data Engineering, conscious that the market may misunderstand us when it expects that kind of "basic service" from us. We'd like to state this once again: Data Engineering is not about getting data from one place to another, executing a set of queries quickly (thanks to some technological help), or merely "supporting" data scientists. Data Engineering, as the word itself says, means engineering the data management practice.

When I explain our vision, I like to use parallelism with Software Engineering: after so many years, it has entered into a new era where writing code has moved way beyond design patterns, embracing governance techniques and practices such as:

  • source version control
  • testing and integration
  • monitoring and tracing
  • documentation
  • review
  • deployment and delivery
  • …and more

This evolution wasn’t driven by naïve engineers with time to waste; instead, it just happened because the industry realized that software has a massive impact, over time, on everything and everyone. A malfunctioning software with poor automation, not secure by design, hardly maintainable or evolvable has an intrinsic cost (let’s call it Technical Debt) that is devastating for the organization that owns it (and, implicitly, for the ones using and relying on it).

To mitigate these risks, we all tried to engineer software production to make it sustainable in the long run, aware of the rise in up-front costs.

When we talk about Engineering, in a broader viewpoint, we aim precisely to this: designing detailed and challenged architectures, adopting methodologies, processes, best practices, and tools to craft reliable, maintainable, evolvable, reusable products and services that meet the needs of our society.

Would you walk over a bridge made of some wooden planks (over and over again, maybe to bring your kids to school)?

Now, I don't understand why some people talk about Data Engineering when, instead, they refer to just creating some data or implementing ETL/ELT jobs. It's like talking about Bridge Engineering while referring to putting some wooden planks between two pillars: here, the wooden planks are the ETL/ELT processes connecting the data flows between two systems. Let me ask you: is this really how you would support an allegedly data-driven company?

I see a significant disconnection between business strategies and technical implementations: tons of tactical planning and technological solutions but with no strategic drivers.

It's as if you wanted to restructure a country's road network and started building a few roads or wooden bridges here and there. No, I'm sorry, it doesn't work. You need a strategic plan, you need investments and, above all, you need to know what you are doing. You need to involve the best designers and builders to ensure that the bridges don't collapse after ten years (we should have learned something recently). It is not something that can be improvised. You cannot throw workers at it without proper skills, methodology, and a detailed project while hoping to deliver something of strategic importance.

Fortunately, there has been an excellent evolution around data practice in the last few years, but real-world data-driven initiatives are still a mess. That doesn’t surprise me because, in 2022, there are still a lot of companies that produce poor-quality software (no unit-integration-performance testing, no CI/CD, no monitoring, quality, observability, etc…). This is due to a lack of engineering culture. People are trying to optimize and save the present without caring about the future.

Since the DevOps wave arrived, the situation has moved a lot. Awareness increased, but we are still far from having a satisfactory average quality standard.

Now let’s talk about data

Data is becoming the nervous system of every company, large or small, a fundamental element to differentiate on the market in terms of service offering and automated support for business decisions.

If we genuinely believe that data is such a backbone, we must treat it "professionally"; if we really consider it a good, a service, or a product, we must engineer it. There is no such thing as a successful product that has not been properly engineered. The factory producing such products has undoubtedly been super-engineered to ensure low production costs along with no manufacturing defects and high sustainability.

Eventually, you must figure out if data for your business is that critical and behave accordingly, as usually done with everything else (successful) around us.

I can proudly say that at Agile Lab we believe data is crucial, and we assume that this is just as true for our customers.

That’s why we’ve been trying to engineer the data production and delivery process for years from a design, operation, methodology, and tooling perspective. Engineering the factory but not the product, or vice versa, definitely leads to failure in the real world. It is important to engineer the data production lifecycle and the data itself as a digital good, product, and service. Data does not have only one connotation, instead it is naturally fickle. I’d define it as a system that must stay balanced because, exactly as in quantum systems, the reality of data is not objective but depends on the observer.

Our approach has always been to consider the data as the software (and in part, the data is software), but without forgetting the greater gravity (Data Gravity — in the Clouds — Data Gravitas).

Here I see a world where top-notch Data Engineers make the difference for the success of a data-centric initiative. Still, all this is not recognized and well-identified like software development because just a few people know what a good data practice looks like.

But that’s about to change because, with the rise of Data Mesh, data teams will have to shift gears, they won’t have anyone else to blame or take responsibility in their places, they’ll have to work to the best of their ability to make sure their products are maintainable and evolvable over time. The various data products and their quality will be measurable, objectively. Since they will be developed by different teams and not by a single one, it will be inevitable to make comparisons: it will be straightforward to differentiate teams working with high-quality standards from those performing poorly, and we sincerely look forward to this.

The revolutionary aspect of Data Mesh is that, finally, we don't talk about technology in Data Engineering! Too many data engineers focus on tools and frameworks, but practices, concepts, and patterns matter just as much, if not more: eventually, all technologies become obsolete, while good organizational, methodological, and engineering practices can last for a long time.

Time to show the cards

So let’s come to what we think are the characteristics and practices that should be part of a good Data Engineering process.

Software Quality:

The software producing and managing data must be high-quality, maintainable, evolvable. There’s no big difference between software implementing a data pipeline and a microservice in terms of quality standards! In terms of workflow, ours is public and open source. You can find it here

Software design patterns and principles must also be applied to data pipeline development. Please stop scripting data pipelines (notebooks didn't help here).

Data Observability:

Testing and monitoring must be applied to data, as well as to software, because static test cases on artifact data are just not enough. We do this by embracing the principles of DODD (Kensu | DODD). Data testing means considering the environment and the context where data lives and is observed. Remember that data offers different realities depending on when and where it’s evaluated, so it’s important to apply principles of continuous monitoring to all the lifecycle stages, from development through deployment, not only to the production environment.

When we need to debug or troubleshoot software, we always try to have the deepest possible context. IDEs and various monitoring and debugging tools help by automatically providing the execution context (stack trace, logs, variable status, etc.), letting us understand the system's state at that point in time.

How often have you seen "wrong data" in a table with no idea how it got there? How exhausting was it to hunt down the piece of code allegedly generating that data (hoping it's versioned and not hardcoded somewhere), mentally re-create the corner case that might have caused the error, search for unit tests supposedly covering that corner case, and scrape the documentation hoping to find references to that scenario… It's frustrating, because the data seems to have dropped from the sky, with no traces left. I think 99% of the data pipelines implemented out there have no links between the input (data), the function (software), and the output (data), thus making troubleshooting or, even worse, the discovery of an error a nightmare.

At Agile Lab, we apply the same principles used for software: we always try to have as much context as possible to better understand what's happening to a specific data row or column. We try to set up lineage information (the stack trace), data profiling (the general context), and data quality checks before and after the transformations we apply (the execution context). All this information needs to be usable right when needed and well organized.
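A minimal sketch of the idea with invented checks and helpers: profile the input, run the transformation, check the output, and keep the lineage context attached to the run so troubleshooting starts from facts rather than guesses.

```python
from datetime import datetime, timezone

def profile(rows: list[dict], columns: list[str]) -> dict:
    """General context: row count and null ratio per column."""
    n = len(rows) or 1
    return {"rows": len(rows),
            "null_ratio": {c: sum(r.get(c) is None for r in rows) / n for c in columns}}

def run_with_observability(name, inputs, transform, quality_checks):
    """Execution context: lineage, before/after profiles and quality results in one record."""
    before = {src: profile(rows, list(rows[0]) if rows else []) for src, rows in inputs.items()}
    output = transform(inputs)
    after = profile(output, list(output[0]) if output else [])
    failures = [check.__name__ for check in quality_checks if not check(output)]
    return {
        "pipeline": name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "lineage": {"inputs": list(inputs), "output": f"{name}.output"},
        "profile_before": before,
        "profile_after": after,
        "quality_failures": failures,   # empty list means all checks passed
    }, output

def no_null_ids(rows):                  # invented quality check
    return all(r.get("customer_id") is not None for r in rows)

report, data = run_with_observability(
    "customer_profile",
    {"crm.customer": [{"customer_id": "42", "full_name": "Ada Rossi"}]},
    transform=lambda inputs: inputs["crm.customer"],
    quality_checks=[no_null_ids],
)
print(report)
```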

“OK you convinced me: which tool can make all of this possible ?”

Well, the reality is that a practice is about people, professionals, not tools. Of course, many tools help in implementing the practice, but without the right mindset and skills they are just useless.

“But I have analysts creating periodic reports and they can spot unexpected data values and distributions”

True, but ex-post checks don't solve the problem of consumers having probably already ingested faulty data (remember the data gravity thing?), and they don't provide any clues about where and why the errors came up. This is to say that it makes a huge difference to proactively anticipate checks and monitoring so as to make them synchronous with the data generation (and part of the lifecycle of the applications generating the data).

Decoupling:

It has become a mantra in software systems to have well-decoupled components that are independently testable and reusable.

We use the same measures with different facets in data platforms and data pipelines.

A data pipeline must:

  • Have a single responsibility and a single owner of its business logic.
  • Be self-consistent in terms of scheduling and execution environment.
  • Be decoupled from the underlying storage system.
  • Keep I/O actions decoupled from business (data transformation) logic, so as to facilitate testability and architectural evolution (a minimal sketch follows this list).
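A minimal sketch of the last point (names invented): the business logic is a pure function with no knowledge of where data comes from or goes to, so it can be unit-tested in isolation and the storage can evolve independently.

```python
from typing import Callable, Iterable

Row = dict

def enrich_customers(customers: Iterable[Row], contracts: Iterable[Row]) -> list[Row]:
    """Pure business logic: no reads, no writes, trivially unit-testable."""
    active = {c["customer_id"] for c in contracts if c["status"] == "ACTIVE"}
    return [{**c, "has_active_contract": c["customer_id"] in active} for c in customers]

def run_pipeline(read: Callable[[str], list[Row]], write: Callable[[str, list[Row]], None]) -> None:
    """I/O is injected, so swapping storage (files, warehouse, object store) does not
    touch the transformation, and vice versa."""
    result = enrich_customers(read("customers"), read("contracts"))
    write("customer_enriched", result)

# In a unit test, read/write are in-memory fakes; in production they hit real storage.
store = {"customers": [{"customer_id": "42"}],
         "contracts": [{"customer_id": "42", "status": "ACTIVE"}]}
run_pipeline(read=store.__getitem__, write=lambda name, rows: store.update({name: rows}))
print(store["customer_enriched"])
```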

Also, it is essential to have well-decoupled layers at the platform level, especially storage and computing.

Reproducibility:

We need to be able to re-run data pipelines in case of crashes, bugs, and other critical situations, without generating side effects. In the software world, corner case management is fundamental, just like (or even more) in the data one (because of gravity).

We design all data pipelines based on the principles of immutability and idempotence. Regardless of the technology, you should always be in a position to:

  • Re-execute a pipeline gone wrong.
  • Re-execute a hotfixed/patched pipeline to overwrite the last run, even if it was successful (see the sketch after this list).
  • Restore a previous version of the data.
  • Run a pipeline in recovery mode if you need to clean up an inconsistent situation. The pipeline itself holds the cleaning logic, not another job.
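A minimal sketch of idempotent, re-runnable output (paths and partitioning invented): each run deterministically overwrites the partition it owns, so a crashed, hotfixed, or repeated run converges to the same result without side effects.

```python
import json
import shutil
from pathlib import Path

def write_partition(base_dir: str, run_date: str, rows: list[dict]) -> Path:
    """Idempotent write: the partition for run_date is fully replaced on every run,
    so re-executing after a crash or a hotfix never duplicates or leaks data."""
    partition = Path(base_dir) / f"dt={run_date}"
    if partition.exists():
        shutil.rmtree(partition)          # wipe any partial/previous output first
    partition.mkdir(parents=True)
    (partition / "part-0000.json").write_text(
        "\n".join(json.dumps(r, sort_keys=True) for r in rows)
    )
    return partition

# Running twice for the same date yields exactly the same partition content.
write_partition("output/customer", "2024-01-31", [{"customer_id": "42"}])
write_partition("output/customer", "2024-01-31", [{"customer_id": "42"}])
```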

So far, we have talked about the data itself, but an even more critical aspect is the reproducibility of the underlying environments and platform status. Being able to (re)create an execution environment with no or minimal manual effort can really make the difference for on-the-fly tests, no-regression tests on large amounts of data, environment migrations, platform upgrades, and switching on or simulating disaster recovery. It's an aspect to be taken care of, and it can be addressed through IaC and architectural blueprint methodologies.

Data Versioning:

The observability and reproducibility issues are rooted in the ability to move nimbly across data from different environments as well as from different versions within the same environment (e.g., rolling back a pipeline that terminated with inconsistent data).

Data, just like software, must be versioned and portable with minimum effort from one environment to another.

Many tools can help (such as LakeFS), but when they are not available it’s still possible to set up a minimum set of features allowing you to efficiently:

  • Run a pipeline over an old version of data (e.g., for repairing or testing purposes).
  • Make an old version of the data available in another environment.
  • Restore an old version of the data to replace the current version.

It is critical to design these mechanisms from day zero because they can also impact other pillars.
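When a tool such as LakeFS is not available, a minimal home-grown sketch (layout and names invented) can still provide versioned, restorable data through immutable version directories and a pointer to the current one:

```python
import shutil
from pathlib import Path

def publish_version(base: str, version: str, src_dir: str) -> None:
    """Each run lands in an immutable version directory; 'CURRENT' is just a pointer."""
    base_path = Path(base)
    target = base_path / "versions" / version
    shutil.copytree(src_dir, target)          # fails if the version already exists: versions are immutable
    (base_path / "CURRENT").write_text(version)

def restore_version(base: str, version: str) -> None:
    """Rollback = move the pointer; old data is never rewritten in place."""
    base_path = Path(base)
    if not (base_path / "versions" / version).exists():
        raise FileNotFoundError(f"unknown version: {version}")
    (base_path / "CURRENT").write_text(version)

def current_data_path(base: str) -> Path:
    version = (Path(base) / "CURRENT").read_text().strip()
    return Path(base) / "versions" / version

# A repaired pipeline can re-read an old version, publish a new one, or roll back,
# and the same layout can be copied to another environment with minimal effort.
```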

Vulnerability Assessment:

It is part of an engineer’s duties to take care of security, in general. In this case, data security is becoming a more and more important topic and must be approached with a by-design logic.

It's essential to look for vulnerabilities in your data and to do it in an automated and structured way, so as to discover sensitive data before it accidentally goes live.

This topic is so important at Agile Lab that we have even developed proprietary tools to promptly find out whether sensitive information is present in the data we are managing. These checks are continuously active, just like the observability checks.

Data Security at rest:

Fine-grained authentication and authorization mechanisms must exist to prevent unwanted access to data, or even to just a portion of it (see RBAC, ABAC, column and row filters, etc.).

Data must be protected, and processing pipelines are just like any other consumer/producer, so policies must be applied to both storage and (application) users; that's also why decoupling is so important.

Sometimes, due to strict regulation and compliance policies, it’s not possible to access fine-grained datasets, although aggregates or statistical calculations might still be needed. In this scenario, Privacy-Preserving Data Mining (PPDM) comes in handy. Any well-engineered data pipeline should compute its KPIs with the lowest possible privileges to avoid leaks and privacy violations (auditing arrives suddenly and when you don’t expect it). Have fun the moment you need to ex-post troubleshoot a pipeline that embeds PPDM, which basically gives you no data context to start from!

Data Issues:

The Data Observability chapter provides the capabilities to perform root cause analysis, but how do we document the issues? How do we behave when a data incident occurs? The last thing you want to do is "patch the data"; what you need to do is fix the pipelines (applications), which will then fix the data.

First, data issues should be reported just like any other software bug. They must be documented accurately, associated with all the contextual information extracted from data observability, to clarify what was expected, what happened instead, what tests were in place, and why they did not catch the problem.

It is vital to understand the corrective actions at the application level (they could drive automatic remediation) as well as the additional tests that need to be implemented to cover the issue.

Another aspect to consider is the ability to associate data issues with the specific versions of both software and data (data versioning) responsible for the issue, because sometimes it's necessary to put workarounds in place to restore business functionality first. At the same time, while data and software evolve independently, we must be able to reproduce the problem and fix it.

Finally, to avoid repeating the same mistakes, it's crucial to conduct a post-mortem with the team and ensure its findings are saved in a wiki-like system or otherwise associated with the issue, searchable and discoverable.

Data Review:

Just as code reviews are a good practice, a good data team should perform and document data reviews. These are carried out at different development stages and cover multiple aspects such as:

  • Documentation completeness
  • Semantic and syntactic correctness
  • Security and compliance
  • Data completeness
  • Performance / Scalability
  • Reproducibility and related principles
  • Test case analysis and monitoring (data observability)

We encourage adopting an integrated approach across code base, data versioning, and data issues.

Documentation and metadata:

Documentation and metadata should follow the same lifecycle as code and data. Computational governance policies of the data platform should prevent new data initiatives from being brought into production if they are not paired with the proper level of documentation and metadata. In the Data Mesh, this aspect will become a foundation for having good data products that are genuinely self-describing and understandable.

Ex-post metadata creation and publishing can introduce built-in technical debt due to potential misalignment with the actual data already available to consumers. It's therefore good practice to make sure that metadata resides in the same repositories as the pipeline code generating the data, which implies synchronization between code and metadata and ownership by the very same team. For this reason, metadata authoring shouldn't take place in Data Catalogs, which are supposed to help consumers find the information they need. Embedding a metadata IDE to cover up malpractices in metadata management won't help in cultivating this culture.

Performance:

Performance and scalability are two engineering factors that are critical in every data-driven environment. In complex architectures, there can be hundreds of aspects with an impact on scalability, and good data engineers should master them as foundational design principles.

Once again, I want to highlight the importance of a methodological approach. There are several best practices to encourage:

  • Always isolate the read, compute, and write phases by running dedicated tests that point out specific bottlenecks, thus allowing better-targeted optimization: when you're asked to reduce the execution time of a pipeline by 10%, it must be clear which part of the process is worth working on (a minimal timing sketch follows this list). A warning here: isolating the various phases brings up the themes of decoupling and single responsibility. If your software design doesn't respect these principles, it won't be modular enough to allow clear isolation of the steps. For example, Spark has a lazy execution model where it is very hard to discern read times from transformation and write times if you are not prepared to do so.
  • Always run several tests on each phase with different (magnitudes of) data volumes and computing resources, to extract scaling curves that allow you to make reasonable projections when the reference scenario changes: it's always unpleasant to find out that the input data is doubling in a week and we have no idea how many resources will be needed to scale up. The worst-case scenario to performance-test pipelines against should be designed with the long-term data strategy in mind.
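The following sketch, assuming PySpark and placeholder Parquet paths and transformation logic, shows one way to isolate the phases: forcing materialization with an action after each step so that read, compute, and write times can be measured separately instead of being collapsed by Spark’s lazy execution. Re-running it with different data volumes and cluster sizes also yields the data points needed for the scaling curves mentioned above.

```python
# Minimal sketch: time each phase of a Spark job separately by forcing
# materialization with an action after the read and after the transformation.
# Paths and the aggregation are placeholders, not a real pipeline.
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("phase-isolation").getOrCreate()

t0 = time.monotonic()
df = spark.read.parquet("s3://example-bucket/input/").cache()  # placeholder path
rows_read = df.count()                                         # action: forces the read phase
t1 = time.monotonic()

aggregated = (
    df.groupBy("customer_id")                                  # placeholder transformation
      .agg(F.sum("amount").alias("total_amount"))
      .cache()
)
aggregated.count()                                             # action: forces the compute phase
t2 = time.monotonic()

aggregated.write.mode("overwrite").parquet("s3://example-bucket/output/")  # write phase
t3 = time.monotonic()

print(f"rows: {rows_read}  read: {t1 - t0:.1f}s  compute: {t2 - t1:.1f}s  write: {t3 - t2:.1f}s")
```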

Wrapping up

If you’ve come this far, you’ve understood what Data Engineering means to us.

Elite Data Engineering does not focus only on data: it guarantees that the data will be of high quality, that it will be easy to find and correct errors, that it will be simple and inexpensive to maintain and evolve, and that it will be safe and straightforward to consume and to understand.

It’s about practice, data culture, design patterns, and a broader sensitivity to many more aspects than we might be used to when “just dealing with software”. At Agile Lab, we invest every single day in continuous internal training and mentoring on this, because we believe that Elite Data Engineering will be the only success factor of every data-driven initiative.

If you’re interested in other articles on the Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

 

10 practical tips to reduce Data Mesh’s adoption roadblocks and improve domains’ engagement

10 practical tips to reduce Data Mesh’s adoption roadblocks and improve domains’ engagement


Hi! If you got here, it’s probably because:

A) you know what Data Mesh is;

B) you are about to start (or already started) a journey towards Data Mesh;

C) you and your managers are sure that Data Mesh is the way to sky-rocket the data-driven value for your business/company; but

D) you are among the few who have bought in, while around you there are skepticism and worries.

The good news is: you’re DEFINITELY not alone! 

At Agile Lab we’re driving Data Mesh journeys for several customers of ours and, every time a new project kicks off, this roadblock is always one of the first to tackle.

Hoping to help anyone just approaching the problem and/or looking for some tips, here are 10 approaches we have found to work in real Data Mesh projects. We’re gonna group them by “area of worry”.

BUDGET 💸


Well, you know, in the IT world every decision is usually based on the price VS benefit trade-off.

Domains are expected to start developing Data Products as soon as possible and we need them on board, but usually “there’s no budget for new developments”, or “there’s no budget to enlarge the cluster size”. There you go:

1. Compute resources to consume data designed as part of consumers’ architecture

This is a general architectural decision rather than a tip: design your output ports (and their consumption model) to provide shared, governed access to potential consumers. With simple ACLs over your resources, you can allow secure read-only access to your Data Products’ output ports, so that no compute resources are necessary at the (source) domain/Data Product level in order to consume that data; all the power is required (AND BILLED) at the consumers’ side.
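As a minimal sketch of what this can look like, assuming an S3-backed output port with placeholder bucket and account identifiers, a bucket policy can grant a consumer account read-only access, so the consumer brings (and pays for) its own compute to query the data:

```python
# Minimal sketch: grant a consumer AWS account read-only access to an S3-backed
# output port via a bucket policy. Bucket name and account id are placeholders.
import json

import boto3

BUCKET = "sales-dp-output-port"        # hypothetical output port bucket
CONSUMER_ACCOUNT = "123456789012"      # hypothetical consumer AWS account

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "ReadOnlyForConsumerDomain",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT}:root"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```

The same idea applies with table grants or data-sharing features on other platforms; the point is that read access is governed by the producer while compute is billed to the consumer.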

Ok, so let’s say we figured out a way to avoid (reduce?) the TCO of our Data Products. But what about the budget to develop them?

2. Design a roadmap of data products creation that makes use of available operational systems maintenance budget streams

In complex organizations there are usually several budget streams flowing for operational systems maintenance: leverage them to create Source Aligned Data Products (which fit seamlessly within the domain, so nobody will argue)!

Very well, we’re probably now getting some more traction, but why should we maintain long-term ownership over this stuff?

3. Pay-as-you-consume billing model

I know, this is bold and, so far, still in the utopia sphere. But there’s a key point to catch: if you figure out a way to “give something back” to Data Products’ creators/owners/maintainers, be it extra budget for every 5 consumers of an output port or a contribution by the consumers to the maintenance budget, anything can lift the spirit and spread the feeling that “this is not a give-only thing”. If you get it, you win.

But let’s say I’m the coordinator of the Data Mesh journey and I received a budget just for a PoC… well, then:

4. Start small, with the most impactful yet least “operationally business-critical” consumer-driven data opportunity

No big bang, don’t make too much noise (yet), pick the right PoC use case and make it your success story to promote the adoption company-wide, without risking too much.

This will probably make the C-level managers happy. The same ones who are repeatedly asking for the budget to migrate (at least) the analytical workloads to the Cloud, without receiving it “because we’re still amortizing the expense for that on-prem system”. Well, here’s the catch: if they want you to build a Data Mesh, you’re gonna need an underlying stack that allows you to develop the “Self-Serve Infrastructure-as-a-Platform”. That’s it:

5. Leverage the Data Mesh journey as an opportunity to migrate to the Cloud

Your domain teams will probably be happy to fast-forward to the present day, technology-wise, and work on Cloud-based solutions.

 

This last tip brings to light the other “area of worry”:

THE UNKNOWN ⚠


Data Mesh is an organizational shift that involves processes and, most of all, people.

People are usually scared of what they don’t know, especially in the work environment, where they might be “THE reference” in everybody’s eyes for system XY and would feel they are losing power and control by “simply becoming part of a decentralized domain’s Data Product Team”.

First of all, you’re gonna need to:

6. Train people

Training and mentoring key (influencer?) people across the domains is a proven way of making people UNDERSTAND WHY the company decided to embrace this journey. Data Mesh experts should mentor these focal points so that they, in turn, can spread the positive word within their teams.

But teams are probably scared too of being forced to learn new stuff, so:

7. Make people with “too legacy technical background” understand this is the next big wave in the data world

Do they really wanna miss the train? Changes like this one might come once every 10 or 20 years in big organizations: don’t waste the opportunity and get out of that comfort zone!

Nice story, so are you telling me we have organizational inefficiencies and we don’t make smart use of our data to drive or automate business decisions? Well, my dear, it’s not true: MY DOMAIN publishes data that I’ve been told (by a business request) should be used somewhere.

I hear you, but understand that your view might be limited. There are plenty of others complaining about change management issues, dependency issues, slowness in pushing evolutions and innovations, lack of ownership and quality, and so on.

8. Organize collaborative workshops where business people meet technical people and share pain points from their perspectives

You will be surprised by the amount of “hidden technical debt” that will come out, and people will start empathizing with each other: this will start spreading the “my data can bring REAL VALUE” mindset across the organization (because real people are complaining in front of them).

Does this end once the Data Products have been released? Absolutely not! We talked about «long-term ownership» so we need to provide «long-term motivation»:

9. Collect clear insights about the value produced with data and make it accessible to everybody within the organization (through the platform)

This can start as simple as the «number of consumers of a Data Product» (what if it’s zero?) and evolve into more complex metrics such as time-to-market (the time to bring a new Data Product into production, which is expected to shrink), the network effect (the more interconnections across Domains, the more value is probably added downstream), and other metrics more specific and valuable to your corporation.
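The simplest of these metrics can be derived directly from the platform’s access logs; the sketch below assumes a hypothetical log format of (data product, consumer) records and simply counts distinct consumers per Data Product, flagging the ones nobody uses:

```python
# Minimal sketch: count distinct consumers per Data Product from access-log
# records and flag unused products. Records and product names are placeholders.
from collections import defaultdict

access_log = [
    {"data_product": "sales.orders", "consumer": "marketing.campaigns"},
    {"data_product": "sales.orders", "consumer": "finance.forecast"},
    {"data_product": "hr.headcount", "consumer": "finance.forecast"},
]
registered_products = {"sales.orders", "hr.headcount", "logistics.shipments"}

consumers_by_product = defaultdict(set)
for record in access_log:
    consumers_by_product[record["data_product"]].add(record["consumer"])

for product in sorted(registered_products):
    count = len(consumers_by_product.get(product, set()))
    note = "  <-- zero consumers, investigate" if count == 0 else ""
    print(f"{product}: {count} consumer(s){note}")
```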

You convinced me. I will develop Data Products. Where do I start?

10. Organize collaborative workshops to identify data opportunities

Following the DDD (Domain-Driven Design) practice that the first pillar of Data Mesh, Domain-oriented Decentralized Ownership, is based on, and combining it with the Data-as-a-Product pillar, we get a business-decision-driven design model to identify data opportunities (i.e. Data Products to be created in order to support/automate data-driven business decisions).

 

On this topic, you might be interested in the Data Product Flow we designed at Agile Lab, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Roberto Coluccio, Agile Lab Data Architect

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

Customer 360 and Data Mesh: friends or enemies?

Customer 360 and Data Mesh: friends or enemies?

Raise your hand if you saw, were asked to design, tried to implement, or struggled with the “Customer 360” view/concept in the last 5+ years…

Come on, don’t be shy …

How many Customer 360 stories did you see succeeding? Me? Just … a few, let’s say. This made me ask: why? Why put so much effort into creating a monolithic 360 view, thus creating a new maintenance-evolution-ownership nightmare of a silo? Some answers might fit in the context of centralized architectures, but recently a new antagonist came to town: Data Mesh.

“Oh no, another article about the Data Mesh pillars!”

No, I’m not gonna spam the web with yet another article about the Data Mesh principles and pillars, there’s plenty out there, and if you got here it’s probably because you already know what we’re talking about.

The question that arises is:

How does the (so struggling or never really achieved) Customer 360 view fit into the Data Mesh paradigm?

Well, in this article I’ll try to come up with some options, coming from real Agile Lab customers approaching the “Data Mesh journey”.

This fascinating principle, domain-oriented decentralized ownership, inherited (in terms of effectiveness) from the microservices world, brings to light the necessity to keep tech and business knowledge together within specific bounded contexts, so as to improve the autonomy and velocity of the data product lifecycle, along with smoother change management. Quoting Zhamak Dehghani:

Eric Evans’s book Domain-Driven Design has deeply influenced modern architectural thinking, and consequently the organizational modeling. It has influenced the microservices architecture by decomposing the systems into distributed services built around business domain capabilities. It has fundamentally changed how the teams form, so that a team can independently and autonomously own a domain capability.

Examples of well-known domains are Marketing, Sales, Accounting, Product X, or Sub-company Y (with subdomains, probably). All of these domains have probably something in common, right? And here we connect with the prologue: the customer. We all agree on the importance of this view, but eventually do we really need this to be materialized as a monolithic entity of some sort, on a centralized system?

I won’t answer this question, but I’ll ask another one:

Which domain would a customer 360 view be part of?

Remember: decentralization and domain-driven design imply having clear ownership which, in the Data Mesh context, means:

  • decoupling from the change management process of other domains
  • owning the persistence (hence: storage) of the data
  • providing valuable-on-their-own views at the Data Products’ output ports (is a holistic materialized 360 customer view really valuable on its own?); this is part of the “Data as a Product” pillar, to be precise
  • guaranteeing no breaking changes to domains that depend on the data we own (i.e. owning the full data lifecycle)
  • (last but not least) having solid knowledge (and ownership) of the business logic orbiting around the owned data. Examples are: knowing the mapping logic between primary and foreign keys (in relational terms), knowing the meaning of every bit of information within that data, and knowing which processes might influence the data lifecycle

If you can’t now answer the above question DON’T WORRY: you’re not alone! It just means you’re starting as well to feel the friction between the decentralized ownership model and the centralized customer 360 view.

IMHO, they are irreconcilable. Here’s why, with respect to the previous points:

  • centralized ownership over all the possible customer-related data would couple the change management processes of all the domains: a 360 view would be the only point-of-access for customer data, thus requiring strong alignment across the data sources feeding the customer 360 (which usually would end up slooooowing down every innovation or evolution initiative). Legit question: “So, this means we should never join together data coming from different domains?” — Tough but realistic answer: “You should do that only if it makes sense, if it produces clear value, if it doesn’t require further interpretation or effort on the consumers’ side to grasp such value”.
  • If you bind together data coming from different domains, it must be to create a valuable-on-its-own data set. The mere fact that you read, store, and expose data makes you its owner; that’s why a persisted customer 360 view would make sense only if you actually owned the whole dataset, not just a piece of it. Furthermore, Data Products require immutable and bi-temporal data at their output ports: can you imagine the effort of delivering such features over such a huge data asset?
  • valuable-on-its-own Data Products, well, I hope the previous 2 points already clarified this point 😊
  • If we sort out all the previous issues and actually become the owner of a customer 360 persisted materialized view, while still being compliant with the Data Mesh core concepts, we should never push breaking changes towards our data consumers. If we need to make a breaking change, we must guarantee a sustainable migration plan. Data versioning (with tools like LakeFS or Project Nessie) could come in handy here, but there might be several organizational and/or technical ways to do so. Having many dependencies (as we would have owning a customer 360 view) would make it VERY hard to always keep a smooth data lifecycle and avoid breaking changes, since we could find ourselves simply overwhelmed by breaking changes made on the source operational systems’ side.
  • The whole Data Mesh idea emerged from a decentralization need, after seeing so many “central data engineering / IT teams” buckle under the pressure of the whole organization relying on them for quality data. It was (is) a bottleneck, in many cases worsened by the fact that these centralized tech teams couldn’t have sufficiently strong business knowledge of every possible domain (while, unfortunately, they were developing and managing all the ETLs/ELTs of the central data platform). For this very same reason, the Customer 360 team would be required to master the whole business logic around such data, i.e. to be THE acknowledged experts of basically every domain having customer-related data, which is just impossible.
A customer 360 logical view, with domains owning only a slice of it

OK, I’m done with the bad news 😇 Let’s start with the good ones!

Customer 360 view and Data Mesh can be friends

As long as it remains a holistic logical/business concept. Decentralized domains should own the data related to their bounded contexts, even when it refers to the customer. Domains should own a slice of the 360 view, and slices should be correlated together to create valuable-on-their-own Data Products (e.g. just a join between sales and clickstreams doesn’t provide any added value, but a behavioral pattern on the website combined with the number of purchases per customer with specific browsing behaviors does).

In order to achieve that, domains and, more in general, data consumers must be capable of correlating customer-related data across different domains with ease. They must be facilitated at the joining-logic level, which means they should NOT be required to know the mapping between domain-specific keys, surrogate keys, and customer-related keys. I’ll try to expand on this last element.

Globally Unique Identifiable Customer Key

According to the Data Mesh literature, data consumers shouldn’t always copy (ingest in the first place) data in order to do something with it, since storing data means having ownership over it. As a technical step justified by volumes or other constraints for a single internal ETL step, that’s ok, but Consumer Aligned Data Products shouldn’t just pull data from another Data Product’s output port, perform a join or append another column to the original dataset, and publish an enhanced projection of some other domain’s data at their output port, for many reasons, but I won’t digress on this (again, it’s part of the Data as a Product pillar). What data consumers need is a well-documented, globally identifiable, and unique key/reference to the customer, in order to perform all the possible correlations across different domains’ data. The literature would also call such keys polysemes.

Note: “well-documented” should become a computational policy (do you remember the Federated Governance pillar of Data Mesh?) requiring, for example, that in the Data Product’s descriptor (want to contribute to our standardization proposal?) a specific set of metadata describes where a certain field containing such a key comes from, which domain generated it, and what business concepts it points to. Maybe that could also be done leveraging a syntax or language that facilitates the automated creation (thank you, Self-Serve Infrastructure-as-a-Platform) of a Knowledge Graph afterward.
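To make the idea tangible, here is a hypothetical fragment (not the actual descriptor specification referenced above) showing how such metadata could be declared on an output port’s schema and checked by a computational policy; all field names are illustrative assumptions:

```python
# Hypothetical descriptor fragment: declare where the globally unique customer
# key (polyseme) comes from, so a policy can verify it and a knowledge graph can
# be derived automatically. Field names are illustrative, not a standard.
output_port_schema = {
    "name": "orders_by_customer",
    "columns": [
        {"name": "order_id", "type": "string", "description": "Domain-specific order key"},
        {
            "name": "customer_cid",
            "type": "string",
            "description": "Globally unique customer identifier (polyseme)",
            "originDomain": "customer-mdm",
            "originSystem": "Customer MDM golden record",
            "businessConcept": "Customer",
        },
    ],
}


def has_documented_polyseme(schema: dict) -> bool:
    """Example computational policy: at least one column must declare its origin and concept."""
    return any("originDomain" in col and "businessConcept" in col for col in schema["columns"])


assert has_documented_polyseme(output_port_schema)
```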

OK, the concepts of polysemes and a globally unique, identifiable customer-related key may not be new to you, and they shouldn’t be, but what I want to shed light on is how to combine them with the Data Mesh paradigm, especially because several architectural patterns are technically possible but only a few (one?) of them can guarantee long-term scalability and compliance with the Data Mesh pillars.

An MDM is still a good ally

The question you might have asked yourself at this point is:

Who has the ownership of issuing such a key?

The answer is probably in our good old friend, the Master Data Management (MDM) system, where the customer’s golden record data can be generated. There’s a lot of literature (example) on that concept, also because it’s definitely not new in the industry (and that’s why a lot of companies approaching Data Mesh are struggling to understand how to make it fit in the picture, since it can’t just be thrown away after all the effort spent on building it up).

Note: eventually, such a transactional/operational system might also lead to developing a related Source Aligned Data Product, but it should just be considered as a source of the customers’ registry information.

OK, but who should then be responsible for performing the reverse lookup, mapping a domain-specific record (key) to the related customer (golden record) key?

We narrow down the spectrum into 3 possible approaches. To facilitate the reading of what follows, I’d like to point out a few things first:

  • SADP = Source Aligned Data Product
  • CADP = Consumer Aligned Data Product
  • CID = Customer MDM unique globally identifiable key
  • K = example of a domain-specific primary key
Approach 1: The Ugly
Approach 2: The Bad
Approach 3: The Good

The latter approach is what could make a customer 360 view shine again! This way, domains (and in particular the operational systems’ teams) preserve the ownership of the mapping logic between the operational system’s data key and the CID; in the analytical plane, no interactions with the operational one are required to create Source Aligned Data Products other than pulling data from the source systems; and on the Consumer Aligned Data Products’ side, no particular effort is spent correlating customer-related data coming from different domains.
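To illustrate the consumption side of this approach, here is a minimal sketch assuming each source-aligned Data Product already exposes the MDM-issued CID at its output port (paths and column names are placeholders); the consumer-aligned Data Product then correlates domains with a plain join on the CID, with no reverse-lookup logic of its own:

```python
# Minimal sketch: a consumer-aligned Data Product joining two domains' output
# ports on the globally unique customer key (CID). Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cadp-customer-behavior").getOrCreate()

purchases = spark.read.parquet("s3://sales-sadp/output-port/purchases/")     # carries "cid"
clickstream = spark.read.parquet("s3://web-sadp/output-port/clickstream/")   # carries "cid"

# Behavioral pattern plus purchases per customer: a valuable-on-its-own view keyed by CID.
behavior = (
    clickstream.groupBy("cid").count().withColumnRenamed("count", "page_views")
    .join(
        purchases.groupBy("cid").count().withColumnRenamed("count", "purchases"),
        on="cid",
        how="left",
    )
)
behavior.write.mode("overwrite").parquet("s3://behavior-cadp/output-port/customer-behavior/")
```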

Disclaimer 1: approach 3 requires strong Near-Real-Time alignment between the Customer MDM system and the other operational systems, since a misalignment could mean publishing data from the various domains that is not yet attached to a newly available CID (many MDM systems run their match-and-merge logic in batch, and many operational systems may set the CID field through batch reverse lookups well after the domain data is generated).

Disclaimer 2: a fourth approach could be possible, i.e. having microservices issue global IDs in NRT, which then become the master key in all the domains holding customer-related data. This is usually achievable only in custom implementations.

Wrapping up

In the Data Mesh world, domain-driven decentralized ownership over data is a must. The centralized, monolithic Customer 360 approach doesn’t fit the picture, so a shift is required in order to maintain what actually matters the most: the business principle of customer-centric insights, derived from the correlation of data taken from different domains (owners) via a well-documented and well-defined globally unique identifier, probably generated at the Customer MDM level and integrated “as far left as possible” into the operational systems, so as to reduce the burden on the Data Mesh consumers’ side of knowing and applying the mapping logic between domain-specific keys and the Customer MDM key.

. . . 

What has been presented here has been discussed with several customers and seems to be the best option so far. We hope the hype around Data Mesh will bring some more options in the next few years, and we will keep our eyes open; if you have already put a different approach in place, I’ll be glad to hear more about it.

 

If you’re interested in other articles on the Data Mesh topics, you can find them on our blog, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Roberto Coluccio, Agile Lab Data Architect

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

How and why Data Mesh is shaping the data management’s evolution

How and why Data Mesh is shaping the data management’s evolution

 

What is Data Mesh?

Data mesh is not a technology, but a practice…a really complex one. It is a new way to look at data, data engineering, and all the processes involved with them, i.e. Data Governance, Data Quality, auditing, Data Warehousing, and Machine Learning. It completely changes the perspective on how we look at data inside a company: we are used to looking at data from a technical standpoint, i.e. how to manage it, how to process it, which ETL tool to use, which storage….and so on….while it’s time to look at it from a business perspective and deal with it as a real asset, let’s say as a product. Therefore, in the data mesh practice we need to reverse our way of thinking about data: we typically see data as a way to extract value for our company, but here instead we need to understand that data IS THE VALUE.

Why is it relevant? What problem is it going to solve?

If in the past years you invested a lot in Data Lake technologies I have a piece of bad news and a good one.

The good one is: you can understand which are the problems and why Data Mesh is going to solve them.

The bad one is: probably you’ll need to review your data management strategy if you don’t want to fall behind your competitors.

The main problem is related to how the Data Lake practice has been defined. At the first stage (I’m talking about the period from 2015 to 2018, at least in the Italian market) it was not a practice, it was a technology: specifically, Hadoop was the data lake.

It was very hard to manage, mainly in on-premise environments, and skills on such technologies were lacking. And… there was no practice defined. Think about the person who coined the “data lake” term, James Dixon, who was the CTO of Pentaho: Pentaho is an ETL tool, so it’s easy to understand that the focus was on how to store and how to transform huge quantities of data, not on the process of value extraction or value discovery.

From an organizational perspective, it was a good option to set up a data lake team in charge of centrally managing the technical complexity of the data lake and, when things became more complicated, the same team was in charge of managing different layers (cleansed layer, standardized layer, harmonization layer, serving layer) … all technical ones, obviously.

Then, in most cases, this highly specialized team ended up managing all the requests coming from the business, which was in charge of gaining insights from data, yet without knowledge of the data domains, because it had always focused on technicalities and still relied on operational system teams for functional and domain information about the data it was managing. This is a technical and organizational bottleneck, and it is why a lot of enterprises struggle to extract value from data: the implementation of a single KPI generates a change management storm across all the technical layers of the data lake, and it also requires involving people from operational systems who are not engaged in or interested in such a process…it’s not their duty. This is a movie that is not going to end well, because there is no alignment in the purpose of all the actors (also because the organization itself is not oriented to the business purpose).

Why is it not super popular then? How is it going to solve the problems you mentioned before?

Data mesh is putting data at the center. As I said Data is the value, data is the product…so the greatest effort is spent in letting people CONSUME data instead of processing and managing them.

The core idea is that we will have only two kinds of stakeholders: those who produce data, let’s say sell it, and those who consume data, buying it.

And we all know that in the real world the customer rules the market, so let’s think about the needs of data consumers:

1. they want to be able to consume all the data of the company in a self-service way. Amazon-like, you browse and you buy, so basically you are consuming products from a marketplace.

2. they don’t want to deal with technical integration every time they need a new data product.

3. they don’t want to spend time trying to understand the characteristics of the data, e.g. if you are willing to buy something on Amazon, are you going to call the seller to better understand what the product looks like? NO! The product is self-describing; you must be able to get all the information you need to buy the product without any kind of interaction with the seller. That’s the key, because any kind of interaction creates tight coupling: sometimes you need to find a slot in the agenda, wait for documentation, and so on. This takes time, destroying the data consumer’s time to market!!!

4. they want to be able to trust the data they are consuming. Checking the quality of the product is a problem for whoever is selling it, not whoever is buying it. If you buy a car, you don’t need to check all the components; there is a quality assurance process on the seller’s side. Instead, I have seen quality assurance performed many, many times in the Data Lake after ingesting data from the operational systems, then again in the DWH after ingesting data from the lake, and maybe again in the BI tool….after ingestion from the DWH.

Data Mesh simply enables this kind of vision around data in a company. A simple concept, but really hard to build in practice: that’s why it’s still not mainstream (but it is coming). The term was coined in 2019 and we are still in an experimental phase, with people exchanging opinions and trying it out. What is truly game-changing for me is that we are not talking about technology, we are talking about practice!

What is the maturity grade of this practice?

Thoughtworks did an excellent job of explaining the concepts and evangelizing around them but, at the moment, there is no reference physical architecture.

I think that we, as Agile Lab, are doing an excellent job in going from theory to practice and we are already in production with our first data mesh architecture on a large enterprise.

Anyway, the level of awareness around these concepts is still low, in the community we are discussing data product standardization and other improvements that we need to take into account to make this pattern easier to implement. At the moment, it’s still too abstract and it requires a huge experience on data platforms and data processes to convert the concept into a real “Data Mesh platform”.

As happened with microservices and operational systems (Data Mesh has a lot of similarities with microservices), so far the new practice seems elegant from a conceptual standpoint but hard to implement. It will require time and effort, but it will be worth it.

What is the first step to implement it?

My suggestion is to start by identifying one business domain and trying to apply the concepts only within its boundaries. To choose the right one, keep in consideration the ratio between business impact and complexity, which should be as high as possible. Reduce the number of technologies needed for the use case as much as you can (because they have to be provided as self-service, and that is mandatory).

In addition, try to build your first real cross-functional team, because the mesh revolution is also going through the organization and you need to be ready for that.

How much time is required for a large organization to fully embrace data mesh?

I think that a large enterprise with a huge legacy should estimate, at least, 3 years of journey to really comprehend and see benefits from it.

What is the hardest challenge in implementing a data mesh?

The biggest one is the paradigm shift that is needed and this is not easy for people working with data for a long time.

The second one is the application of evolutionary architecture principles to data architectures, an unexplored territory…it is not immediate, but if you think about data management systems, the architecture doesn’t change that much until it is obsolete and you need a complete rewrite.

In the Data Mesh, instead, we try to focus on producer/consumer interfaces and not on the data life cycle. This means that, as long as we keep this contract safe, we can evolve our architecture under the hood, one piece at a time and with no constraints.

Another challenge is the cultural change and the ways you can drive it in huge organizations.

Is there any way to speed up the process?

From a technical standpoint, we think that the key is to reach community-driven standardization and interoperability of the data product as an architectural quantum, because after that it will be possible to build an ecosystem of products that will speed up the process.

Anyway, the key components to create a data mesh platform are:

  • Data Product Templates: to spin up all the components of a Data Product, with built-in best practices
  • Data Product Provisioning Service: all Data Product deployments should go through a single provisioning service in charge of applying computational policies, creating the underlying infrastructure, and feeding the marketplace (a minimal sketch of such a policy check follows this list)
  • Data Product Marketplace: a place where it is easy to search for and discover Data Products, not just data.
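The sketch below shows, under stated assumptions, what the provisioning service’s entry point could look like: a hypothetical request format (not the API of any specific product) and an example of the computational policies it could enforce before creating infrastructure and feeding the marketplace:

```python
# Minimal sketch: a hypothetical provisioning request and the kind of
# computational policy checks a provisioning service could run before deploying.
provisioning_request = {
    "dataProduct": "sales.orders",
    "domain": "sales",
    "owner": "sales-data-team@example.com",
    "outputPorts": [
        {"name": "orders_by_customer", "format": "parquet",
         "documentation": "Orders aggregated by customer, bi-temporal", "sla": "99.5%"},
    ],
}


def validate(request: dict) -> list:
    """Example computational policies: reject deployments without owner, docs, or SLA."""
    errors = []
    if not request.get("owner"):
        errors.append("missing owner")
    for port in request.get("outputPorts", []):
        if not port.get("documentation"):
            errors.append(f"output port {port['name']}: missing documentation")
        if not port.get("sla"):
            errors.append(f"output port {port['name']}: missing SLA")
    return errors


print(validate(provisioning_request) or "request accepted")
```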

 

On this topic, you might be interested in the Data Product Flow we designed at Agile Lab, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

How to identify Data Products? Welcome “Data Product Flow”

How to identify Data Products? Welcome “Data Product Flow”

 

We all know that Data Mesh is inspired by DDD principles (by Eric Evans), transposing them from the operational plane (microservice architecture) to the analytical one (data mesh).

For the past year and a half we have been helping many big enterprises adopt the data mesh paradigm, and it’s always the same old story. The concept itself is really powerful and everybody catches its potential from the early days, because it addresses real problems, so it is not possible to ignore it.

Once you have buy-in on the high-level concepts and principles, it is time to draft the platform capabilities. This step is game-changing for many companies and often becomes challenging because it revolutionizes processes and tech stacks.

But the most complex challenge is another one; it is something that blows people’s minds away.

What is a Data Product? I mean, everyone understands the principles behind that, but when it comes to defining it physically… is it a table? Is it a namespace? How do I map it with my current DWH? Can I convert my Data Lake to Data Products? These are some of the recurring questions… and the answer is always “no” or “it depends”.

When we start to introduce concepts like bounded context and other DDD elements, most of the time it gets even harder, because these are abstract concepts and people involved in Data Management are not familiar with them. We are not talking with software experts; DDD until now has been used to model software, online applications that need to replicate and digitalize business processes. Data Management people were detached from this cultural shift; they typically reason around tables, entities, and modeling techniques that are not business-oriented: 3NF, Dimensional modeling, Data Vault, the Snowflake model… all of them try to rationalize the problem from a technical standpoint.

So after a while, we arrive at the final question: How do we identify Data Products?

For DDD experts, the answer could seem relatively easy…but it is not !!!

Before deep diving into our method, let’s define an essential glossary about DDD and Data Mesh (drawing on various authors):

Domain and Bounded Context (DDD): Domains are the areas where knowledge, behaviour, laws and activities come together. They are the areas where we see semantic coupling and behavioural dependencies. A domain existed before us and will exist after us; it is independent of our awareness.

Each domain has a bounded context that defines the logical boundaries of a domain’s solution, so bounded contexts are technical by nature and tangible. Such boundaries must be clear to all people. Each bounded context has its ubiquitous language (definitions, vocabulary, terminology people in that area currently use). The assumption is that the same information can have different semantics, meanings and attributes based on the evaluation context.

Entity (DDD): Objects that have a distinct identity running through time and different representations. You also hear these called «reference objects».

Aggregate (DDD): It is a cluster of domain objects or entities related to each other through an aggregate root and can be treated as a single unit. An example can be an order and its line items or a customer and its addresses. These will be separate objects, but it’s useful to treat the order ( together with its line items ) as a single aggregate. Aggregates typically have a root object that provides unique references for the external world, guaranteeing the integrity of the Aggregate as a whole. Transactions should not cross aggregate boundaries. In DDD, you have a data repository for each Aggregate.

Data Product (Data Mesh): It is an independently provisionable and deployable component focused on storing, processing and serving its data. It is a mixture of code, data and infrastructure with high functional cohesion. From a DDD standpoint, it is pretty similar to an Aggregate.

Output Port (Data Mesh): It is a highly standardized interface, providing read-only and read-optimized access to Data Product’s data.

Source-aligned Data Product (Data Mesh): A Data Product that is ingesting data from an operational system (Golden Source)

Consumer-aligned Data Product (Data Mesh): A Data Product that is consuming other data products to create brand new data, typically targeting more business-oriented needs

So, how do we identify Data Products?

Because a Data Product, from a functional standpoint, is pretty similar to an Aggregate, we can say that the procedure to define a Data Product’s perimeter is pretty similar too. Keep in mind that this is valid only for Source Aligned Data Products, because the drivers change as we move along the data value chain.

In Operational Systems, where the main goal is to “run the business”, we define Aggregates by analyzing the business process. There are several techniques to do that. The one I like most is Event Storming (by Alberto Brandolini). It is a kind of workshop that helps you decompose a business process into events (facts) and analyze them to find the best aggregate structure. The steps to follow are straightforward:

  1. Put all the relevant business events in brainstorming mode
  2. Try to organize them with a timeline
  3. Pair with commands and actions
  4. Add actors, triggers, and systems
  5. Try to look for emerging structures
  6. Clean the map and define aggregates

The process itself is simple (I don’t want to cover it here because there is a lot of good material out there), but what makes the difference is who is driving it and how. The facilitator must ask the domain experts the right questions and extract the right nuances to crunch the maximum amount of domain information.

There are other techniques like bounded context canvas or domain storytelling, but, if your goal is to identify Data Products, I would suggest using Event Storming because, in Data Mesh, one of the properties that we want to enforce is the immutability and bi-temporality of data, something that you mainly achieve with a log of events.

So, Event Storming is the preliminary step of our process to discover aggregates.

If your operational plane is already implemented with a microservice architecture and DDD, you probably already have a good definition of aggregates; but if you have legacy operational systems as golden sources, you need to start defining how to map and transform a centralized and rationalized data management practice into a distributed one.

Once you have your business processes mapped, it’s time to think about data and how it can impact your business. The analytical plane is about optimizing your business, and this comes from better and more informed decisions.

Now we enter into our methodology: Data Product Flow.

Data Product Flow

Each event is generated by a command/action and a related actor/trigger. To take an action, the people in charge of that step need to make decisions. The first step of our process is to let all the participants place a violet card describing the decision driving the creation of that event and the related actions. This decision could be the current one or the one it could be in a perfect world. Participants should also think about the global strategy of their domain and how those decisions will help move a step forward in that direction. We can now introduce the concept of Data Opportunities:

Data Opportunity is:

  • A smarter and more effective way to get a decision supported by data: you might be able to fully automate a decision or generate suggestions for it.
  • A brand new business idea that is going to be supported by Data. In that case, it will not be part of the identified aggregates.

At this stage, let’s start to talk about Data Products to emphasize that, from now on, the focus will entirely be on the data journey and how to be data-driven.

In this phase, the facilitator needs to trigger some ideas and do some examples to let the participants realize that now we are changing direction and way of thinking (compared with the previous part of the workshop). We are not focusing anymore on the actual business process, but we are projecting into the future. We are imagining how to optimize and innovate it, making it data-driven.

When people realize that they are allowed to play, they will have fun. It is crucial to let people’s creativity flow.

Once all the business decisions are identified, we start to think about what data could be helpful to automate them (fully or partially). Wait, but until now we didn’t talk about data that much. Where the hell is it?

Ok, it’s time to introduce the read model concept. In DDD, a read model is a model specialized for reads/queries: it takes all the produced events and uses them to build a model suitable for answering clients’ queries. We typically observe this pattern with CQRS, where, starting from commands, it is possible to create multiple independent read-only views.

When we transpose this concept in the Data Mesh world, we have Output Ports.
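As a minimal illustration of the mechanism (illustrative event shapes, not a specific framework), the sketch below folds a stream of domain events into a read-optimized view, which is, loosely, the same idea an output port transposes to the analytical plane:

```python
# Minimal sketch: project a log of domain events into a read model
# (net spend per customer). Event shapes are placeholders.
from collections import defaultdict

events = [
    {"type": "OrderPlaced",   "order_id": "o1", "customer_id": "c1", "amount": 120.0},
    {"type": "OrderPlaced",   "order_id": "o2", "customer_id": "c1", "amount": 80.0},
    {"type": "OrderRefunded", "order_id": "o2", "customer_id": "c1", "amount": 80.0},
]


def build_read_model(event_log: list) -> dict:
    """Fold events into a query-friendly view, keyed by customer."""
    view = defaultdict(float)
    for event in event_log:
        if event["type"] == "OrderPlaced":
            view[event["customer_id"]] += event["amount"]
        elif event["type"] == "OrderRefunded":
            view[event["customer_id"]] -= event["amount"]
    return dict(view)


print(build_read_model(events))  # {'c1': 120.0}
```

Because the view is derived from an immutable log of events, it also fits the immutability and bi-temporality properties we want to enforce at the output ports.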

For each Business Decision, let’s identify the datasets that can help automate it. Create a green post-it in each data product where we believe it is possible to find the data we need, and position it close to the events that probably generate the data underlying such an output port.

Please pay attention to giving a meaningful name to the output port, respecting the ubiquitous language and not trying to rationalize it (there is always time to clean up the picture; premature optimization is not good in this field either). I repeat, because it is vital: please provide domain-oriented and descriptive output port names, don’t try to define a “table name”.

Do the same for “new business decisions”, those that are not part of the actual business process.

At this stage, your picture should be more or less like the following one.

If you are working in a real environment, you will soon realize something scary: while you are searching for datasets that can help with your business decisions, you will be able to imagine what information could be helpful, but you will not be able to find the right place for it on the board, simply because that kind of information does not exist yet and you need to create it.
This is probably your first consumer-aligned Data Product!!
Start creating a placeholder (in yellow) and create the output ports needed to support your business decisions.

These new output ports can be part of the same data product, or you may need to split them into different data products. To decide, we need to apply some DDD principles again. This time we need to think about the business processes that will generate such data:

  • Is the same business process generating both datasets at the same time?
  • Do we need consistency between them?
  • Could they potentially have different long-term ownership or not?

There are many other rules and ways to validate data product boundaries, but that is not the focus of this article. Fortunately, in this case, we can unify the process under a single ownership and have two different output ports for the same data product, because they have high functional cohesion while still providing optimized read models to our customers.

This process can also have several steps, and you need to proceed backwards, but always starting from business decisions.

In this phase, if it is becoming too abstract and technical for the domain experts, simplify and skip it; you can work on it offline later. It is essential not to lose the domain experts’ attention.

Because Data Mesh is not embracing the operational plane of data, we now need to do something fundamental to stay consistent and model authentic data products.

At this stage, if on your board you have a data product that is making business decisions leveraging output ports of other data products and also has the “system” post-it, it means you are mixing operational and analytical planes. This cannot be accurate, because the business process happens in only one of the two. Data Product 1 is purely operational because it is not going to make data-driven decisions. Data Product 2/3 processes are happening on operational systems for sure, because we started from there; still, we would like to automate or support some business decisions by reading data from the analytical plane. When you detect this situation, it is better to split the data product: keep a source-aligned data product that maps onto the operational system’s data, then create an additional DP to include the newly added value-adding business logic. These are more consumer-aligned DPs, because they read from multiple data products to create something entirely new, with new ownership.

To understand this better, it can help to visualize this last step differently. With the standard business process (AS-IS), users make decisions, propagating them into the source-aligned DP. Suppose we want to support the user in making smarter or faster decisions. In that case, we can provide some data-driven suggestions through a consumer-aligned data product that also mixes in data from other DPs. In the end, it is still the same business decision, but it is better to split it into executive decisions and suggestions to separate the operational plane from the analytical one. If you are thinking of squeezing everything into the source-aligned DP, keep in mind that it includes the executive business decisions (facts): you would end up mixing operational and analytical planes without any decoupling. It is also dangerous from an ownership standpoint, because once you consume data from other domains, the DP will not be “source-aligned” anymore.

Also, if you can automate such business decisions fully, I strongly recommend keeping them as separate Data Products.

How to formalize all this stuff? Can we create a deliverable that is human readable and embeddable in documentation?

DDD is helping us again. We can use Bounded Context Canvas to represent all the Data Products.

I propose two different templates:

  • Source-aligned DP
  • Consumer-aligned DP

The final result is quite neat and clean.
Welcome, Data Product Flow, as a practice and workshop format.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

STAY TUNED!

If you made it this far and you’re interested in more info about the Data Mesh topics, or you’d like to know more about Data Mesh Boost, get in touch!

 

My name is Data Mesh. I solve problems.

My name is Data Mesh. I solve problems.

In spite of the innumerable advantages it provides, we often feel that technology has made our work harder and more cumbersome instead of helping our productivity. This happens when technology is badly implemented, too rigid in its rules, far removed from real-world needs in its logic, or trying to help us when we don’t need it (yes, autocorrect, I’m talking to you: it is never “duck”!). But we would never go back to having to drive with a paper map unfolded on our lap in the “pre-GPS” days or to browsing the yellow pages looking for a hotel, a restaurant, or a business phone number. We need technology that works and that we are happy to work with. Data Mesh, like Mr. Wolf in Quentin Tarantino’s “Pulp Fiction”, solves problems fast, effectively, and without fuss. Let’s take a look at what this means in practice.

Imagine that….

What’s more frustrating than needing something you know you have, but not remembering where it is, or whether it still works? Imagine that this weekend you want to go to your lake house, where you haven’t been in a long while. There is no doubt you have the key to the house: you remember locking the door on your way out and then thinking long and hard about a safe place to store it. It may be at home, or even more likely in the bank’s safe deposit box. Checking there first, while it yields a high probability of success, means having to drive to the bank, during business hours, and risking wasting half a working day for nothing. You can look for the key at home in the evening and it should be faster, but it could also take forever if you don’t find it and keep searching, and then, if it’s not at home, you’d have to go to the bank anyway. You even had a back-up plan: you gave your uncle Bob a copy of the key. The problem is that you can’t remember if you also gave him a new one after you had to change the bolt in 2018…. If you did, Bob’s your uncle, also metaphorically, but are you willing to take the chance, drive all the way to the lake and maybe find out that the old key no longer works? You wanted to go there to relax, but it looks like you are only getting more and more stressed out with the preparation, and now you are wondering if it’s really worth it…

That’s a real problem

The above scenario seems to only relate to our private lives.
In reality, this is one of the biggest hurdles in a corporate environment, as well, with the only difference being that at home we look for physical, “tangible” objects, while the challenge in a modern company is finding reliable data. It is not just a matter of knowing where the data is, but also if we can really trust it. In our spare time this is annoying, but we can afford this kind of uncertainty. More frequently than we care to admit, we either spend a disproportionate amount of time looking for what we have misplaced, making sure that it still works or fixing it, or we just plainly give up doing something pleasant and rewarding because we can’t be bothered to search for what we need, not knowing how long it will take. In the ever more competitive business world (whichever your business is, the competition is always tough!), avoiding extra expenses and never passing an opportunity are an absolute must. And you can’t afford these missteps “just” because you can’t find or trust your data.

Today it is highly unlikely that a company doesn’t have the data it needs for a report, a different KPI, a new Business Intelligence or Business Analytics initiative, for analysis to validate a new business proposition and so on. We seem, on the contrary, to be submerged by data and to always be struggling to manage it.
New questions arise. The real questions we face are therefore about the quality of data. Is it reliable? Who created that data? Does it come from within the company or has it been bought, merged after an acquisition, derived from a different origin dataset? Has it been kept up to date? Does it contain all the information I need in a single place? Is it in a format compatible with my requirements? What is the complexity (and hence the cost) of extracting the information I need from that data? Does that piece of information mean what I think it means? Is there anybody else in the company who already leveraged that dataset and, if so, how was their experience?

Ok, so there might be problems that need solving before you can use a certain data asset. That, by itself, is par for the course. You know that every new business analysis, every new BI initiative comes with its own set of hurdles. The real challenge lies in the capability of estimating these challenges beforehand, so as not to incur budget overruns or other costly delays. It seems that every time you try to analyze the situation you can never get a definitive answer from data engineers regarding data quality or even the time needed to determine if the data is adequate. There is no way you can set a budget, both in terms of time and resources, to resolve the issues when you can’t know in advance which issues the data might or might not have, and you can’t even get an answer regarding how long it will take and how expensive it will be to find out. It is downright impossible to estimate a Time-To-Market if you don’t know either what challenges you will face, or when you will become aware of those challenges. Therefore, how can you determine if a product or service will be relevant in the marketplace, when there is no way to know how long it will take to launch it? Should you take the risk, or should the whole project be canned? This is the kind of conundrum that poor data observability and reusability, combined with ineffective data governance policies, put you in. And that is a place where you really don’t want to find yourself.

What companies have just recently come to realize is that the “data integration” problem is, in general, more likely to be tackled as a social problem rather than a technical one. Business units are (or at least should be) responsible for data assets, taking ownership both in the technical and functional sense. In the last decade, on the contrary, Data Warehouse and Data Lake architectures (with all their variations) took the technical burden away from the data owners, while keeping the knowledge over such data in the hands of the originating Business Units. Unfortunately, a direct consequence has been that the central IT (or data engineering) team, once it put in place the first ingestion process, “gained” ownership of such data, thus forcing a Centralized Ownership. Here the integration breaks down: potential consumers who could create value out of data must now go through the data engineering team, who doesn’t have any real business knowledge of the data they provide as ETL outcomes. This eventually results in the potential consumers not trusting, or not being able to actually leverage, the data assets, and thus being unable to produce value in the chain.

A brand new solution 

All the above implies that the time has come not just for another architecture or technology, but for a completely new paradigm in the data world that addresses and solves all these data integration (and organization) issues. That’s exactly what Data Mesh is and where it comes into play. It is, first and foremost, a new organizational and architectural pattern based on the principle of domain-driven design, a principle proven to be very successful in the field of microservices, that now gets applied to data assets in order to manage them with business-oriented strategies and domains in mind, rather than being just application/technology-oriented.

In layman’s terms it means data working for you, instead of you working to solve technical complexities (or around them). Moreover, it is both revolutionary, for the results it provides, and evolutionary, as it leverages existing technologies and is not bound to a specific underlying one. Let’s try to understand how it works, from a problem-solving perspective.

Data Mesh is now defined by 4 principles:

  • Domain-oriented decentralized data ownership and architecture 
  • Data as a product 
  • Self-serve data infrastructure as a platform 
  • Federated computational governance. 

We already mentioned that one of the most frequent reasons of data strategies failures is the centralized ownership model, because of its intrinsic “bottleneck shape” and inability to scale up. The adoption of Data Mesh, first of all, breaks down this model transforming it into a decentralized (domain-driven) one. Domains must be the owner of the data they know and that they provide to the company. This ownership must be both from a business-functional and a technical/technological point of view, so to allow domains to move at their own speed, with the technology they are more comfortable with, yet providing valuable and accessible outcomes to any potential data consumer.

Trust and budget

“Data as a product” is apparently a simple, almost trivial, concept. Data is presented as if it were a product, easily discoverable, described in detail (what it is, what it contains, and so on), with public quality metrics and an availability guarantee (in lieu of a product warranty). Products are more likely to be sold (reused, if we talk about data) if a trust relationship can be built with the potential consumers, for instance by allowing users to write reviews/FAQs in order to help the community share their experience with that asset. The success of a data asset is driven by its accessibility, so data products must provide many and varied access opportunities to meet consumer needs (technical and functional), since the more flexibility consumers find, the more likely they are to leverage such data. In the current era, data must have time travel capabilities, and must be accessible via different dynamics such as streams of events (if reasonable considering data characteristics), not just as tables of a database. But let’s not focus too much on the technical side. The truly revolutionary aspect is that now data, as a product, has a pre-determined and well-determined price. This has huge implications for the ability to budget and report a project. In a traditional system, even a modern one like a Data Lake, all data operations (from ingestion, if the data is not yet available, to retrieval and preparation so you can then have it in the format you need) are delegated to the data engineers. If they have time on their hands and can get to your request right away, that’s great for you, but not so much for the company, as it implies that there is overcapacity. This is not efficient and very, very expensive. If they are at capacity dealing with everyday operational needs and the projects already in the pipeline, any new activity has to wait indefinitely (also because of the uncertainties we discussed before) in a scenario where IT is a fixed cost for the company and therefore capacity cannot be expanded on demand. Even if it can be expanded, it is nightmarish to determine how much of the added capacity is directly linked to your single new project, how much of the work can or could be reused in the future, how much would eventually have had to be done anyway, and so on. When multiple new projects start at the same time, the entire IT overhead can easily become an indirect cost, and those have a tendency to run out of control in each and every company. With Data Mesh, on the other hand, you have immediate visibility of what data is available, how much (and how well) it’s being used, and how much it costs. Time and money are no longer unknown (or worse, unknowable in advance).

A game changer

To understand the meaning of the next two points, “Self-serve data infrastructure as a platform” and “Federated computational governance”, let’s draw a parallel with a platform type that “changed the game” in the past. It is an over-simplification, but bear with us. Imagine you work in Logistics and you need to book a hotel for your company’s sales meeting. You have a few requirements (enough rooms available, price within budget, conference room on premise, easy parking) and a list of “nice-to-haves” (half board or a restaurant so you don’t have to book catering for lunch, a shuttle to and from the airport for those who fly in, not too far from the airport or from the city center). Your company’s Data Lake contains all the data you need: every hotel in the city where the meeting will take place. But so did the “Yellow Pages” of yesteryear. How long does it take to find a suitable hotel using such a directory? It’s basically unknowable: if the Abbey Hotel satisfies all requirements, not too long, but good luck getting to the Zephir Hotel near the end of the list. Not only are the hotels sorted in a predetermined way (alphabetically), but you have to check each one in sequence, calling on the phone to find out availability, price and so on.

Moreover, the quantity of information available for each hotel directly in the Yellow Pages is wildly inconsistent. It would be nice to be able to rule out a number of them so you don’t have to call, but some hotels bought big spaces where they state whether they have a conference room or a restaurant, while others just list a phone number. If you also need to check the quality of an establishment, to avoid sending the Director of Sales to a squalid dump, the complexity grows exponentially, along with the uncertainty of how long it will take to figure it out. Maybe you are lucky and find a colleague who’s already been there and can vouch for it; otherwise you’d have to drive to the place yourself. When you budget the company’s sales meeting, how do you figure out how much it will cost, in terms of time and money, just to find the right hotel? And if not having an answer weren’t bad enough, the Sales Director, who’s paying for the convention, is fuming mad because this problem repeats itself every single year and you can never give an estimate: the time it took you to find a hotel last time is not at all indicative of how long it will take next time.

If you now replace the word “hotel” with “data” in the above example, you’ll see that it might be a tad extreme, but it is not so far-fetched. A Data Mesh is, in this sense, like a booking platform (think hotels.com or booking.com) where every data producer becomes a data owner and, just like a hotel owner, wants to be found by those who use the platform. In order to be listed, the federated governance imposes some rules. The hotel owner must list a price (cost) and availability (uptime), as well as a structured description of the hotel (dataset) that includes the address, category and so forth, and all the required metadata (is there parking? a restaurant? a pool? is breakfast included? and so on). Each of these features becomes easily visible (it is always present and shown in the same place), searchable, and can act as a filter. People who stay at the hotel (use the dataset) also leave reviews, which help other customers choose and help the hotel owner improve the quality of the offer, or at least the accuracy of the description.
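
To give a rough idea of what the “computational” side of federated governance could look like in practice, here is another illustrative sketch, building on the hypothetical descriptor shown earlier: a registration check that refuses to list a data product whose descriptor does not declare the mandatory information. The rule set below is an assumption for the example, not the behaviour of any real governance tool.

# Illustrative only: the mandatory fields below are assumed governance rules.
def validate_for_listing(descriptor: DataProductDescriptor) -> List[str]:
    """Return the list of governance violations; an empty list means the product can be listed."""
    violations = []
    if not descriptor.owner:
        violations.append("a named, accountable owner is required")
    if not descriptor.availability_slo:
        violations.append("an availability guarantee (uptime) must be declared")
    if descriptor.monthly_cost_eur <= 0:
        violations.append("a price/cost must be published")
    if not descriptor.description:
        violations.append("a structured description is required for discoverability")
    if not descriptor.output_ports:
        violations.append("at least one output port must be exposed")
    if not descriptor.quality_metrics:
        violations.append("public quality metrics must be declared")
    return violations

# Only compliant products end up in the catalogue, visible and filterable.
problems = validate_for_listing(orders)
print("Listed in the mesh catalogue" if not problems else f"Rejected: {problems}")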

The “self-serve” aspect is two-fold. From a user perspective, it means that with such a platform the Sales department can choose and book the hotel directly, without needing (and paying for) the help of Logistics (the Data Lake engineers). From an owner perspective (hotel or data owner), it means they can independently choose and advertise which services to offer (rooms with air conditioning, Jacuzzis, butler service and so on) in order to meet and even exceed customers’ wishes and demands. In the data world, this second aspect relates to the freedom of data producers to autonomously choose their technology path, in accordance with the standards approved by the federated governance.

Last, but definitely not least, the Data Mesh architecture brings to the table ease of scalability (once you have all the hotels/datasets in one city, the system can grow to accommodate those of other cities as well) and reuse. Reuse means that the effort you spent creating one solution can, at least in part, be employed (reused) to create another. Let’s stick to the hotel analogy: if you built the booking platform last year and now want to do something similar for B&Bs, there is a lot you don’t have to redo from scratch. Of course, the “metadata” will be different (Bed and Breakfasts don’t have conference rooms), but you can still use the same system of user feedback and the same technology to gather information on prices and availability, which once again will be up to the B&B owner to keep up to date, and so on.

A no-brainer?

Put like that, it seems that “going with the Data Mesh” is a no-brainer. And that can be true for large corporations, but do keep in mind that building a Data Mesh is a mammoth task: if you only have three or four hotels, it goes without saying that it doesn’t make sense to build a booking platform. What’s important to keep in mind, though, is that a Data Mesh architecture, to express its full potential, requires a deep organizational change in the company. To cite the most obvious aspect, the data engineers need to “migrate” from the center (the Data Lake) to the data producers, to guide them in properly preparing the data, conforming to the federated governance rules, and exposing it correctly so that it can be found and utilized (thus also generating revenue for the data owner through internal sales). It also requires a change of mentality, so that the whole company can start to see data as a product and data producers as data owners, break free from the limitations and bottlenecks of a Data Lake, and reap the benefits of a truly distributed architecture and the new paradigm.

Luca Maestri, Chief Financial Officer of Apple, famously said that people tend to attribute the success of huge companies, the likes of Apple, Amazon, Google, or Facebook, to their being creative laboratories where a great number of innovative ideas can emerge, but his experience taught him that this is not the case. These companies succeed because they are “execution machines”. In other words, a great idea has no value if you cannot execute on it effectively and quickly: on time and on budget. But first, you need the right tools to be able to determine time and budget constraints. Creating a Data Mesh is a huge undertaking, but it means building the solid foundations that will support the evolution of your data-driven business. You can have all the data in the world in your Data Lake, but if you can’t leverage it effectively and sustainably, you won’t move forward. In today’s world, standing still means going backwards, so the only way to stay competitive is to create new products, services, and solutions for your customers. In order to be an “execution machine”, you need to be able to spend your time looking for opportunities instead of searching for your data, analyzing the marketplace and chasing new clients and upsell propositions instead of rummaging through your Data Lake.
If you can do all of that, you definitely deserve a relaxing, rewarding weekend, looking at the placid lake from your house and remembering the time when your Data Lake was equally still, unmoving and hard to see beneath the surface. Come Monday, it will be time for a new week of effective, efficient, data-driven new business.

 

On this topic, you might be interested in How and why Data Mesh is shaping the data management’s evolution, or you can learn more about how Data Mesh Boost can get your Data Mesh implementation started quickly.

STAY TUNED!

If you made it this far and you’re interested in other articles on Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

The rise of Big Data testing

What is big data testing?

Big data testing is the process of testing data for data integrity and processing integrity, and of validating the quality of big data to ensure it is flawless and helps the business derive the right insights, according to Hitesh Khodani, Test Transformation Manager.

Marco Scalzo, Big Data Engineer at Agile Lab, adds that when we talk about big data, we refer to a large collection of data, potentially unbounded in a real-time streaming model, on which we execute a pipeline of operations to process the data and store it in an external sink such as a database.
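
As a tiny, purely illustrative sketch of both points, the Python snippet below models a pipeline that reads from a (potentially unbounded) source, applies a data-quality check as one of the pipeline operations, and writes the surviving records to an external sink. The record structure and the checks are invented for the example; this is not the tooling either article discusses.

from typing import Iterable, Iterator

def source() -> Iterator[dict]:
    # Stand-in for a potentially unbounded stream, e.g. events read from a broker.
    yield {"order_id": 1, "amount": 25.0}
    yield {"order_id": 2, "amount": -3.0}   # an invalid record
    yield {"order_id": 3, "amount": 40.0}

def validate(records: Iterable[dict]) -> Iterator[dict]:
    # "Big data testing" in miniature: check integrity before data reaches the sink.
    for record in records:
        if "order_id" not in record or record.get("amount", -1) < 0:
            print(f"quarantined invalid record: {record}")
            continue
        yield record

def sink(records: Iterable[dict]) -> None:
    # Stand-in for an external sink such as a database table.
    for record in records:
        print(f"stored: {record}")

sink(validate(source()))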

Read the full article on The Software Testing News, the premium online destination for software testing news, reports, white papers, research and more.

Covering all aspects of software testing in all main verticals, Software Testing News will keep you informed and up to date.

Read now!