Elite Data Engineering

I think it’s time to freshen up the world’s perception of Data Engineering.

This article has the dual purpose of raising awareness and acting as a manifesto for all the people at Agile Lab along with our past, present, and future customers.

In the collective imagination, the Data Engineer is someone who processes data and makes it available to potential consumers. It’s usually considered a role focused on achieving a very practical result: getting data there — where and when it’s needed.

Unfortunately, this rather short-sighted view has led to big disasters in enterprises and beyond. Do you know why?

Data Engineering is a practice, not (just) a role.

In the past, many professionals approached the “extract value from data” problem as a technological challenge, architectural at most (DWH, Data Lake, now Cloud-native SaaS-based pipelines, etc.), but what has always been missing is the awareness that Data Engineering is a practice, and as such it needs to be studied, nurtured, built and disseminated.

At Agile Lab, we have always dissociated ourselves from this overly simplistic vision of Data Engineering, conscious of possible misunderstandings by a market expecting such a “basic service” from us. We’d like to state it once again: Data Engineering is not about moving data from one place to another, nor about executing a set of queries quickly (thanks to some technological help), nor about merely “supporting” data scientists. Data Engineering, as the word itself says, means engineering the data management practice.

When I explain our vision, I like to draw a parallel with Software Engineering: after so many years, it has entered a new era where writing code has moved way beyond design patterns, embracing governance techniques and practices such as:

  • source version control
  • testing and integration
  • monitoring and tracing
  • documentation
  • review
  • deployment and delivery
  • …and more

This evolution wasn’t driven by naïve engineers with time to waste; it happened because the industry realized that software has a massive impact, over time, on everything and everyone. Malfunctioning software with poor automation, not secure by design, and hardly maintainable or evolvable has an intrinsic cost (let’s call it Technical Debt) that is devastating for the organization that owns it (and, implicitly, for the ones using and relying on it).

To mitigate these risks, we all tried to engineer software production to make it sustainable in the long run, accepting the rise in up-front costs.

When we talk about Engineering, from a broader viewpoint, we aim precisely at this: designing detailed and challenged architectures, adopting methodologies, processes, best practices, and tools to craft reliable, maintainable, evolvable, reusable products and services that meet the needs of our society.

Would you walk over a bridge made of some wooden planks (over and over again, maybe to bring your kids to school)?

Now, I don’t understand why some people talk about Data Engineering when, instead, they refer to just creating some data or implementing ETL/ELT jobs. It’s like talking about Bridge Engineering while referring to laying some wooden planks between two pillars: here the wooden planks are the ETL/ELT processes connecting the data flows between two systems. Let me ask you: is this really how you would support an allegedly data-driven company?

I see a significant disconnection between business strategies and technical implementations: tons of tactical planning and technological solutions, but no strategic drivers.

It’s as if you wanted to overhaul a country’s road network and started building some roads or wooden bridges here and there. No, I’m sorry, it doesn’t work. You need a strategic plan, you need investments and, above all, you need to know what you are doing. You need to involve the best designers and builders to ensure that the bridges don’t collapse after ten years (we should have learned something recently). It is not something that can be improvised. You cannot throw workers at it without proper skills, methodology, and a detailed project while hoping to deliver something of strategic importance.

Fortunately, there has been an excellent evolution around data practice in the last few years, but real-world data-driven initiatives are still a mess. That doesn’t surprise me because, in 2022, there are still a lot of companies that produce poor-quality software (no unit-integration-performance testing, no CI/CD, no monitoring, quality, observability, etc.). This is due to a lack of engineering culture. People are trying to optimize and save in the present without caring about the future.

Since the DevOps wave arrived, the situation has improved a lot. Awareness has increased, but we are still far from a satisfactory average quality standard.

Now let’s talk about data

Data is becoming the nervous system of every company, large or small, a fundamental element of differentiation on the market in terms of service offering and automated support to business decisions.

If we genuinely believe that data is such a backbone, we must treat it “professionally”; if we really consider it a good, a service, or a product, we must engineer it. There is no such thing as a successful product that has not been properly engineered. The factory producing such products has undoubtedly been super-engineered to ensure low production costs along with no manufacturing defects and high sustainability.

Eventually, you must figure out whether data is that critical for your business and behave accordingly, as is usually done with everything else (successful) around us.

I can proudly say that at Agile Lab we believe data is crucial, and we assume that this is just as true for our customers.

That’s why we’ve been trying to engineer the data production and delivery process for years from a design, operation, methodology, and tooling perspective. Engineering the factory but not the product, or vice versa, definitely leads to failure in the real world. It is important to engineer both the data production lifecycle and the data itself as a digital good, product, and service. Data does not have only one connotation; it is naturally fickle. I’d define it as a system that must stay balanced because, exactly as in quantum systems, the reality of data is not objective but depends on the observer.

Our approach has always been to treat data like software (and in part, data is software), but without forgetting its greater gravity (Data Gravity — in the Clouds — Data Gravitas).

Here I see a world where top-notch Data Engineers make the difference for the success of a data-centric initiative. Still, all this is not recognized and well-identified as it is in software development, because only a few people know what a good data practice looks like.

But that’s about to change because, with the rise of Data Mesh, data teams will have to shift gears: they won’t have anyone else to blame or to take responsibility in their place, and they’ll have to work to the best of their ability to make sure their products are maintainable and evolvable over time. The various data products and their quality will be objectively measurable. Since they will be developed by different teams and not by a single one, comparisons will be inevitable: it will be straightforward to differentiate teams working with high-quality standards from those performing poorly, and we sincerely look forward to this.

The revolutionary aspect of Data Mesh is that, finally, we don’t talk about technology in Data Engineering! Too many data engineers focus on tools and frameworks, but practices, concepts, and patterns matter just as much, if not more: in the end, all technologies are due to become obsolete, while good organizational, methodological, and engineering practices can last.

Time to show the cards

So let’s come to what we think are the characteristics and practices that should be part of a good Data Engineering process.

Software Quality:

The software producing and managing data must be high-quality, maintainable, and evolvable. There’s no big difference between software implementing a data pipeline and a microservice in terms of quality standards! As for the workflow, ours is public and open source. You can find it here

Software design patterns and principles must also be applied to data pipeline development. Please stop scripting data pipelines (notebooks didn’t help here).

Data Observability:

Testing and monitoring must be applied to data as well as to software, because static test cases on data artifacts are just not enough. We do this by embracing the principles of DODD (Kensu | DODD). Data testing means considering the environment and the context where data lives and is observed. Remember that data offers different realities depending on when and where it’s evaluated, so it’s important to apply continuous monitoring to all the lifecycle stages, from development through deployment, not only to the production environment.

When we need to debug or troubleshoot software, we always try to have the deepest possible context. IDEs and various monitoring and debugging tools help by automatically providing the execution context (stack trace, logs, variables status, etc.), letting us understand the system’s state at that point in time.

How often did you see “wrong data” in a table with no idea of how it got there? How exhausting was it to hunt for the piece of code allegedly generating such data (hoping it’s versioned and not hardcoded somewhere), mentally re-create the corner case that might have caused the error, search for unit tests supposedly covering that corner case, and scrape the documentation hoping to find references to that scenario? It’s frustrating, because the data seems to have just dropped from the sky, with no traces left. I think 99% of the data pipelines implemented out there keep no links between the input (data), the function (software) and the output (data), thus making troubleshooting or, even worse, the discovery of an error a nightmare.

At Agile Lab, we apply the same principles seen for software: we always try to have as much context as possible to better understand what’s happening to a specific data row or column. We try to set up lineage information (the stack trace), data profiling (the general context), and data quality checks before and after the transformations we apply (the execution context). All this information needs to be usable right when needed and well organized.
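To make this concrete, here is a minimal sketch of the idea in plain Python/pandas, with hypothetical file paths, column names, and business logic: capture a small “context snapshot” of the data before and after a transformation and persist it next to the output, so that lineage and quality information is at hand when troubleshooting.

```python
import json

import pandas as pd


def profile(df: pd.DataFrame, stage: str) -> dict:
    """Capture a minimal 'execution context' for a dataset at a given stage."""
    return {
        "stage": stage,
        "rows": len(df),
        "null_ratio": df.isna().mean().round(4).to_dict(),  # per-column null rate
        "columns": list(df.columns),
    }


def transform(orders: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical business logic: keep shipped orders and compute the net amount.
    shipped = orders[orders["status"] == "SHIPPED"].copy()
    shipped["net_amount"] = shipped["amount"] - shipped["discount"]
    return shipped


orders = pd.read_parquet("input/orders.parquet")  # hypothetical input
ctx_in = profile(orders, stage="input")
result = transform(orders)
ctx_out = profile(result, stage="output")

result.to_parquet("output/shipped_orders.parquet", index=False)

# Persist the contexts next to the output so that, when someone later asks
# "how did this data get here?", the quality snapshot and lineage hints are available.
with open("output/shipped_orders_run_context.json", "w") as f:
    json.dump([ctx_in, ctx_out], f, indent=2)
```

In a real pipeline these contexts would typically be shipped to an observability backend rather than to a local JSON file, but the principle is the same: the transformation never runs without leaving a trace.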

“OK you convinced me: which tool can make all of this possible ?”

Well, the reality is that practice is about people — professionals — not tools. Of course, many tools help in implementing the practice, but without the right mindsets and skills they are just useless.

“But I have analysts creating periodic reports and they can spot unexpected data values and distributions”

True, but ex-post checks don’t solve the problem of consumers having probably already ingested faulty data (remember the data gravity thing?), and they don’t provide any clues about where and why the errors came up. This is to say that it makes a huge difference to proactively anticipate checks and monitoring so as to make them synchronous with data generation (and part of the lifecycle of the applications generating the data).

Decoupling:

It has become a mantra in software systems to have well-decoupled components that are independently testable and reusable.

We use the same measures with different facets in data platforms and data pipelines.

A data pipeline must:

  • Have a single responsibility and a single owner of its business logic.
  • Be self-consistent in terms of scheduling and execution environment.
  • Be decoupled from the underlying storage system.
  • Have I/O actions decoupled from business (data transformation) logic, so as to facilitate testability and architectural evolutions.

Also, it is essential to have well-decoupled layers at the platform level, especially storage and computing.
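As an illustration of the I/O decoupling point above, here is a minimal, hypothetical sketch (pandas-based, with invented paths and columns) of a pipeline whose business logic is a pure function, while reads and writes are injected from the outside:

```python
import pandas as pd


def transform(customers: pd.DataFrame) -> pd.DataFrame:
    """Pure business logic: no knowledge of where data comes from or goes to."""
    active = customers[~customers["churned"]].copy()
    active["full_name"] = active["first_name"] + " " + active["last_name"]
    return active


def run_pipeline(read, write) -> None:
    """Orchestration: I/O is injected, so storage can change without touching the logic."""
    write(transform(read()))


# Production wiring (hypothetical paths / storage layer):
run_pipeline(
    read=lambda: pd.read_parquet("s3://raw/customers/"),
    write=lambda df: df.to_parquet("s3://curated/active_customers/"),
)

# Test wiring: the very same business logic runs against an in-memory DataFrame.
run_pipeline(
    read=lambda: pd.DataFrame(
        {"first_name": ["Ada"], "last_name": ["Lovelace"], "churned": [False]}
    ),
    write=lambda df: print(df),
)
```

Because the transformation never touches storage directly, it can be unit-tested with in-memory data, and the storage layer can evolve without rewriting the business logic.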

Reproducibility:

We need to be able to re-run data pipelines in case of crashes, bugs, and other critical situations, without generating side effects. In the software world, corner case management is fundamental, just like (or even more so) in the data world (because of gravity).

We design all data pipelines based on the principles of immutability and idempotence. Regardless of the technology, you should always be in a position to (a minimal sketch of the idea follows the list):

  • Re-execute a pipeline gone wrong.
  • Re-execute a hotfixed/patched pipeline to overwrite the last run, even if it was successful.
  • Restore a previous version of the data.
  • Run a pipeline in recovery mode if you need to clean up an inconsistent situation. The pipeline itself holds the cleaning logic, not another job.
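As promised above, here is a minimal sketch of the idempotent-overwrite idea, assuming a hypothetical partition-per-execution-date layout and pandas-based I/O. The point is only that re-running the same logical run always converges to the same state:

```python
import shutil
from pathlib import Path

import pandas as pd


def run(execution_date: str) -> None:
    """Re-runnable by design: the output location is a pure function of the run
    parameters, and the write fully replaces it, so running the pipeline twice
    for the same date yields exactly the same state (no duplicates, no leftovers)."""
    source = pd.read_parquet(f"landing/events/{execution_date}/")  # hypothetical layout
    result = source.drop_duplicates(subset=["event_id"])           # hypothetical logic

    target = Path(f"curated/events/dt={execution_date}")
    if target.exists():
        shutil.rmtree(target)  # wipe the partition before rewriting it
    target.mkdir(parents=True)
    result.to_parquet(target / "part-000.parquet", index=False)


# A failed or hotfixed run is simply re-executed for the same execution_date:
run("2022-01-31")
run("2022-01-31")  # idempotent: the second run overwrites, with no side effects
```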

So far, we have talked about the data itself, but an even more critical aspect is the reproducibility of the underlying environments and platforms. Being able to (re)create an execution environment with no or minimal “manual” effort can really make the difference for on-the-fly tests, no-regression tests on large amounts of data, environment migrations, platform upgrades, and switching on/simulating disaster recovery. It’s an aspect to be taken care of, and it can be addressed through IaC and architectural blueprint methodologies.

Data Versioning:

The observability and reproducibility issues are rooted in the ability to move nimbly across data from different environments, as well as across different versions in the same environment (e.g., rolling back a pipeline that terminated with inconsistent data).

Data, just like software, must be versioned and portable with minimum effort from one environment to another.

Many tools can help (such as LakeFS), but when they are not available it’s still possible to set up a minimum set of features allowing you to efficiently:

  • Run a pipeline over an old version of data (e.g., for repairing or testing purposes).
  • Make an old version of the data available in another environment.
  • Restore an old version of the data to replace the current version.

It is critical to design these mechanisms from day zero because they can also impact other pillars.
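Where a tool like LakeFS is not available, even a naive versioning scheme goes a long way. Here is a hedged sketch (invented layout and manifest format) of the minimum set of features listed above:

```python
import json
import time
from pathlib import Path
from typing import Optional

import pandas as pd

ROOT = Path("warehouse/customers")         # hypothetical dataset root
POINTER = ROOT / "_current_version.json"   # tiny manifest acting as the "current" pointer


def write_new_version(df: pd.DataFrame) -> str:
    """Every run writes an immutable new version instead of mutating data in place."""
    version = time.strftime("v%Y%m%d%H%M%S")
    path = ROOT / version
    path.mkdir(parents=True)
    df.to_parquet(path / "data.parquet", index=False)
    POINTER.write_text(json.dumps({"current": version}))
    return version


def read(version: Optional[str] = None) -> pd.DataFrame:
    """Read the current version by default, or any old version explicitly
    (e.g., to re-run a pipeline against yesterday's data in another environment)."""
    version = version or json.loads(POINTER.read_text())["current"]
    return pd.read_parquet(ROOT / version / "data.parquet")


def rollback(version: str) -> None:
    """Restoring an old version is just repointing the manifest; no data is rewritten."""
    POINTER.write_text(json.dumps({"current": version}))
```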

Vulnerability Assessment:

It is part of an engineer’s duties to take care of security in general. Data security in particular is becoming a more and more important topic and must be approached with a by-design logic.

It’s essential to look for vulnerabilities in your data and to do it in an automated and structured way, so as to discover sensitive data before it accidentally goes live.

This topic is so important at Agile Lab that we have even developed proprietary tools to promptly find out whether sensitive information is present in the data we are managing. These checks are continuously active, just like the Observability checks.
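We obviously can’t share the proprietary tooling here, but a deliberately simplified sketch of the kind of automated scan involved might look like this (hypothetical patterns, dataset, and threshold; a real ruleset would be far richer):

```python
import re

import pandas as pd

# Hypothetical, deliberately simple detectors.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scan_for_sensitive_data(df: pd.DataFrame, sample_size: int = 1000) -> list:
    """Scan a sample of every string column and report columns that look sensitive."""
    findings = []
    sample = df.head(sample_size)
    for column in sample.select_dtypes(include="object").columns:
        values = sample[column].dropna().astype(str)
        for label, pattern in PATTERNS.items():
            hits = int(values.str.contains(pattern).sum())
            if hits > 0:
                findings.append({"column": column, "type": label, "matches": hits})
    return findings


# Wired into the pipeline, the scan can block publication when it finds something:
findings = scan_for_sensitive_data(pd.read_parquet("output/customer_events.parquet"))
if findings:
    raise RuntimeError(f"Potentially sensitive data detected: {findings}")
```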

Data Security at rest:

Fine-grained authentication and authorization mechanisms must exist to prevent unwanted access to (even just portions of) data (see RBAC, ABAC, column and row filters, etc.).

Data must be protected, and processing pipelines are just like any other consumer/producer, so policies must be applied to both storage and (application) users; that’s also why decoupling is so important.

Sometimes, due to strict regulation and compliance policies, it’s not possible to access fine-grained datasets, although aggregates or statistical calculations might still be needed. In this scenario, Privacy-Preserving Data Mining (PPDM) comes in handy. Any well-engineered data pipeline should compute its KPIs with the lowest possible privileges to avoid leaks and privacy violations (audits arrive suddenly and when you don’t expect them). Have fun the moment you need to troubleshoot, ex post, a pipeline that embeds PPDM, which basically gives you no data context to start from!
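As one simple flavor of this idea, here is a hedged sketch of a privacy-preserving aggregation: the pipeline publishes only aggregates and suppresses groups that are too small to be safely shared (hypothetical threshold and column names):

```python
import pandas as pd

MIN_GROUP_SIZE = 20  # hypothetical threshold: smaller groups could allow re-identification


def revenue_by_region(transactions: pd.DataFrame) -> pd.DataFrame:
    """Compute an aggregate KPI without ever exposing row-level records:
    groups too small to be safely published are suppressed."""
    grouped = (
        transactions.groupby("region")
        .agg(customers=("customer_id", "nunique"), revenue=("amount", "sum"))
        .reset_index()
    )
    # Keep only sufficiently large groups, and publish the KPI without the raw counts.
    return grouped[grouped["customers"] >= MIN_GROUP_SIZE].drop(columns="customers")
```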

Data Issues:

The Data Observability chapter provides the capabilities to perform root cause analysis, but how do we document the issues? How do we behave when a data incident occurs? The last thing you want to do is “patch the data”; what you need to do is fix the pipelines (applications), which will then fix the data.

First, data issues should be reported just like any other software bug. They must be documented accurately, associated with all the contextual information extracted from data observability, to clarify what was expected, what happened instead, which tests were in place, and why they did not intercept the problem.

It is vital to understand the corrective actions at the application level (they could drive automatic remediation) as well as the additional tests that need to be implemented to cover the issue.

Another aspect to consider is the ability to associate data issues with the specific versions of both the software and the data (data versioning) responsible for the issue because, sometimes, it’s necessary to put workarounds in place to restore business functionality first. At the same time, while data and software evolve independently, we must be able to reproduce the problem and fix it.

Finally, to avoid repeating the same mistakes, it’s crucial to conduct a post-mortem with the team and ensure the findings are saved in a wiki-like space or somehow associated with the issue, searchable and discoverable.
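As an illustration (not a prescribed format), a data issue could be captured as a structured record carrying exactly the context described above:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataIssue:
    """A data incident documented like any other bug, with the context needed to reproduce it."""
    title: str
    expected: str                 # what the data should have looked like
    observed: str                 # what actually landed in the dataset
    pipeline: str                 # owning application / pipeline
    software_version: str         # e.g. git tag of the code that produced the data
    data_version: str             # version/snapshot of the affected dataset
    lineage_refs: List[str] = field(default_factory=list)   # upstream datasets involved
    missing_tests: List[str] = field(default_factory=list)  # tests that should have caught it
    remediation: str = ""         # fix applied to the application, never to the data
    post_mortem_link: str = ""    # where the lessons learned are recorded


issue = DataIssue(
    title="Negative net_amount in curated orders",
    expected="net_amount >= 0 for all shipped orders",
    observed="1.2% of rows have net_amount < 0 since 2022-01-15",
    pipeline="orders-curation",
    software_version="v2.3.1",
    data_version="v20220115",
    lineage_refs=["landing/orders", "landing/discounts"],
    missing_tests=["discount greater than amount corner case"],
)
```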

Data Review:

Just as code reviews are a good practice, a good data team should perform and document data reviews. These are carried out at different development stages and cover multiple aspects such as:

  • Documentation completeness
  • Semantic and syntactic correctness
  • Security and compliance
  • Data completeness
  • Performance / Scalability
  • Reproducibility and related principles
  • Test case analysis and monitoring (data observability)

We encourage adopting an integrated approach across code base, data versioning, and data issues.

Documentation and metadata:

Documentation and metadata should follow the same lifecycle as code and data. The computational governance policies of the data platform should prevent new data initiatives from being brought into production if they are not paired with the proper level of documentation and metadata. In fact, in the Data Mesh this aspect will become a foundation for having good data products that are genuinely self-describing and understandable.

Ex-post metadata creation and publishing can introduce built-in technical debt due to the potential misalignment with the actual data already available to consumers. It’s therefore good practice to make sure that metadata resides in the same repositories as the pipeline code generating the data, which implies synchronization between code and metadata and ownership by the very same team. For this reason, metadata authoring shouldn’t take place in Data Catalogs, which are supposed to help consumers find the information they need. Embedding a metadata IDE to cover up malpractices in metadata management won’t help in cultivating this culture.
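A computational governance policy of this kind can be as simple as a CI step that refuses to deploy a pipeline whose repository doesn’t carry complete metadata. A hypothetical sketch (invented file name and required fields):

```python
import sys
from pathlib import Path

import yaml  # assumes PyYAML and a metadata.yaml kept in the pipeline repository

REQUIRED_FIELDS = ["name", "owner", "description", "schema", "sla"]  # hypothetical policy


def check_metadata(repo_root: str = ".") -> None:
    """Fail the build if the pipeline repo doesn't carry complete metadata,
    so undocumented data never reaches production."""
    path = Path(repo_root) / "metadata.yaml"
    if not path.exists():
        sys.exit("metadata.yaml is missing: metadata must live with the pipeline code")
    metadata = yaml.safe_load(path.read_text())
    missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
    if missing:
        sys.exit(f"metadata.yaml is incomplete, missing fields: {missing}")


if __name__ == "__main__":
    check_metadata()
```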

Performance:

Performance and scalability are two engineering factors that are critical in every data-driven environment. In complex architectures, there can be hundreds of aspects with an impact on scalability, and good data engineers should treat them as foundational design principles.

Once again, I want to highlight the importance of a methodological approach. There are several best practices to encourage:

  • Always isolate the read, compute, and write phases by running dedicated tests that point out specific bottlenecks, thus allowing better-targeted optimization: when you’re asked to reduce the execution time of a pipeline by 10%, it must be clear which part of the process is worth working on. A warning here: isolating the various phases brings up the themes of decoupling and single responsibility. If your software design doesn’t respect these principles, it won’t be modular enough to allow clear isolation of the steps. For example, Spark has a lazy execution model where it is very hard to discern read times from transformation and write times if you are not prepared to do so (see the sketch after this list).
  • Always run several tests on each phase with different (magnitudes of) data volumes and computing resources to extract scaling curves that will allow you to make reasonable projections when the reference scenario will change: it’s always unpleasant to find out that input data is doubling in a week and we have got no idea of how many resources will be needed to scale up. Worst case scenario to performance-test pipelines against should be designed with long-term data strategy in mind.
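As referenced in the first bullet, here is a rough PySpark sketch of phase isolation (hypothetical paths and columns): caching and counting force materialization, so read, transform, and write times can be observed separately despite Spark’s lazy execution. The numbers are indicative rather than a precise benchmark, but they are enough to see where the bottleneck is.

```python
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read phase: cache + count force Spark to materialize the input, otherwise
# lazy execution would fold the read time into the later stages.
start = time.perf_counter()
raw = spark.read.parquet("landing/events/").cache()
raw.count()
print(f"read: {time.perf_counter() - start:.1f}s")

# Transform phase: the input is already cached, so this measures (mostly) the business logic.
start = time.perf_counter()
curated = raw.dropDuplicates(["event_id"]).withColumn("day", F.to_date("ts")).cache()
curated.count()
print(f"transform: {time.perf_counter() - start:.1f}s")

# Write phase: the result is already materialized, so this measures (mostly) the write itself.
start = time.perf_counter()
curated.write.mode("overwrite").parquet("curated/events/")
print(f"write: {time.perf_counter() - start:.1f}s")
```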

Wrapping up

If you’ve come this far, you’ve understood what Data Engineering means to us.

Elite Data Engineering does not focus only on the data: it guarantees that the data will be of high quality, that errors will be easy to find and correct, that the data will be simple and inexpensive to maintain and evolve, and that it will be safe and straightforward to consume and understand.

It’s about practice, data culture, design patterns, and a broader sensitivity across many more aspects than we might be used to when “just dealing with software”. At Agile Lab, we invest every single day in continuous internal training and mentoring on this, because we believe that Elite Data Engineering is the key success factor of every data-driven initiative.

If you’re interested in other articles on the Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

January 2022

What’s in store  |  January 2022

The CDP migration is coming up: are you ready to face this new challenge?
Agile Lab is willing to work with you to support and guide you through this key process.
Learn how!

DATA MESH IN ACTION
Another interesting article by Paolo Platter, Agile Lab CTO, on Data Mesh and how the “Data Product Flow” can help companies identify their Data Products.
Read now!

IDC Analyst Brief: Accelerate data-driven transformation with Data Mesh
Why Data Mesh can promote a faster, more efficient and more scalable data-driven transformation process.
Download the report!


Next Thursday, February 3rd, at 18.30 CET, don’t miss the talk organized by Agile Lab Bari, with Nicola Guglielmi, Cloud Solution Architect, as speaker. It will show a use case of GCP for developing an analytics solution in a serverless environment.
Sign up soon!


Next February 24th, at 19 CET, Agile Lab Veneto is going to run a Tech&More about Materialize.
With Marta Paes, a Senior DevEx Engineer at Materialize, we’ll cover the basic concepts of streaming SQL and highlight what makes Materialize unique in comparison to other tools.
Sign up soon!

The book is a leadership fable by Patrick Lencioni.
It is about overcoming the five dysfunctions of a team which, according to the author, are, in order: absence of trust, fear of conflict, lack of commitment, avoidance of accountability, and inattention to results.
The book is divided into two parts: the first is more theoretical, describing causes and countermeasures for each dysfunction; the second is the narrative part of the book, which goes over the same points from the perspective of a leader trying to solve these problems.


Do you know how many people we hired in 2021? …27!
And we plan to hire as many, if not more, in 2022.
We are a young and dynamic company, always looking for the best talent on the market. Would you like to be a part of our team?
Have a look at our open positions and send us your CV!


10 practical tips to reduce Data Mesh’s adoption roadblocks and improve domains’ engagement

Hi! If you got here, it’s probably because:

A) you know what Data Mesh is;

B) you are about to start (or already started) a journey towards Data Mesh;

C) you and your managers are sure that Data Mesh is the way to sky-rocket the data-driven value for your business/company; but

D) you are among the few who have bought in, while around you there are skepticism and worries.

The good news is: you’re DEFINITELY not alone! 

At Agile Lab we’re driving Data Mesh journeys for several customers of ours and, every time a new project kicks off, this roadblock is always one of the first to tackle.

Hoping to help out anyone just approaching the problem and/or looking for some tips, here are 10 approaches we have found to be somehow successful in real Data Mesh projects. We’re gonna group them by “area of worry”.

BUDGET 💸

Well, you know, in the IT world every decision is usually based on the price vs. benefit trade-off.

Domains are expected to start developing Data Products as soon as possible and we need them on board, but usually “there’s no budget for new developments” or “there’s no budget to enlarge the cluster size”. There you go:

1. Compute resources to consume data designed as part of consumers’ architecture

This is a general architectural decision rather than a tip: you should design your output ports (and their consumption model) so as to provide shared, governed access to potential consumers. With simple ACLs over your resources, you can allow secure read-only access to your Data Products’ output ports, so that no compute resources are necessary at the (source) domain/Data Product level in order to consume that data: all the power is required (AND BILLED) at the consumers’ side.

OK, so let’s say we figured out a way to avoid (reduce?) the TCO of our Data Products. But what about the budget to develop them?

2. Design a roadmap of data products creation that makes use of available operational systems maintenance budget streams

In complex organizations there are usually several budget streams flowing for operational systems maintenance: leverage them to create Source Aligned Data Products (which are seamlessly part of the domain, so nobody will argue)!

Very well, we’re probably now getting some more traction, but why should we maintain long-term ownership over this stuff?

3. Pay-as-you-consume billing model

I know, this is bold and, so far, still in the utopia sphere. But there’s a key point to catch: if you figure out a way to “give something back” to Data Products’ creators/owners/maintainers, be that extra budget for every 5 consumers of an output port or participation by the consumers in the maintenance budget, anything can lift the spirit and diffuse the “it’s not a give-only thing” mood. If you get it, you win.

But let’s say I’m the coordinator of the Data Mesh journey and I received a budget just for a PoC… well, then:

4. Start small, with the most impactful yet least “operative business-critical” consumer-driven data opportunity

No big bang, don’t make too much noise (yet): pick the right PoC use case and make it your success story to promote the adoption company-wide, without risking too much.

This will probably make the C-level managers happy. The same ones who are repeatedly asking for the budget to migrate (at least) the analytical workloads to the Cloud, without receiving it “because we’re still amortizing the expense for that on-prem system”. Well, here’s the catch: if they want you to build a Data Mesh, you’re gonna need an underlying stack that allows you to develop the “Self-Serve Infrastructure-as-a-Platform”. That’s it:

5. Leverage the Data Mesh journey as an opportunity to migrate to the Cloud

Your domain teams will probably be happy to fast-forward to the present day technology-wise and work on Cloud-based solutions.

This last tip brings to light the other “area of worry”:

THE UNKNOWN ⚠

Data Mesh is an organizational shift that involves processes and, most of all, people.

People are usually scared of what they don’t know, especially in a work environment where they might be “THE reference” in everybody’s eyes for system XY, and they’d feel they were losing power and control by “simply becoming part of a decentralized domain’s Data Product Team”.

First of all, you’re gonna need to:

6. Train people

Training and mentoring key (influencer?) people across the domains is a proven way of making people UNDERSTAND WHY the company decided to embrace this journey. Data Mesh experts should mentor these focal points so that they can, in their turn, spread the positive word within their teams.

But teams are probably scared too of being forced to learn new stuff, so:

7. Make people with a “too legacy” technical background understand this is the next big wave in the data world

Do they really wanna miss the train? Changes like this one might come once every 10 or 20 years in big organizations: don’t waste the opportunity and get out of that comfort zone!

Nice story, so are you telling me we have organizational inefficiencies and we don’t make smart use of our data to drive or automate business decisions? Well, my dear, it’s not true: MY DOMAIN publishes data that I’ve been told (by a business request) should be used somewhere.

I hear you, but understand that your view might be limited. There are plenty of others complaining about change management issues, dependency issues, slowness in pushing evolutions and innovations, lack of ownership and quality, and so on.

8. Organize collaborative workshops where business people meet technical people and share pain points from their perspectives

You will be surprised by the amount of “hidden technical debt” that will come out, and people will start empathizing with each other: this will start spreading the “my data can bring REAL VALUE” mindset across the organization (because real people are complaining right in front of them).

Does this end once the Data Products have been released? Absolutely not! We talked about «long-term ownership» so we need to provide «long-term motivation»:

9. Collect clear insights about the value produced with data and make it accessible to everybody within the organization (through the platform)

This can start as simple as «number of consumers of a Data Product» (what if it’s zero?) and evolve into more complex metrics like time-to-market (the time to bring a new Data Product into production, which is expected to decrease), the network effect (the more interconnections across Domains, the more value is probably added afterwards), and other metrics more specific and valuable to your corporation.

You convinced me. I will develop Data Products. Where do I start?

10. Organize collaborative workshops to identify data opportunities

Following the DDD (Domain-Driven Design) practice that the first pillar of Data Mesh, Domain-oriented Decentralized Ownership, is based on, and combining it with the Data-as-a-Product pillar, we get a “business decisions”-driven design model to identify data opportunities (i.e., Data Products to be created in order to support/automate data-driven business decisions).

On this topic, you might be interested in the Data Product Flow we designed at Agile Lab, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Roberto Coluccio, Agile Lab Data Architect

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

Customer 360 and Data Mesh: friends or enemies?

Raise your hand if you saw, were asked to design, tried to implement, or struggled with the “Customer 360” view/concept in the last 5+ years…

Come on, don’t be shy …

How many Customer 360 stories did you see succeed? Me? Just… a few, let’s say. This made me ask: why? Why put so much effort into creating a monolithic 360 view, thus creating a new maintenance-evolution-ownership nightmare silo? Some answers might fit here, in the context of centralized architectures, but recently a new antagonist to fight against came to town: Data Mesh.

“Oh no, another article about the Data Mesh pillars!”

No, I’m not gonna spam the web with yet another article about the Data Mesh principles and pillars; there’s plenty out there, like this, this, or those, and if you got here it’s because you might already know what we’re talking about.

The question that arises is:

How does the (so struggling or never really achieved) Customer 360 view fit into the Data Mesh paradigm?

Well, in this article I’ll try to come up with some options, coming from real Agile Lab customers approaching the “Data Mesh journey”.

This fascinating principle of domain-oriented decentralized ownership, inherited (in terms of effectiveness) from the microservices world, brings to light the necessity to keep tech and business knowledge together, within specific bounded contexts, so as to improve the autonomy and velocity of the data products’ lifecycle along with smoother change management. Quoting Zhamak Dehghani:

Eric Evans’s book Domain-Driven Design has deeply influenced modern architectural thinking, and consequently the organizational modeling. It has influenced the microservices architecture by decomposing the systems into distributed services built around business domain capabilities. It has fundamentally changed how the teams form, so that a team can independently and autonomously own a domain capability.

Examples of well-known domains are Marketing, Sales, Accounting, Product X, or Sub-company Y (probably with subdomains). All of these domains probably have something in common, right? And here we connect with the prologue: the customer. We all agree on the importance of this view, but do we really need it to be materialized as a monolithic entity of some sort, on a centralized system?

I won’t answer this question, but I’ll make another one:

Which domain would a customer 360 view be part of?

Remember: decentralization and domain-driven design imply having clear ownership which, in the Data Mesh context, means:

  • decoupling from the change management process of other domains
  • owning the persistence (hence: storage) of the data
  • providing valuable-on-their-own views at the Data Products’ output ports (is a holistic materialized 360 customer view really valuable on its own?) — this is part of the “Data as a Product” pillar, to be precise
  • guaranteeing no breaking changes to domains that depend on the data we own (i.e., owning the full data lifecycle)
  • (last but not least) having solid knowledge (and ownership) of the business logic orbiting around the owned data. Examples: knowing the mapping logic between primary and foreign keys (in relational terms), knowing the meaning of every bit of information that is part of that data, and knowing which processes might influence such data’s lifecycle

If you can’t answer the above question right now, DON’T WORRY: you’re not alone! It just means you too are starting to feel the friction between the decentralized ownership model and the centralized customer 360 view.

IMHO, they are irreconcilable. Here’s why, with respect to the previous points:

  • centralized ownership over all the possible customer-related data would couple the change management processes of all the domains: a 360 view would be the only point of access for customer data, thus requiring strong alignment across the data sources feeding the customer 360 (which usually ends up slooooowing down every innovation or evolution initiative). Legit question: “So, does this mean we should never join together data coming from different domains?” — Tough but realistic answer: “You should do that only if it makes sense, if it produces clear value, and if it doesn’t require further interpretation or effort on the consumers’ side to grasp such value”.
  • If you bind together data coming from different domains, it must be to create a valuable-on-its-own data set. The mere fact that you read, store and expose data would make you its owner; that’s why a customer 360 persisted view would make sense only if you actually owned the whole dataset, not just a piece of it. Furthermore, Data Products require immutable and bi-temporal data at their output ports: can you imagine the effort of delivering such features over such a huge data asset?
  • valuable-on-its-own Data Products, well, I hope the previous 2 points already clarified this point 😊
  • If we sort out all the previous issues and actually become the owner of a customer 360 persisted materialized view, while still being compliant with the Data Mesh core concepts, we should never push breaking changes towards our data consumers. If we do need to make a breaking change, we must guarantee sustainable migration plans. Data versioning (like LakeFS or Project Nessie) could come in handy here, but there might be several organizational and/or technical ways to do so. Having many dependencies (as we would, owning a customer 360 view) would make it VERY hard to always have a smooth data lifecycle and avoid breaking changes, since we could find ourselves simply overwhelmed by breaking changes made at the source operational systems’ side.
  • The whole Data Mesh idea emerged as a decentralization need, after seeing so many “central data engineering / IT teams” fall under the pressure of the whole organization relying on them for quality data. It was (is) a bottleneck, in many cases worsened by the fact that these centralized tech teams just couldn’t have sufficiently strong business knowledge of every possible domain (while, unfortunately, they were developing and managing all the ETLs/ELTs of the central data platform). For this very same reason, the Customer 360 team would be required to master the whole business logic around such data, i.e. to be THE acknowledged experts of basically every domain (having customer-related data) — just impossible.
(Figure: a Customer 360 logical view, with domains owning only a slice of it.)

OK, I’m done with the bad news 😇 Let’s start with the good ones!

Customer 360 view and Data Mesh can be friends

As long as it remains a holistic logical/business concept. Decentralized domains should own the data related to their bounded contexts, even when it refers to the customer. Domains should own a slice of the 360 view, and slices should be correlated together to create valuable-on-their-own Data Products (e.g., just a join between sales and clickstreams doesn’t provide any added value, but a behavioral pattern on the website combined with the number of purchases per customer with specific browsing behaviors does).

In order to achieve that, domains and, more in general, data consumers must be capable of correlating customer-related data across different domains with ease. They must be facilitated at the joining-logic level, which means they should NOT be required to know the mapping between domain-specific keys, surrogate keys, and customer-related keys. I’ll try to expand on this last element.

Globally Unique Identifiable Customer Key

According to the Data Mesh literature, data consumers shouldn’t always copy (ingest in the first place) data in order to do something with it, since storing data means having ownership over it. As a technical step justified by volumes or other constraints of a single internal ETL step, that’s OK, but Consumer Aligned Data Products shouldn’t just pull data from another Data Product’s output port, perform a join or append another column to the original dataset, and publish at their output port an enhanced projection of some other domain’s data, for many reasons, but I won’t digress on this (again, it’s part of the Data as a Product pillar). What data consumers need is a well-documented, globally identifiable, and unique key/reference to the customer, in order to perform all the possible correlations across different domains’ data. The literature would also call these polysemes.
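To make the consumer-side experience concrete, here is a hedged pandas sketch (hypothetical output-port paths, and a global key simply called `cid`) of correlating two domains’ Data Products directly on the shared customer key, with no domain-specific key mapping involved:

```python
import pandas as pd

# Hypothetical output ports of two Data Products, both already carrying the
# globally unique customer key (here simply called "cid") as a polyseme.
sales = pd.read_parquet("dp/sales/output_port/purchases/")
clicks = pd.read_parquet("dp/web/output_port/click_sessions/")

# The consumer never needs to know domain-specific or surrogate keys:
# correlation happens directly on the shared customer identifier.
behaviour = (
    clicks.groupby("cid")
    .agg(sessions=("session_id", "nunique"))
    .join(sales.groupby("cid").agg(purchases=("order_id", "nunique")), how="inner")
    .reset_index()
)
```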

Note: well-documented should become a computational policy (do you remember the Federated Governance pillar of Data Mesh?) requiring, for example, that in the Data Product’s descriptor (want to contribute to our standardization proposal?) a specific set of metadata describes where a certain field containing such a key comes from, which domain generated it, and what business concepts it points to. Maybe that could also be done leveraging a syntax or language that will facilitate the automated creation (thank you, Self-Serve Infrastructure-as-a-Platform) of a Knowledge Graph afterwards.
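Purely as an illustration (this is not the actual descriptor specification), such provenance metadata for the field carrying the global key could look something like:

```python
# A purely illustrative fragment of how an output port's descriptor could
# declare the provenance of the global customer key.
output_port_schema_fragment = {
    "name": "cid",
    "type": "string",
    "description": "Globally unique customer identifier (golden record key)",
    "provenance": {
        "issued_by": "customer-mdm",           # system that mints the key
        "owning_domain": "customer-registry",  # domain accountable for it
        "business_concept": "Customer",        # concept it points to, e.g. for a knowledge graph
    },
}
```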

OK, the concepts of polysemes and of a globally unique, identifiable customer-related key are not new to you, and they shouldn’t be; what I want to shine a light on is how to combine them with the Data Mesh paradigm, especially because several architectural patterns are technically possible but just a few (one?) of them guarantee long-term scalability and compliance with the Data Mesh pillars.

An MDM is still a good ally

The question you might have asked yourself at this point is:

Who has the ownership of issuing such a key?

The answer probably lies in our good old friend, the Master Data Management (MDM) system, where the customer’s golden record can be generated. There’s a lot of literature (example) on that concept, also because it’s definitely not new in the industry (and that’s why a lot of companies approaching Data Mesh are struggling to understand how to make it fit into the picture, since it can’t just be thrown away after all the effort spent on building it up).

Note: eventually, such a transactional/operational system might also lead to the development of a related Source Aligned Data Product, but it should just be considered a source of the customers’ registry information.

OK, but who should then be responsible for performing the reverse lookup, i.e. mapping a domain-specific record (key) to the related customer (golden record) key?

We narrow down the spectrum into 3 possible approaches. To facilitate the reading of what follows, I’d like to point out a few things first:

  • SADP = Source Aligned Data Product
  • CADP = Consumer Aligned Data Product
  • CID = Customer MDM unique globally identifiable key
  • K = example of a domain-specific primary key
(The three approaches were illustrated with diagrams: Approach 1, “The Ugly”; Approach 2, “The Bad”; Approach 3, “The Good”.)

The latter approach is what could make a customer 360 view shine again! This way, domains — and in particular the operational systems’ teams — preserve the ownership of the mapping logic between the operational system’s data key and the CID; in the analytical plane, no interactions with the operational plane are required to create Source Aligned Data Products other than pulling data from the source systems; and at the Consumer Aligned Data Products’ side, no particular effort is spent correlating customer-related data coming from different domains.

Disclaimer 1: approach 3 requires strong near-real-time alignment between the Customer MDM system and the other operational systems, since a misalignment could imply publishing data from the various domains without a brand-new available CID attached (many MDM systems run their match-and-merge logic in batch, and many operational systems might update the CID field with batch reverse lookups well after the domain data is generated).

Disclaimer 2: a fourth approach would be possible, i.e. having microservices issue global IDs in near real time, which then become the master key in all the domains holding customer-related data. This is usually achievable in custom implementations only.

Wrapping up

In the Data Mesh world, domain-driven decentralized ownership over data is a must. The centralized, monolithic Customer 360 approach doesn’t fit the picture, so a shift is required in order to maintain what actually matters most: the business principle of customer-centric insights, derived from the correlation of data taken from different domains (owners) via a well-documented and well-defined globally unique identifier, probably generated at the Customer MDM level and integrated “as left as possible” into the operational systems, so as to reduce the burden, on the Data Mesh consumers’ side, of knowing and applying the mapping logic between domain-specific keys and the Customer MDM key.

. . . 

What has been presented here has been discussed with several customers and seemed to be the best option so far. We hope the hype around Data Mesh will bring some more options in the next few years, and we will keep our eyes open; but if you have already put a different approach in place, I’ll be glad to hear more.

If you’re interested in other articles on the Data Mesh topics, you can find them on our blog, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Roberto Coluccio, Agile Lab Data Architect

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.