Data Product Specification: power up your metadata layer and automate your Data Mesh with this practical reference

Just like every informative piece of content in the Data Mesh area should, I’d like to start by quoting the reference article on the topic, by Zhamak Dehghani:

[…] For data to be usable there is an associated set of metadata including data computational documentation, semantic and syntax declaration, quality metrics, etc; metadata that is intrinsic to the data e.g. its semantic definition, and metadata that communicates the traits used by computational governance to implement the expected behavior e.g. access control policies.

A “set of metadata” is therefore to be associated with our Data Products. I’ve read a lot of articles (I’m gonna report some references along the way) about why they’re important and what can theoretically be achieved by leveraging them… they all make a lot of sense. But — as frequently happens when someone talks about Data Mesh related aspects — there’s often a lack of “SO WHAT”.

For this reason, after a brief introduction I’d like to share a practical approach we are proposing at Agile Lab. It’s open-source, it’s absolutely not perfect for every use case, but it’s evolving along with our real-world experience of driving enterprise customers’ Data Mesh journeys. Most importantly, it’s currently used — in production, not just in articles — at some important customers of ours to enable the automation needed by the platform supporting their Data Mesh.

Why metadata are so important

I believe that automation is the only way to survive in a jungle of domains willing to create Data Products. Automation with respect to:

  1. easy and fast creation of Data Products leveraging pre-built reusable templates that guarantee out-of-the-box compliance to the Data-as-a-Product principle;
  2. implementation and execution of the Federated Governance policies (which therefore become “computational”);
  3. enablement of Self-Service Infrastructure, as a fundamental pillar to provide Decentralized Domains the autonomy and ownership they need.

Together, these pillars are all mandatory to reach scale. They are complex to set up and spread, both as cultural and technical requirements, but they are foundational if we want to build a Data Mesh and harvest the so-widely promised value from our data.

But how can we get to this level of automation?
By leveraging a metadata model representing/associated with a Data Product.

Such a model must be general, technology agnostic, and conceived as the key enabler for the automation of a Data Mesh platform, understood as an ecosystem of components and tools taking actions based on specific sections of this metadata specification. The specification must be standardized to allow interoperability across the platform’s components and services that need to interact with each other.

State of the art

No clear sight due to an Atlantic perturbation — 2014, Madeira Island.
Photo by Roberto Coluccio — all rights reserved.

In order to provide a broader view on the topic, I think it’s important to report some references to understand what brought us to do what we did (OK, now I’m creating hype on purpose 😁).

  • Feed the Control Ports: Arne Roßmann in this article describes how control ports of Data Products should serve foundational APIs by leveraging metadata collected at different levels.
  • Active Metadata: Prukalpa Sankar in this article mentions a commercial product tackling the active metadata space as a way to autonomously collect precious pieces of information, from SQL plans (to understand data lineage) to Data Products’ update rates, and other very cool stuff.
  • Metadata types: Zeena in this post introduces their Data Catalog solution after reporting a very interesting differentiation of metadata into DESCRIPTIVE, STRUCTURAL, ADMINISTRATIVE.
  • Standards for metadata in general: OpenMetadata is doing excellent work in creating standards and a framework for metadata management, including specific vocabulary and syntax (even if in some cases it becomes quite complex IMHO).
  • General Data Product definition: what gets closest to our vision is OpenDataProducts, a good piece of work that factors out some important information around the general Data Product concept (e.g. Billing and SLA). However, generalizing too much is risky and, in this case, I believe they got a little too far from the way Data Mesh requires a Data Product to be described and built. For example, there’s no mention of Domains, Schema, Output Ports, or Observability.

Judging from these references, there’s still no clear metadata-based representation of a Data Product that specifically addresses the Data Mesh and its automation requirements.

How we envision a Data Product Specification

We believe in a declarative, idempotent, and versioned Infrastructure-as-Code approach. The goal is to have a standardized yet customizable specification providing a holistic representation of the Data Mesh’s architectural quantum: the Data Product.

Principles involved:

  • Data Product components (input ports, output ports, workloads, staging areas)
  • self-documenting Data Product (from schema to dependencies, to description and access patterns)
  • clear ownership (from Domain to Data Product team)
  • semantic linking with other Data Products (like polysemes to help improve interoperability and references to build a semantic layer or Knowledge Graph on top of the Data Mesh)
  • data observability
  • SLA/SLO and Data Contracts (expectations, agreements, objectives)
  • Access and usage control

The Data Product Specification aims to gather together crucial pieces of information related to all these aspects, under the strict ownership of the Data Product Owner.

What we did so far

In order to fill this gap, we tried to create a Data Product Specification by starting this open-source initiative:

https://github.com/agile-lab-dev/Data-Product-Specification

The repo contains detailed field-by-field documentation; however, I’d like to point out here some features I believe to be important:

  • One point of extension and — at the same time — standardization is the expectation that some of the fields (like tags) are expressed according to the OpenMetadata specification.
  • Depending on the context, several of those fields can be considered optional: whether they are populated is supposed to be checked by specific federated governance policies.
  • Although there’s an aim for standardization, the model is expandable for future add-ons: your Data Mesh (and platform) will probably keep evolving for a long time and the Data Product Specification is supposed to do the same, along the way. You can start simple, then add pieces one by one (and the related governance policies).
  • Being in YAML format, it’s easy to parse and validate, e.g. via a JSON parsing library like circe (a minimal parsing sketch follows this list).
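To make the last point concrete, here is a minimal sketch of parsing and decoding a descriptor with circe, assuming the circe-yaml and circe-generic modules are on the classpath. The descriptor shown is a heavily simplified, hypothetical one: the actual fields and structure are the ones defined in the repository linked above.

```scala
import io.circe.generic.auto._
import io.circe.yaml.parser

// Hypothetical, heavily simplified descriptor model, for illustration only.
final case class OutputPort(name: String, description: String)
final case class DataProductDescriptor(
    name: String,
    domain: String,
    description: String,
    owner: String,
    outputPorts: List[OutputPort]
)

object DescriptorParser {
  // Parse the YAML text into circe's Json, then decode it into the case class above.
  def parse(yaml: String): Either[io.circe.Error, DataProductDescriptor] =
    parser.parse(yaml).flatMap(_.as[DataProductDescriptor])

  def main(args: Array[String]): Unit = {
    val yaml =
      """name: customer-orders
        |domain: sales
        |description: Orders placed by customers, one row per order line.
        |owner: sales-data-product-team@example.com
        |outputPorts:
        |  - name: orders_view
        |    description: Read-only SQL view over the curated orders table.
        |""".stripMargin

    // Right(DataProductDescriptor(...)) on success, Left(parsing/decoding error) otherwise.
    println(parse(yaml))
  }
}
```

Once the descriptor is decoded into a typed model like this, the same object can feed templates, governance checks, and provisioning logic.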

The Data Product Specification itself covers the main components of a Data Product:

  • general bounded context and domain metadata the Data Product belongs to (to support descriptive API on control ports)
  • clear ownership references
  • input ports, workloads, internal storage/staging areas, output ports, and logical dependencies between them
  • data observability
  • data documentation, semantic linking info, schema, usage guidelines (from serialization format and access patterns to billing models)

NOTE: what is presented here as a “conceptual” quantum can be (and, in some of our real-world implementations, is) split into its main component parts, which are then git-version-controlled in their own repositories (belonging to a common group, which is the Data Product’s one).

Key benefits

The Data Product Specification is intended to be heavily exploited by a central Data Mesh-enabling platform (as a set of components and tools) for:

  • Data Product creation boost, since templates can be created out of this specification (for single components and for whole Data Products). They can then be evolved, also hiding some complexity from end-users.
  • Federated Governance policy execution (e.g. a policy statement could be “the DP description must be at least 300 chars long, the owner must be a valid ActiveDirectory user/group, there must be at least 1 endpoint for Data Observability, the region for the S3 bucket must be US-East-1”, etc…), thus making “approved templates” compliant by design with the Data Mesh principles and policies (see the policy-check sketch after this list).
  • Change Management improvement, since specifications are version controlled and well structured, thus allowing the detection of breaking changes between different releases.
  • Self-service Infrastructure enablement, since everything a DP needs in order to be provisioned and deployed is included in a specific and well-known version of the DP specification, thus allowing a platform to break it down and trigger the automated provisioning and deployment of the necessary storage/compute/other resources.
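As an example of how such computational policies can be evaluated against a parsed descriptor, here is a minimal sketch. The descriptor fields, thresholds, and the directory-membership stub are hypothetical; a real implementation would evaluate policies defined by the federated governance body and call out to the actual directory service.

```scala
// Hypothetical, simplified descriptor; real policies run against the full specification.
final case class Descriptor(
    description: String,
    owner: String,
    observabilityEndpoints: List[String],
    s3Region: String
)

object GovernancePolicies {
  // Stub standing in for a real ActiveDirectory lookup.
  private def existsInDirectory(principal: String): Boolean =
    principal.nonEmpty

  // Each check yields a violation message when the policy is not satisfied.
  def checkAll(dp: Descriptor): List[String] = List(
    if (dp.description.length < 300) Some("description must be at least 300 characters long") else None,
    if (!existsInDirectory(dp.owner)) Some("owner must be a valid directory user/group") else None,
    if (dp.observabilityEndpoints.isEmpty) Some("at least one Data Observability endpoint is required") else None,
    if (dp.s3Region != "us-east-1") Some("the S3 bucket region must be us-east-1") else None
  ).flatten

  def main(args: Array[String]): Unit = {
    val dp = Descriptor("Too short.", "sales-dp-team", Nil, "eu-west-1")
    checkAll(dp).foreach(v => println(s"POLICY VIOLATION: $v"))
  }
}
```

The platform can then block the deployment of a Data Product whenever the returned list of violations is non-empty.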

As I said, this specification is evolving and can surely be improved. Since it’s open-source, every contribution is welcome.

Wrapping up

How can a Data Mesh journey reach scale? With automation that guarantees out-of-the-box compliance of Data Products with the Data Mesh principles.

If you’re interested in other articles on Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Thank you for reading.

Author: Roberto Coluccio, Agile Lab Data Architect

Elite Data Engineering

I think it’s time to freshen up the world’s perception of Data Engineering.

This article has the dual purpose of raising awareness and acting as a manifesto for all the people at Agile Lab along with our past, present, and future customers.

In the collective imagination, the Data Engineer is someone who processes and makes data available to potential consumers. It’s usually considered a role focused on achieving a very practical result: get data there — where and when it’s needed.

Unfortunately, this rather short-sighted view has led to big disasters in Enterprises and beyond. Do you know why?

Data Engineering is a practice, not (just) a role.

In the past, many professionals approached the “extract value from data” problem as a technological challenge, architectural at most (DWH, Data Lake, now Cloud-native SaaS-based pipelines, etc…), but what has always been missing is the awareness that Data Engineering is a practice that, as such, needs to be studied, nurtured, built, and disseminated.

At Agile Lab, we have always dissociated ourselves from this overly simplistic vision of Data Engineering, aware that the market might misunderstand us and expect that kind of “basic service” from us. We’d like to state this once again: Data Engineering is not about getting data from one place to another, executing a set of queries quickly (thanks to some technological help), or “supporting” data scientists. Data Engineering, as the word itself says, means engineering the data management practice.

When I explain our vision, I like to draw a parallel with Software Engineering: after so many years, it has entered a new era where writing code has moved way beyond design patterns, embracing governance techniques and practices such as:

  • source version control
  • testing and integration
  • monitoring and tracing
  • documentation
  • review
  • deployment and delivery
  • …and more

This evolution wasn’t driven by naïve engineers with time to waste; it happened because the industry realized that software has a massive impact, over time, on everything and everyone. Malfunctioning software with poor automation, not secure by design, hardly maintainable or evolvable, has an intrinsic cost (let’s call it Technical Debt) that is devastating for the organization that owns it (and, implicitly, for the ones using and relying on it).

To mitigate these risks, we all tried to engineer software production to make it sustainable in the long run, accepting the rise in up-front costs.

When we talk about Engineering, from a broader viewpoint, we aim precisely at this: designing detailed and challenged architectures, adopting methodologies, processes, best practices, and tools to craft reliable, maintainable, evolvable, reusable products and services that meet the needs of our society.

Would you walk over a bridge made of some wooden planks (over and over again, maybe to bring your kids to school)?

Now, I don’t understand why some people talk about Data Engineering when, instead, they refer to just creating some data or implementing ETL/ELT jobs. It’s like talking about Bridge Engineering while referring to putting some wooden planks between two pillars: here the wooden planks are the ETL/ELT processes connecting the data flows between two systems. Let me ask you: is this really how you would support an allegedly data-driven company?

I see a significant disconnection between business strategies and technical implementations: tons of tactical planning and technological solutions but with no strategic drivers.

It’s as if you wanted to overhaul a country’s road network and started building some roads or wooden bridges here and there. No, I’m sorry, it doesn’t work. You need a strategic plan, you need investments and, above all, you need to know what you are doing. You need to involve the best designers and builders to ensure that the bridges don’t collapse after ten years (we should have learned something recently). It is not something that can be improvised. You cannot throw workers at it without proper skills, methodology, and a detailed project and hope to deliver something of strategic importance.

Fortunately, there has been an excellent evolution around data practice in the last few years, but real-world data-driven initiatives are still a mess. That doesn’t surprise me because, in 2022, there are still a lot of companies that produce poor-quality software (no unit-integration-performance testing, no CI/CD, no monitoring, quality, observability, etc…). This is due to a lack of engineering culture. People are trying to optimize and save the present without caring about the future.

Since the DevOps wave arrived, the situation has improved a lot. Awareness has increased, but we are still far from a satisfactory average quality standard.

Now let’s talk about data

Data is becoming the nervous system of every company, large or small, a fundamental element to differentiate on the market in terms of service offering and automated support for business decisions.

If we genuinely believe that they are such a backbone, we must treat them “professionally”; if we really consider them a good, a service, or a product, we must engineer them. There is no such thing as a successful product that has not been properly engineered. The factory producing such products has undoubtedly been super-engineered to ensure low production costs along with no manufacturing defects and high sustainability.

Eventually, you must figure out if data for your business is that critical and behave accordingly, as usually done with everything else (successful) around us.

I can proudly say that at Agile Lab we believe data is crucial, and we assume that this is just as true for our customers.

That’s why we’ve been trying to engineer the data production and delivery process for years from a design, operation, methodology, and tooling perspective. Engineering the factory but not the product, or vice versa, definitely leads to failure in the real world. It is important to engineer the data production lifecycle and the data itself as a digital good, product, and service. Data does not have only one connotation, instead it is naturally fickle. I’d define it as a system that must stay balanced because, exactly as in quantum systems, the reality of data is not objective but depends on the observer.

Our approach has always been to consider the data as the software (and in part, the data is software), but without forgetting the greater gravity (Data Gravity — in the Clouds — Data Gravitas).

Here I see a world where top-notch Data Engineers make the difference for the success of a data-centric initiative. Still, all this is not recognized and well-identified like software development because just a few people know what a good data practice looks like.

But that’s about to change because, with the rise of Data Mesh, data teams will have to shift gears: they won’t have anyone else to blame or to take responsibility in their place, and they’ll have to work to the best of their ability to make sure their products are maintainable and evolvable over time. The various data products and their quality will be measurable, objectively. Since they will be developed by different teams and not by a single one, it will be inevitable to make comparisons: it will be straightforward to differentiate teams working with high-quality standards from those performing poorly, and we sincerely look forward to this.

The revolutionary aspect of Data Mesh is that, finally, we don’t talk about technology in Data Engineering! Too many data engineers focus on tools and frameworks, but practices, concepts, and patterns matter just as much, if not more: eventually, all technologies become obsolete, while good organizational, methodological, and engineering practices can last a long time.

Time to show the cards

So let’s get to what we think are the characteristics and practices that should be part of a good Data Engineering process.

Software Quality:

The software producing and managing data must be high-quality, maintainable, and evolvable. There’s no big difference between software implementing a data pipeline and a microservice in terms of quality standards! In terms of workflow, ours is public and open source. You can find it here.

Software design patterns and principles must also be applied to data pipelines development. Please stop scripting data pipelines (notebooks didn’t help here).

Data Observability:

Testing and monitoring must be applied to data, as well as to software, because static test cases on artifact data are just not enough. We do this by embracing the principles of DODD (Kensu | DODD). Data testing means considering the environment and the context where data lives and is observed. Remember that data offers different realities depending on when and where it’s evaluated, so it’s important to apply principles of continuous monitoring to all the lifecycle stages, from development through deployment, not only to the production environment.

When we need to debug or troubleshoot software, we always try to have the deepest possible context. IDEs and various monitoring and debugging tools help by automatically providing the execution context (stack trace, logs, variable status, etc.), letting us understand the system’s state at that point in time.

How often did you see “wrong data” in a table with no idea of how it got there? How exhausting was it to hunt for the piece of code allegedly generating such data (hoping it’s versioned and not hardcoded somewhere), try to mentally re-create the corner case that might have caused the error, search for unit tests supposedly covering that corner case, and scrape the documentation hoping to find references to that scenario… It’s frustrating, because the data seems to have just dropped from the sky, with no traces left. I think 99% of the data pipelines implemented out there keep no links between the input (data), the function (software), and the output (data), thus making troubleshooting (or, even worse, discovering an error in the first place) a nightmare.

At Agile Lab, we apply the same principles seen for software: we always try to have as much context as possible to better understand what’s happening to a specific data row or column. We try to set up lineage information (the stack trace), data profiling (the general context), and data quality checks before and after the transformations we apply (the execution context). All this information needs to be usable right when it’s needed and to be well organized.
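A minimal Spark sketch of that kind of pre/post-transformation check follows; paths, column names, and the business logic are hypothetical, and in a real setup the captured profiles would be shipped to an observability backend together with lineage information (input path, code version, run id) rather than just printed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ObservedPipeline {
  // A tiny "profile" captured before and after the transformation (execution context).
  final case class Profile(rows: Long, nullKeys: Long)

  def profile(df: DataFrame, keyColumn: String): Profile =
    Profile(df.count(), df.filter(col(keyColumn).isNull).count())

  // Business logic under observation (hypothetical).
  def transform(orders: DataFrame): DataFrame =
    orders.filter(col("amount") > 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("observed-pipeline").getOrCreate()
    val input = spark.read.parquet("s3://example-bucket/orders/") // hypothetical path

    val before = profile(input, "order_id")
    val output = transform(input)
    val after  = profile(output, "order_id")

    // In a real setup these profiles go to an observability backend, not stdout.
    println(s"before=$before after=$after droppedRows=${before.rows - after.rows}")

    output.write.mode("overwrite").parquet("s3://example-bucket/orders_curated/")
    spark.stop()
  }
}
```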

“OK you convinced me: which tool can make all of this possible ?”

Well, the reality is that the practice is about people — professionals — not tools. Of course, many tools help in implementing the practice, but without the right mindset and skills they are just useless.

“But I have analysts creating periodic reports and they can spot unexpected data values and distributions”

True, but ex-post checks don’t solve the problem of consumers having probably already ingested faulty data (remember the data gravity thing?) and they don’t provide any clues about where and why the errors came up. This is to say that it makes a huge difference to proactively anticipate checks and monitoring so as to make them synchronous with data generation (and part of the lifecycle of the applications generating the data).

Decoupling:

It has become a mantra in software systems to have well-decoupled components that are independently testable and reusable.

We use the same measures with different facets in data platforms and data pipelines.

A data pipeline must:

  • Have a single responsibility and a single owner of its business logic.
  • Be self-consistent in terms of scheduling and execution environment.
  • Be decoupled from the underlying storage system.
  • Keep I/O actions decoupled from business (data transformation) logic, so as to facilitate testability and architectural evolution (see the sketch after this list).
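Here is a minimal Spark sketch of that last point, with the business logic expressed as a pure DataFrame-to-DataFrame function and I/O confined to the edges; dataset paths and the transformation itself are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DecoupledPipeline {
  // Business logic: a pure function from DataFrames to a DataFrame.
  // It knows nothing about where data comes from or goes to, so it can be
  // unit-tested with in-memory DataFrames and reused across storage systems.
  def enrichOrders(orders: DataFrame, customers: DataFrame): DataFrame =
    orders
      .filter(col("status") === "COMPLETED")
      .join(customers, Seq("customer_id"), "left")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("decoupled-pipeline").getOrCreate()

    // I/O lives only at the edges; swapping Parquet for a JDBC source or a
    // different bucket does not touch the business logic above.
    val orders    = spark.read.parquet("s3://example-bucket/orders/")    // hypothetical paths
    val customers = spark.read.parquet("s3://example-bucket/customers/")

    enrichOrders(orders, customers)
      .write.mode("overwrite").parquet("s3://example-bucket/orders_enriched/")

    spark.stop()
  }
}
```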

Also, it is essential to have well-decoupled layers at the platform level, especially storage and computing.

Reproducibility:

We need to be able to re-run data pipelines in case of crashes, bugs, and other critical situations, without generating side effects. In the software world, corner case management is fundamental, just as it is (or even more so) in the data world (because of gravity).

We design all data pipelines based on the principles of immutability and idempotence. Regardless of the technology, you should always be in a position to (a minimal sketch follows this list):

  • Re-execute a pipeline gone wrong.
  • Re-execute a hotfixed/patched pipeline to overwrite the last run, even if it was successful.
  • Restore a previous version of the data.
  • Run a pipeline in recovery mode if you need to clean up an inconsistent situation. The pipeline itself holds the cleaning logic, not another job.
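One common way to get the first two properties is to make each run’s output deterministic with respect to an explicit run date and overwrite only that partition, so re-executing a failed or hotfixed run replaces the previous output instead of appending duplicates. A minimal Spark sketch follows, with hypothetical paths, assuming dynamic partition overwrite is available in your Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object IdempotentDailyBatch {
  def main(args: Array[String]): Unit = {
    // The run is parameterized by an explicit date, never by "now()".
    val runDate = args.headOption.getOrElse("2022-01-01")

    val spark = SparkSession.builder().appName("idempotent-daily-batch").getOrCreate()
    // Only the partitions present in this run's output are replaced; older ones stay untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    val input = spark.read.parquet(s"s3://example-bucket/raw/orders/dt=$runDate") // hypothetical path

    input
      .withColumn("dt", lit(runDate))
      .write
      .mode("overwrite")          // re-running the same date overwrites, it never appends
      .partitionBy("dt")
      .parquet("s3://example-bucket/curated/orders/")

    spark.stop()
  }
}
```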

So far, we have talked about the data itself, but an even more critical aspect is the reproducibility of the underlying environments’ and platforms’ status. Being able to (re)create an execution environment with no or minimal “manual” effort can really make the difference for on-the-fly tests, no-regression tests on large amounts of data, environment migrations, platform upgrades, and switching on/simulating disaster recovery. It’s an aspect to take care of, and it can be addressed through IaC and architectural blueprint methodologies.

Data Versioning:

The observability and reproducibility issues are rooted in the ability to move nimbly across data from different environments as well as across different versions in the same environment (e.g., rolling back a pipeline that terminated with inconsistent data).

Data, just like software, must be versioned and portable with minimum effort from one environment to another.

Many tools can help (such as LakeFS), but when they are not available it’s still possible to set up a minimum set of features (a minimal sketch follows this list) allowing you to efficiently:

  • Run a pipeline over an old version of data (e.g., for repairing or testing purposes).
  • Make an old version of the data available in another environment.
  • Restore an old version of the data to replace the current version.
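When no dedicated tool is available, even plain versioned directories plus a tiny “current version” pointer already cover the three capabilities above. The layout and paths below are hypothetical and use a local filesystem purely for illustration:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths, StandardOpenOption}

// Hypothetical layout, for illustration only:
//   /data/orders/v=1/...   /data/orders/v=2/...   /data/orders/_CURRENT contains "2"
object PoorMansDataVersioning {
  private def pointer(dataset: Path): Path = dataset.resolve("_CURRENT")

  def currentVersion(dataset: Path): Int =
    new String(Files.readAllBytes(pointer(dataset)), StandardCharsets.UTF_8).trim.toInt

  // Resolve the directory a pipeline should read: the current version by default, or any past one.
  def versionPath(dataset: Path, version: Int): Path = dataset.resolve(s"v=$version")

  // "Restore" just moves the pointer back; no data is rewritten.
  def restore(dataset: Path, version: Int): Unit =
    Files.write(pointer(dataset), version.toString.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)

  def main(args: Array[String]): Unit = {
    val dataset = Paths.get("/data/orders")                  // hypothetical dataset root
    val old     = versionPath(dataset, currentVersion(dataset) - 1)
    println(s"re-run the pipeline against: $old")            // e.g. feed this path to the read step
    restore(dataset, currentVersion(dataset) - 1)            // roll back by repointing _CURRENT
  }
}
```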

It is critical to design these mechanisms from day zero because they can also impact other pillars.

Vulnerability Assessment:

It is part of an engineer’s duties to take care of security in general. In this case, data security is becoming a more and more important topic and must be approached with a by-design mindset.

It’s essential to look for vulnerabilities in your data and to do it in an automated and structured way, so as to discover sensitive data before it accidentally goes live.

This topic is very important at Agile Lab, so much so that we have even developed proprietary tools to promptly find out if sensitive information is present in the data we are managing. These checks are continuously active, just like the Observability checks.

Data Security at rest:

Fine-grained authentication and authorization mechanisms must exist to prevent unwanted access to data, or even just portions of it (see RBAC, ABAC, column and row filters, etc…).

Data must be protected, and processing pipelines are just like any other consumer/producer, so policies must be applied to storage and to (application) users; that’s also why decoupling is so important.

Sometimes, due to strict regulation and compliance policies, it’s not possible to access fine-grained datasets, although aggregates or statistical calculations might still be needed. In this scenario, Privacy-Preserving Data Mining (PPDM) comes in handy. Any well-engineered data pipeline should compute its KPIs with the lowest possible privileges to avoid leaks and privacy violations (auditing arrives suddenly and when you don’t expect it). Have fun the moment you need to ex-post troubleshoot a pipeline that embeds PPDM, which basically gives you no data context to start from!

Data Issues:

The Data Observability chapter provides the capabilities to perform root cause analysis, but how do we document the issues? How do we behave when a data incident occurs? The last thing you wanna do is “patch the data”; what you need to do is fix the pipelines (applications), which will then fix the data.

First, data issues should be reported just like any other software bug. They must be documented accurately, associating them with all the contextual information extracted from data observability, to clarify what was expected, what happened instead, what tests had been prepared, and why they did not intercept the problem.

It is vital to understand the corrective actions at the application level (they could drive automatic remediation) as well as the additional tests that need to be implemented to cover the issue.

Another aspect to consider is the ability to associate data issues with the specific versions of both software and data (data versioning) responsible for the issue because, sometimes, it’s necessary to put workarounds in place to restore business functionality first. At the same time, while data and software evolve independently, we must still be able to reproduce the problem and fix it.

Finally, to avoid repeating the same mistakes, it’s crucial to conduct a post-mortem with the team and ensure its outcomes are saved in a wiki-like space or somehow associated with the issue, searchable and discoverable.

Data Review:

Just as code reviews are a good practice, a good data team should perform and document data reviews. These are carried out at different development stages and cover multiple aspects such as:

  • Documentation completeness
  • Semantic and syntactic correctness
  • Security and compliance
  • Data completeness
  • Performance / Scalability
  • Reproducibility and related principles
  • Test case analysis and monitoring (data observability)

We encourage adopting an integrated approach across code base, data versioning, and data issues.

Documentation and metadata:

Documentation and metadata should follow the same lifecycle as code and data. The computational governance policies of the data platform should prevent new data initiatives from being brought into production if they are not paired with the proper level of documentation and metadata. In fact, in the Data Mesh this aspect will become a foundation for having good data products that are genuinely self-describing and understandable.

Ex-post metadata creation and publishing can introduce built-in technical debt due to the potential misalignment with the actual data already available to consumers. It’s good practice, then, to make sure that metadata resides in the same repositories as the pipeline code generating the data, which implies synchronization between code and metadata and ownership by the very same team. For this reason, metadata authoring shouldn’t take place in Data Catalogs, which are supposed to help consumers find the information they need. Embedding a metadata IDE to cover malpractices in metadata management won’t help in cultivating this culture.

Performance:

Performance and scalability are two engineering factors that are critical in every data-driven environment. In complex architectures, there can be hundreds of aspects with an impact on scalability, and good data engineers should master them as foundational design principles.

Once again, I want to highlight the importance of a methodological approach. There are several best practices to encourage:

  • Always isolate the various read, compute, and write phases by running dedicated tests that will point out specific bottlenecks, thus allowing better-targeted optimization: when you’re asked to reduce the execution time of a pipeline by 10%, it must be clear which part of the process is worth working on. A warning here: isolating the various phases brings up the theme of decoupling and single responsibility. If your software design doesn’t respect these principles, it won’t be modular enough to allow clear isolation of the steps. For example, Spark has a lazy execution model where it is very hard to discern read times from transformation and write times if you are not prepared to do so (see the Spark sketch after this list).
  • Always run several tests on each phase with different (orders of magnitude of) data volumes and computing resources to extract scaling curves that will allow you to make reasonable projections when the reference scenario changes: it’s always unpleasant to find out that input data is doubling in a week and to have no idea how many resources will be needed to scale up. The worst-case scenario to performance-test pipelines against should be designed with the long-term data strategy in mind.
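To illustrate the Spark point above, here is a minimal profiling sketch that forces materialization between phases so read, transform, and write times can be attributed separately. Paths and the transformation are hypothetical, and since caching changes the execution plan this is a technique for dedicated profiling runs, not for production jobs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PhaseTimings {
  // Time a block; the block itself is responsible for forcing materialization.
  def timed[A](label: String)(block: => A): A = {
    val start  = System.nanoTime()
    val result = block
    println(f"$label%-10s ${(System.nanoTime() - start) / 1e9}%.1f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phase-timings").getOrCreate()

    // READ: cache + count forces the scan to actually happen here, not later.
    val input = timed("read") {
      val df = spark.read.parquet("s3://example-bucket/orders/") // hypothetical path
      df.cache(); df.count(); df
    }

    // TRANSFORM: force materialization again so the time is attributable to this phase.
    val transformed = timed("transform") {
      val df = input.filter(col("amount") > 0).groupBy("customer_id").count()
      df.cache(); df.count(); df
    }

    // WRITE: what remains is (mostly) the write path.
    timed("write") {
      transformed.write.mode("overwrite").parquet("s3://example-bucket/orders_by_customer/")
    }

    spark.stop()
  }
}
```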

Wrapping up

If you’ve come this far, you’ve understood what Data Engineering means to us.

Elite Data Engineering does not focus only on data. Still, it guarantees that the data will be of high quality, that it will be easy to find and correct errors, that it will be simple and inexpensive to maintain and evolve, and that it will be safe and straightforward to consume and understand.

It’s about practice, data culture, design patterns, and a broader sensitivity across many more aspects than we might be used to when “just dealing with software”. At Agile Lab, we invest every single day in continuous internal training and mentoring on this, because we believe that Elite Data Engineering will be the only success factor of every data-driven initiative.

If you’re interested in other articles on Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

 

10 practical tips to reduce Data Mesh’s adoption roadblocks and improve domains’ engagement

Photo by Roberto Coluccio 

Hi! If you got here, it’s probably because:

A) you know what Data Mesh is;

B) you are about to start (or already started) a journey towards Data Mesh;

C) you and your managers are sure that Data Mesh is the way to sky-rocket the data-driven value for your business/company; but

D) you are among the few who have bought into the whole thing, while around you there is skepticism and worry.

The good news is: you’re DEFINITELY not alone! 

At Agile Lab we’re driving Data Mesh journeys for several customers of ours and, every time a new project kicks off, this roadblock is always one of the first to tackle.

Hoping to help out anyone just approaching the problem and/or looking for some tips, here are 10 approaches we have found to be somewhat successful in real Data Mesh projects. We’re gonna group them by “area of worry”.

BUDGET 💸

Photo by Karolina Grabowska from Pexels

Well, you know, in the IT world every decision is usually based on the price VS benefit trade-off.

Domains are expected to start developing Data Products as soon as possible and we need them on board, but usually “there’s no budget for new developments”, or “there’s no budget to enlarge the cluster size”. There you go:

1. Compute resources to consume data designed as part of consumers’ architecture

This is a general architectural decision rather than a tip: you should design your output ports (and their consumption model) so as to provide shared, governed access to potential consumers. With simple ACLs over your resources, you can allow secure read-only access to your Data Products’ output ports so that no compute resources are necessary at the (source) domain/Data Product level in order to consume that data; all the compute power is required (AND BILLED) on the consumers’ side.

OK, so let’s say we figured out a way to avoid (reduce?) the TCO of our Data Products. But what about the budget to develop them?

2. Design a roadmap of data products creation that makes use of available operational systems maintenance budget streams

In complex organizations there are usually several budget streams flowing for operational systems maintenance: leverage them to create Source Aligned Data Products (which are naturally part of the domain, so nobody will argue)!

Very well, we’re probably now getting some more traction, but why should we maintain long-term ownership over this stuff?

3. Pay-as-you-consume billing model

I know, this is bold and, so far, still in the utopia sphere. But there’s a key point to catch: if you figure out a way to “give something back” to Data Products’ creators/owners/maintainers, be it extra budget for every 5 consumers of an output port, or consumers contributing to the maintenance budget… anything can lift the spirit and spread the “this is not a give-only thing” mood. If you get this, you win.

But let’s say I’m the coordinator of the Data Mesh journey and I received a budget just for a PoC… well, then:

4. Start small, with the most impactful yet least “operationally business-critical” consumer-driven data opportunity

No big bang, don’t make too much noise (yet), pick the right PoC use case and make it your success story to promote the adoption company-wide, without risking too much.

This will probably make the C-level managers happy. The same ones who are repeatedly asking for the budget to migrate (at least) the analytical workloads to the Cloud, without receiving it “because we’re still amortizing the expense for that on-prem system”. Well, here’s the catch: if they want you to build a Data Mesh, you’re gonna need an underlying stack that allows you to develop the “Self-Serve Infrastructure-as-a-Platform”. So here it is:

5. Leverage the Data Mesh journey as an opportunity to migrate to the Cloud

Your domain teams will probably be happy to fast-forward to the present day technology-wise and work on Cloud-based solutions.

 

This last tip brings to light the other “area of worry”:

THE UNKNOWN ⚠

Photo by Jens Johnsson from Pexels

Data Mesh is an organizational shift that involves processes and, most of all, people.

People are usually scared of what they don’t know, especially in the work environment, where they might be “THE reference” in everybody’s eyes for system XY, and they’d feel they were losing power and control by “simply becoming part of a decentralized domain’s Data Product team”.

First of all, you’re gonna need to:

6. Train people

Training and mentoring key (influencer?) people across the domains is a proven way of making people UNDERSTAND WHY the company decided to embrace this journey. Data Mesh experts should mentor these focal points so that they, in turn, can spread the positive word within their teams.

But teams are probably scared too of being forced to learn new stuff, so:

7. Make people with a “too legacy” technical background understand this is the next big wave in the data world

Do they really wanna miss the train? Changes like this one might come once every 10 or 20 years in big organizations: don’t waste the opportunity and get out of that comfort zone!

Nice story, so are you telling me we have organizational inefficiencies and we don’t make smart use of our data to drive or automate business decisions? Well, my dear, it’s not true: MY DOMAIN publishes data that I’ve been told (by a business request) should be used somewhere.

I hear you, but understand that your view might be limited. There are plenty of others complaining about change management issues, dependency issues, slowness in pushing evolutions and innovations, lack of ownership and quality, etc.

8. Organize collaborative workshops where business people meet technical people and share pain points from their perspectives

You will be surprised by the amount of “hidden technical debt” that will come out, and people will start empathizing with each other: this will start spreading the “my data can bring REAL VALUE” feeling (because real people are complaining in front of them) across the organization.

Does this end once the Data Products have been released? Absolutely not! We talked about «long-term ownership» so we need to provide «long-term motivation»:

9. Collect clear insights about the value produced with data and make it accessible to everybody within the organization (through the platform)

This can start as simple as «Number of consumers of a Data Product» (what if it’s zero?) and evolve into more complex metrics like time-to-market (the time to bring a new Data Product into production, which is expected to decrease), the network effect (the more interconnections across Domains, the more value is probably added downstream), and other metrics more specific and valuable to your organization.

You convinced me. I will develop Data Products. Where do I start?

10. Organize collaborative workshops to identify data opportunities

Following the DDD (Domain-Driven Design) practice that the first pillar of Data Mesh, Domain-oriented Decentralized Ownership, is based on, and combining it with the Data-as-a-Product one, we get a business-decision-driven design model to identify data opportunities (i.e. Data Products to be created in order to support/automate data-driven business decisions).

 

On this topic, you might be interested in the Data Product Flow we designed at Agile Lab, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Roberto Coluccio, Agile Lab Data Architect

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

Customer 360 and Data Mesh: friends or enemies?

Raise your hand if you saw, were asked to design, tried to implement, or struggled with the “Customer 360” view/concept in the last 5+ years…

Come on, don’t be shy …

How many Customer 360 stories did you see succeed? Me? Just… a few, let’s say. This made me ask: why? Why put so much effort into creating a monolithic 360 view, thus creating a new maintenance-evolution-ownership nightmare silo? Some answers might fit here, in the context of centralized architectures, but recently a new antagonist came to town: Data Mesh.

“Oh no, another article about the Data Mesh pillars!”

No, I’m not gonna spam the web with yet-another-article about the Data Mesh principles and pillars, there’s plenty out there, like this, this, or those, and if you got here it’s because you probably already know what we’re talking about.

The question that arises is:

How does the (so struggling or never really achieved) Customer 360 view fit into the Data Mesh paradigm?

Well, in this article I’ll try to come up with some options, coming from real Agile Lab customers approaching the “Data Mesh journey”.

Domain-oriented decentralized ownership, a fascinating principle inherited (in terms of effectiveness) from the microservices world, brings to light the necessity of keeping tech and business knowledge together, within specific bounded contexts, so as to improve the autonomy and velocity of the Data Product lifecycle along with smoother change management. Quoting Zhamak Dehghani:

Eric Evans’s book Domain-Driven Design has deeply influenced modern architectural thinking, and consequently the organizational modeling. It has influenced the microservices architecture by decomposing the systems into distributed services built around business domain capabilities. It has fundamentally changed how the teams form, so that a team can independently and autonomously own a domain capability.

Examples of well-known domains are Marketing, Sales, Accounting, Product X, or Sub-company Y (probably with subdomains). All of these domains probably have something in common, right? And here we connect with the prologue: the customer. We all agree on the importance of this view, but do we really need it to be materialized as a monolithic entity of some sort, on a centralized system?

I won’t answer this question, but I’ll make another one:

Which domain would a customer 360 view be part of?

Remember: decentralization and domain-driven design imply having clear ownership which, in the Data Mesh context, means:

  • decoupling from the change management process of other domains
  • owning the persistence (hence: storage) of the data
  • providing valuable-on-their-own views at the Data Products’ output ports (is a holistic materialized 360 customer view really valuable-on-its-own?) — this is part of the “Data as a Product” pillar, to be precise
  • guaranteeing no breaking changes to domains that depend on the data we own (i.e. owning the full data lifecycle)
  • (last but not least) having solid knowledge (and ownership) of the business logic orbiting around the owned data. Examples are: knowing the mapping logic between primary and foreign keys (in relational terms), knowing the meaning of every bit of information that is part of that data, and knowing which processes might influence such data’s lifecycle

If you can’t answer the above question right now, DON’T WORRY: you’re not alone! It just means you too are starting to feel the friction between the decentralized ownership model and the centralized customer 360 view.

IMHO, they are irreconcilable. Here’s why, with respect to the previous points:

  • centralized ownership over all the possible customer-related data would couple the change management processes of all the domains: a 360 view would be the only point-of-access for customer data, thus requiring strong alignment across the data sources feeding the customer 360 (which usually would end up slooooowing down every innovation or evolution initiative). Legit question: “So, this means we should never join together data coming from different domains?” — Tough but realistic answer: “You should do that only if it makes sense, if it produces clear value, if it doesn’t require further interpretation or effort on the consumers’ side to grasp such value”.
  • If you bind together data coming from different domains, it must be to create a valuable-on-its-own data set. The mere fact that you read, store and expose data would make you the owner of it; that’s why a persisted customer 360 view would make sense only if you actually owned the whole dataset, not just a piece of it. Furthermore, Data Products require immutable and bi-temporal data at their output ports: can you imagine the effort of delivering such features over such a huge data asset?
  • valuable-on-its-own Data Products, well, I hope the previous 2 points already clarified this point 😊
  • If we sort out all the previous issues and actually become the owner of a persisted, materialized customer 360 view, while still being compliant with the Data Mesh core concepts, we should never push breaking changes towards our data consumers. If we need to make a breaking change, we must guarantee sustainable migration plans. Data versioning (like LakeFS or Project Nessie) could come in handy in this case, but there might be several organizational and/or technical ways to do so. Having many dependencies (like we’d have owning a customer 360 view) would make it VERY hard to always have a smooth data lifecycle and avoid breaking changes, since we could find ourselves simply overwhelmed by breaking changes made on the source operational systems’ side.
  • The whole Data Mesh idea emerged as a decentralization need, after seeing so many “central data engineering / IT teams” fall under the pressure of the whole organization, which was relying on them for quality data. It was (and is) a bottleneck, in many cases worsened by the fact that these centralized tech teams simply couldn’t have sufficiently strong business knowledge of every possible domain (while unfortunately developing and managing all the ETLs/ELTs of the central data platform). For this very same reason, the Customer 360 team would be required to master the whole business logic around such data, i.e. be THE acknowledged experts of basically every domain (having customer-related data) — which is just impossible.
A Customer 360 logical view, with domains owning only a slice of it

OK, I’m done with the bad news 😇 Let’s start with the good ones!

Customer 360 view and Data Mesh can be friends

As long as it remains a holistic logical/business concept. Decentralized domains should own the data related to their bounded contexts, even when it refers to the customer. Domains should own a slice of the 360 view, and slices should be correlated together to create valuable-on-their-own Data Products (e.g. just a join between sales and clickstreams doesn’t provide any added value, but a behavioral pattern on the website combined with the number of purchases per customer with specific browsing behaviors does).

In order to achieve that, domains and – more in general – data consumers must be capable of correlating customer-related data across different domains with ease. They must be facilitated at the joining-logic level, which means they should NOT be required to know the mapping between domain-specific keys, surrogate keys, and customer-related keys. I’ll try to expand on this last element.

Globally Unique Identifiable Customer Key

According to the Data Mesh literature, data consumers shouldn’t always copy (ingest in the first place) data in order to do something with it, since storing data means having ownership over it. As a technical step justified by volumes or other constraints for a single internal ETL step, that’s ok, but Consumer Aligned Data Products shouldn’t just pull data from another Data Product’s output port, perform a join or append another column to the original dataset, and publish at their output port an enhanced projection of some other domain’s data, for many reasons, but I won’t digress on this (again, it’s part of the Data as a Product pillar). What data consumers need is a well-documented, globally identifiable, and unique key/reference to the customer, in order to perform all the possible correlations across different domains’ data. The literature would also call them polysemes.

Note: well-documented should become a computational policy (remember the Federated Governance pillar of Data Mesh?) requiring, for example, that in the Data Product’s descriptor (want to contribute to our standardization proposal?) a specific set of metadata describes where the field containing such a key comes from, which domain generated it, and what business concepts it points to. Maybe that could also be done leveraging a syntax or language that facilitates the automated creation (thank you, Self-Serve Infrastructure-as-a-Platform) of a Knowledge Graph afterward.
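To give a feel for what this buys on the consumer side, here is a minimal Spark sketch of a Consumer Aligned Data Product correlating two domains purely through the global customer key; the output port locations, column names, and the aggregation are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum}

object CustomerBehaviourCADP {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("customer-behaviour-cadp").getOrCreate()

    // Output ports of two Source Aligned Data Products, both already carrying the
    // globally unique customer key (CID) as customer_id. Locations are hypothetical.
    val sales       = spark.read.parquet("s3://sales-dp/output-ports/orders/")
    val clickstream = spark.read.parquet("s3://web-dp/output-ports/page_views/")

    // The consumer correlates the two domains with a plain join on the CID:
    // no reverse lookup, no knowledge of domain-specific surrogate keys.
    val behaviour = clickstream
      .groupBy("customer_id").agg(count("*").as("page_views"))
      .join(
        sales.groupBy("customer_id").agg(count("*").as("orders"), sum("amount").as("total_spent")),
        Seq("customer_id"),
        "inner")

    behaviour.write.mode("overwrite").parquet("s3://behaviour-dp/output-ports/customer_behaviour/")
    spark.stop()
  }
}
```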

OK, the concepts of polysemes and a globally unique, identifiable customer-related key are not new to you, and they shouldn’t be; but what I want to shine a light on is how to combine them with the Data Mesh paradigm, especially because several architectural patterns are technically possible but just a few (one?) of them can guarantee long-term scalability and compliance with the Data Mesh pillars.

An MDM is still a good ally

The question you might have asked yourself at this point is:

Who has the ownership of issuing such a key?

The answer probably lies in our good old friend, the Master Data Management (MDM) system, where the customer’s golden record can be generated. There’s a lot of literature (example) on that concept, also because it’s definitely not new in the industry (and that’s why a lot of companies approaching Data Mesh are struggling to understand how to make it fit in the picture, since it can’t just be thrown away after all the effort spent building it up).

Note: eventually, such a transactional/operational system might also lead to developing a related Source Aligned Data Product, but it should just be considered as a source of the customers’ registry information.

OK, but who should then be responsible for performing the reverse lookup, i.e. mapping a domain-specific record (key) to the related customer (golden record) key?

We narrowed the spectrum down to 3 possible approaches. To facilitate the reading of what follows, I’d like to point out a few things first:

  • SADP = Source Aligned Data Product
  • CADP = Consumer Aligned Data Product
  • CID = Customer MDM unique globally identifiable key
  • K = example of a domain-specific primary key
Approach 1: The Ugly
Approach 2: The Bad
Approach 3: The Good

The latter approach is what could make a customer 360 view shine again! This way, domains — and in particular the operational systems’ teams — preserve ownership of the mapping logic between the operational system’s data key and the CID; in the analytical plane, no interactions with the operational plane are required to create Source Aligned Data Products other than pulling data from the source systems; and on the Consumer Aligned Data Products’ side, no particular effort is needed to correlate customer-related data coming from different domains.

Disclaimer 1: approach 3 requires strong Near-Real-Time alignment between the Customer MDM system and the other operational systems, since a misalignment could mean publishing data from the various domains that is not yet attached to a newly available CID (many MDM systems run their match-and-merge logic in batch, and many operational systems might set the CID field with batch reverse lookups well after the domain data is generated).

Disclaimer 2: a fourth approach could be possible, i.e. having microservices issue global IDs in NRT, which then become the master key in all the domains holding customer-related data. This is usually achievable in custom implementations only.

Wrapping up

In the Data Mesh world, domain-driven decentralized ownership over data is a must. The centralized, monolithic Customer 360 approach doesn’t fit the picture, so a shift is required in order to maintain what actually matters the most: the business principle of customer-centric insights, derived from the correlation of data taken from different domains (owners) via a well-documented and well-defined globally unique identifier, probably generated at the Customer MDM level and integrated “as far left as possible” into the operational systems, so as to reduce the burden on the Data Mesh consumers’ side of knowing and applying the mapping logic between domain-specific keys and the Customer MDM key.

. . . 

What has been presented here has been discussed with several customers, and it seems to be the best option so far. We hope the hype around Data Mesh will bring more options in the next few years, and we will keep our eyes open; but if you’ve already put a different approach in place, I’d be glad to hear more about it.

 

If you’re interested in other articles on the Data Mesh topics, you can find them on our blog, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Roberto Coluccio, Agile Lab Data Architect

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

How and why Data Mesh is shaping the data management’s evolution


What is Data Mesh?

Data Mesh is not a technology, it is a practice… a really complex one. It is a new way to look at data, data engineering, and all the processes involved with them, i.e. Data Governance, Data Quality, auditing, Data Warehousing, and Machine Learning. It completely changes the perspective on how we look at data inside a company: we are used to looking at data from a technical standpoint (how to manage it, how to process it, which ETL tool, which storage… and so on), while it’s time to look at it from a business perspective and deal with it as a real asset, let’s say as a product. Therefore, in the Data Mesh practice we need to reverse our way of thinking about data: we typically see data as a way to extract value for our company, but here we need to understand that data IS THE VALUE.

Why is it relevant? What problem is it going to solve?

If in the past years you invested a lot in Data Lake technologies, I have a piece of bad news and a piece of good news.

The good news is: you can now understand what the problems are and why Data Mesh is going to solve them.

The bad news is: you’ll probably need to review your data management strategy if you don’t want to fall behind your competitors.

The main problem is related to how the Data Lake practice was defined. At the first stage (I’m talking about the period from 2015 to 2018, at least in the Italian market) it was not a practice, it was a technology: specifically, Hadoop was the data lake.

It was very hard to manage (mainly on-premise environments), and skills in such technologies were lacking. And… there was no practice defined. Think about the person who coined the “data lake” term, James Dixon, who was the CTO of Pentaho: Pentaho is an ETL tool, so it’s easy to understand that the focus was on how to store and transform huge quantities of data, not on the process of value extraction or value discovery.

From an organizational perspective, it seemed a good option to set up a data lake team in charge of centrally managing the technical complexity of the data lake and, when things became more complicated, the same team was put in charge of managing the different layers (cleansed layer, standardized layer, harmonization layer, serving layer)… all technical ones, obviously.

Then, in most cases, this highly specialized team ended up managing all the requests coming from the business, which was in charge of gaining insights from data, yet it had no knowledge of the data domains because it had always focused on technicalities and still relied on the operational system teams for functional and domain information about the data it was managing. This represents a technical and organizational bottleneck, and it is why a lot of enterprises are struggling to extract value from data: the implementation of a single KPI generates a change management storm across all the technical layers of the data lake, and it requires involving people from the operational systems who are not engaged in or interested in such a process… it’s not their duty. This is a movie that is not going to end well, because there is no alignment in the purpose of all the actors (also because the organization itself is not oriented to the business purpose).

Why is it not super popular then? How is it going to solve the problems you mentioned before?

Data Mesh puts data at the center. As I said, data is the value, data is the product… so the greatest effort is spent letting people CONSUME data instead of processing and managing it.

The core idea is that we will have only two stakeholders: those who produce data (let’s say, sell it) and those who consume data (buy it).

And we all know that in a real-world the customer is ruling the market, so let’s think about which are the needs of data consumers:

1. they want to be able to consume all the data of the company in a self-service way. Amazon-like, you browse and you buy, so basically you are consuming products from a marketplace.

2. they don’t want to deal with technical integration every time they need a new data product.

3. they don’t want to spend time trying to understand which are the characteristics of the data, e.g. if you are willing to buy something on Amazon, are you going to call the seller to understand better how the product looks like? NO! The product is self-describing, you must be able to get all the information you need to buy the product without any kind of interaction with the seller. That’s the key, because any kind of interaction is tightly coupled, sometimes you need to find a slot in the agenda, wait for documentation, and so on, this requires time thus destroying the data consumer time to market !!!

4. they want to be able to trust the data they are consuming. Checking the quality of the product is a problem for who is selling, not buying it. If you buy a car, you don’t need to check all the components, there is a quality assurance process on the seller side. Instead, I saw many and many times performing quality assurance in the Data Lake after ingesting data from the operational system, doing the same in the DWH after ingesting data from the lake, and again in BI tool maybe….after ingestion from DWH.

Data Mesh is simply enabling this kind of vision around data in a company. Simple concept, but really hard to build in practice: that’s why it’s still not mainstream ( but it is coming ). It has been coined in 2019 and we are still in an experimental phase with people exchanging opinions and trying it. What is truly game-changing for me is that we are not talking about technology, we are talking about practice!

What is the maturity grade of this practice?

Thoughtworks made an excellent job of explaining the concepts and evangelizing around them but, at the moment, there is no reference physical architecture.

I think that we, as Agile Lab, are doing an excellent job in going from theory to practice and we are already in production with our first data mesh architecture on a large enterprise.

Anyway, the level of awareness around these concepts is still low, in the community we are discussing data product standardization and other improvements that we need to take into account to make this pattern easier to implement. At the moment, it’s still too abstract and it requires a huge experience on data platforms and data processes to convert the concept into a real “Data Mesh platform”.

As happened with microservices and operational systems (Data Mesh has a lot of assonances with microservices), so far the new practice seems elegant from a conceptual standpoint but hard to be implemented. It will require time and effort but it will be worth it.

What is the first step to implement it?

My suggestion is to start identifying one business domain and try to apply the concepts only within these boundaries. To choose the right one, keep in consideration the ratio between business impact and complexity, which should be the highest. Reduce the number of technologies needed for the use case as much as you can (because they have to be provided self-service, mandatory).

In addition, try to build your first real cross-functional team, because the mesh revolution is also going through the organization and you need to be ready for that.

How much time is required for a large organization to fully embrace data mesh?

I think that a large enterprise with a huge legacy should estimate, at least, 3 years of journey to really comprehend and see benefits from it.

What is the hardest challenge in implementing a data mesh?

The biggest one is the paradigm shift that is needed and this is not easy for people working with data for a long time.

The second one is the application of evolutionary architecture principles to data architectures, an unexplored territory…it is not immediate, but if you think about data management systems, the architecture doesn’t change that much until is obsolete and you need a complete re-write.

In the Data Mesh, instead, we try to focus on producer/consumer interfaces and not on the data life cycle. This means that, until we keep this contract safe, we can evolve our architecture under the hood, one piece at a time and with no constraints.

Another challenge is the cultural change and the ways you can drive it in huge organizations.

Is there any way to speed up the process?

From a technical standpoint, we think that the key is to reach a community-driven standardization and interoperability of the data product as architectural quantum. Because after this it will be possible to build an ecosystem of products that will speed-up the process.

Anyway, the key components to create a data mesh platform are :

  • Data Product Templates: to spin-up all the components of a DataProduct, with built-in best practices
  • Data Product Provisioning Service: all the DataProduct deployments should go through a unique provisioning service in charge to apply computational policies, create the underlying infrastructure and feed the marketplace
  • Data Product Marketplace: a place where is easy to search and discover Data Products, not just data.

 

On this topic, you might be interested in the Data Product Flow we designed at Agile Lab, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

How to identify Data Products? Welcome “Data Product Flow”

How to identify Data Products? Welcome “Data Product Flow”

 

We all know that Data Mesh is inspired by DDD principles (by Eric Evans), transposing them from the operational plane (microservice architecture) to the analytical one (data mesh).

Since one year and a half, we have been helping many big enterprises to adopt the data mesh paradigm, and it’s always the same old story. The concept itself is really powerful and everybody can catch the potential from the early days because it addresses real problems, so it is not possible to ignore it.

Once you have the buy-in of high-level concepts and principles, it is time to draft the platform capabilities. This step is game-changing for many companies and is often becoming challenging because it revolutionizes processes and tech stacks.

But the most complex challenge is another one; it is something that blows people’s minds away.

What is a Data Product? I mean, everyone understands the principles behind that, but when it comes to defining it physically… is it a table? Is it a namespace? How do I map it with my current DWH? Can I convert my Data Lake to Data Products? These are some of the recurring questions… and the answer is always “no” or “it depends”.

When we start to introduce concepts like bounded context and other DDD elements, most of the time is getting even harder because they are abstract concepts and people involved in Data Management are not familiar with them. We are not talking with software experts; DDD until now has been used to model software, online applications that need to replicate and digitalize business processes. Data Management people were detached from this cultural shift; they typically reason around tables, entities, and modeling techniques that are not business oriented: 3NF, Dimensional modeling, Data Vault, Snowflake model… all of them are trying to rationalize the problem from a technical standpoint.

So after a while, we arrive at the final question: How do we identify Data Products?

For DDD experts, the answer could seem relatively easy…but it is not !!!

Before to deep dive into our method to do that, let’s define an essential glossary about DDD and Data Mesh (coming from various authors):

Domain and Bounded Context (DDD): Domains are the areas where knowledge, behaviour, laws and activities come together. They are the areas where we see semantic coupling and behavioural dependencies. It existed before us and will exist after us; it is independent by our awareness.

Each domain has a bounded context that defines the logical boundaries of a domain’s solution, so bounded contexts are technical by nature and tangible. Such boundaries must be clear to all people. Each bounded context has its ubiquitous language (definitions, vocabulary, terminology people in that area currently use). The assumption is that the same information can have different semantics, meanings and attributes based on the evaluation context.

Entity (DDD): Objects that have a distinct identity running through time and different representations. You also hear these called «reference objects».

Aggregate (DDD): It is a cluster of domain objects or entities related to each other through an aggregate root and can be treated as a single unit. An example can be an order and its line items or a customer and its addresses. These will be separate objects, but it’s useful to treat the order ( together with its line items ) as a single aggregate. Aggregates typically have a root object that provides unique references for the external world, guaranteeing the integrity of the Aggregate as a whole. Transactions should not cross aggregate boundaries. In DDD, you have a data repository for each Aggregate.

Data Product (Data Mesh): It is an independently provisionable and deployable component focused on storing, processing and serving its data. It is a mixture of code, data and infrastructure with high functional cohesion. From a DDD standpoint, it is pretty similar to an Aggregate.

Output Port (Data Mesh): It is a highly standardized interface, providing read-only and read-optimized access to Data Product’s data.

Source-aligned Data Product (Data Mesh): A Data Product that is ingesting data from an operational system (Golden Source)

Consumer-aligned Data Product (Data Mesh): A Data Product that is consuming other data products to create brand new data, typically targeting more business-oriented needs

So, how do we identify Data Products?

Because a Data Product, from a functional standpoint, is pretty similar to an Aggregate, we can say that the Data Product’s perimeter definition procedure is pretty similar. Keep in mind that this is valid only for Source Aligned Data Products because drivers are changing when we enter into the data value chain.

In Operational Systems, where the main goal is to “run the business”, we define Aggregates by analyzing the business process. There are several techniques to do that. The one I like most is Event Storming (by Alberto Brandolini). It is a kind of workshop that helps you decompose a business process into events (facts) and analyze them to find the best aggregate structure. The steps to follow are straightforward:

  1. Put all the relevant business events in brainstorming mode
  2. Try to organize them with a timeline
  3. Pair with commands and actions
  4. Add actors, triggers, and systems
  5. Try to look for emerging structures
  6. Clean the map and define aggregates

The process itself is simple (I don’t want to cover it because there is a lot of good material out there), but what makes the difference is who and how is driving it. The facilitator must ask the domain experts the right questions and extract the right nuances to crunch the maximum amount of domain information.

There are other techniques like bounded context canvas or domain storytelling, but, if your goal is to identify Data Products, I would suggest using Event Storming because, in Data Mesh, one of the properties that we want to enforce is the immutability and bi-temporality of data, something that you mainly achieve with a log of events.

So, Event Storming is the preliminary step of our process to discover aggregates.

If your operational plane is already implemented with a microservice architecture with DDD, probably you already have a good definition of aggregates but, if you have legacy operational systems as golden sources, you need to start defining how to map and transform centralized and rationalized data management practice into a distributed one.

Once you have your business processes mapped, it’s time to think about data and how they can impact your business. The analytical plane is optimizing your business, and this comes with better and informed decisions.

Now we enter into our methodology: Data Product Flow.

Data Product Flow

Each event is generated by a command/action and a related actor/trigger. To make an action, people in charge of that step need to make decisions. The first step of our process is to let all the participants put a violet card describing the decision driving the creation of that event and related actions. This decision could be the actual one or how it could be in a perfect world. People participating should also think about the global strategy of their domain and how those decisions will help to move a step forward in that direction. We can now introduce the concept of Data Opportunities:

Data Opportunity is:

  • A more intelligent and more effective way to get a decision supported by data. You could be able to automate a decision fully or create some suggestions for it.
  • A brand new business idea that is going to be supported by Data. In that case, it will not be part of the identified aggregates.

At this stage, let’s start to talk about Data Products to emphasize that, from now on, the focus will entirely be on the data journey and how to be data-driven.

In this phase, the facilitator needs to trigger some ideas and do some examples to let the participants realize that now we are changing direction and way of thinking (compared with the previous part of the workshop). We are not focusing anymore on the actual business process, but we are projecting into the future. We are imagining how to optimize and innovate it, making it data-driven.

When people realize that they are allowed to play, they will have fun. It is crucial to let people’s creativity flow.

Once all the business decisions are identified, we start to think about what data could be helpful to automate ( fully or partially) them. Wait, but until now, we didn’t talk about data so much. Where the hell are they?

Ok, it’s time to introduce the read model concept. In DDD, a read model is a model specialized for reads/queries and it takes all the produced events and uses them to create a model that is suitable to answer clients’ queries. Typically we observe this pattern with CQRS, where starting from commands, it is possible to create multiple independent read-only views.

When we transpose this concept in the Data Mesh world, we have Output Ports.

For each Business Decision, let’s identify datasets that can provide help in automating them. Create a green post-it in each data product where we believe it is possible to find the data we need and position them close to the events that probably generate the data underlying such output port.

Please, pay attention to giving a meaningful name to the output port, respecting ubiquitous language and not trying to rationalize it ( there is always time to clean the picture, early optimization is also not good in this field ). I repeat, because it is vital, please provide domain-oriented and extensive output port names, don’t try to define a “table name”.

Do the same for “new business decisions”, those that are not part of the actual business process.

At this stage, your picture should be more or less like the following one.

If you are working in a real environment, you will soon realize something scary: while you are searching for datasets that can help with your business decisions, you would be able to imagine what information could be helpful, but you would not be able to find the right place for it on the board, because simply that kind information does not exist and you need to create it.
Probably this is your first consumer-aligned Data Product !!
Start creating a placeholder (in yellow) and create the output ports needed to support your business decisions.

These new output ports can be part of the same data product, or you may need to split into different data products. To define it, we need to apply some DDD principles again. This time we need to think about the business processes that will generate such data:

  • Is the same business process generating both datasets at the same time?
  • Do we need consistency between them?
  • Can we potentially apply for different long-term ownership on them or not?

There are many other rules and ways to validate data product boundaries, but it is not the focus of this article. Fortunately, in this case, we can unify the process under unique ownership and have two different output ports for the same data product because they have high functional cohesion. Still, we want to provide optimized read models to our customers.

This process can also have several steps, and you need to proceed backwards, but always starting from business decisions.

In this phase, if it is becoming too abstract and technical for domain experts, simplify and skip it; you can work on it offline later. It is essential to don’t lose domain expert attention.

Because Data Mesh is not embracing the operational plane of data, we now need to do something fundamental to stay consistent and model authentic data products.

At this stage, if on your board you have a data product that is taking business decisions leveraging output ports of other data products and also has the “system” post-it, it means you are mixing operational and analytical planes. It cannot be accurate because the business process is happening only in one of the two. Data Product 1 is purely operational because it is not going to make data-driven decisions. Data Product 2/3 processes are happening on operational systems for sure because we started from there. Still, we would like to automate or support some business decisions by reading data from the analytical plan. When you detect this situation is better to split the data product by keeping a source-aligned data product that maps on the operational system’s data and then create an additional DP to include the newly added value business logic. These are more consumer-aligned DPs because they are reading from multiple data products to create something entirely new and new ownership.

To understand better, this last step could be helpful to visualize it differently. With the standard business process (AS-IS), users make decisions, propagating them in the source-aligned DP. Suppose we want to support the user to make smarter or faster decisions. In that case, we can provide some data-driven suggestions through a consumer-aligned data product that is also mixing-in data from other DPs. In the end, it is also the same business decision. Still, it is better to split it into executive decisions and suggestions to separate the operational plane and the analytical one. If you are thinking of shrinking everything in the source-aligned DP, keep in mind that it includes the executive business decision (facts). You will end up mixing operational and analytical planes without any decoupling. It is also dangerous from an ownership standpoint because once you consume data from other domains, the DP will not be “source-aligned” anymore.

Also, if you can automate such business decisions fully, I strongly recommend keeping them as separate Data Products.

How to formalize all this stuff? Can we create a deliverable that is human readable and embeddable in documentation?

DDD is helping us again. We can use Bounded Context Canvas to represent all the Data Products.

I propose two different templates:

  • Source-aligned DP
  • Consumer-aligned DP

The final result is quite neat and clean.
Welcome, Data Product Flow as practice and workshop format.

Author: Paolo Platter, Agile Lab CTO & Co-Founder

STAY TUNED!

If you made it this far and you’re interested in more info about the Data Mesh topics, or you’d like to know more about Data Mesh Boost, get in touch!

 

My name is Data Mesh. I solve problems.

My name is Data Mesh. I solve problems.

In spite of the innumerable advantages it provides, we often feel that technology has made our work harder and more cumbersome instead of helping our productivity. This happens when technology is badly implemented, too rigid in its rules or with a logic far removed from real world needs or even trying to help us when we don’t need it (yes, autocorrect, I’m talking to you: it is never “duck”!). But we would never go back to having to drive with a paper map unfolded on our lap in the “pre-GPS” days or to browsing the yellow pages looking for a hotel, a restaurant, or a business phone number. We need technology that works and that we are happy to work with. Data Mesh, like Mr. Wolf in Quentin Tarantino’s “Pulp Fiction” solves problems, fast, effectively and without fuss. Let’s take a look to see what it means in practice.

Imagine that….

What’s more frustrating than needing something you know you have, but you don’t remember where it is, or if it still works? Imagine that this weekend you want to go to your lake house, where you haven’t been in a long while. There is no doubt you have the key to the house: you remember locking the door on your way out and then thinking long and hard about a safe place where to store it. It may be at home, or even more likely in the bank’s safe deposit. Checking there first, while it wields a high probability of success, means having to drive to the bank, during business hours, and the risk of wasting half a working day for nothing. You can look for the key at home in the evening and it should be faster, but could also take forever if you don’t find it and keep searching, and then if it’s not at home you’d have to go to the bank anyway. You even had a back-up plan: you gave your uncle Bob a copy of the key. The problem is that you can’t remember if you also gave him a new one after you had to change the bolt in 2018…. If you did, Bob’s your uncle, also metaphorically, but are you willing to take the chance, drive all the way to the lake and maybe find out that the old key no longer works? You wanted to go to relax, but it looks like you are only getting more and more stressed out with the preparation, and now you are thinking if it’s really worth it…

That’s a real problem

The above scenario seems to only relate to our private lives.
In reality, this is one of the biggest hurdles in a corporate environment, as well, with the only difference being that at home we look for physical, “tangible” objects, while the challenge in a modern company is finding reliable data. It is not just a matter of knowing where the data is, but also if we can really trust it. In our spare time this is annoying, but we can afford this kind of uncertainty. More frequently than we care to admit, we either spend a disproportionate amount of time looking for what we have misplaced, making sure that it still works or fixing it, or we just plainly give up doing something pleasant and rewarding because we can’t be bothered to search for what we need, not knowing how long it will take. In the ever more competitive business world (whichever your business is, the competition is always tough!), avoiding extra expenses and never passing an opportunity are an absolute must. And you can’t afford these missteps “just” because you can’t find or trust your data.

Today it is highly unlikely that a company doesn’t have the data it needs for a report, a different KPI, a new Business Intelligence or Business Analytics initiative, for analysis to validate a new business proposition and so on. We seem, on the contrary, to be submerged by data and to always be struggling to manage it.
New questions arise. The real questions we face are therefore about the quality of data. Is it reliable? Who created that data? Does it come from within the company or has it been bought, merged after an acquisition, derived from a different origin dataset? Has it been kept up to date? Does it contain all the information I need in a single place? Is it in a format compatible with my requirements? What is the complexity (and hence the cost) of extracting the information I need from that data? Does that piece of information mean what I think it means? Is there anybody else in the company who already leveraged that dataset and, if so, how was their experience?

Ok, so there might be problems that need solving before you can use a certain data asset. That, by itself, is par for the course. You know that every new business analysis, every new BI initiative comes with its own set of hurdles. The real challenge lies in the capability of estimating these challenges beforehand, so as not to incur in budget overruns or other costly delays. It seems that every time you try to analyze the situation you can never get a definitive answer from data engineers regarding data quality or even the time needed to determine if the data is adequate. There is not way you can set a budget, both in terms of time and resources, to resolve the issues when you can’t know in advance which issues the data might or might not have, and you can’t even get an answer regarding how long it will take and how expensive it will be to find out. It is downright impossible to estimate a Time-To-Market if you don’t know either what challenges you will face, or when you will become aware of those challenges. Therefore, how can you determine if a product or service will be relevant in the marketplace, when there is no way to know how long it will take to launch it? Should you take the risk, or should the whole project be canned? This is the kind of conundrum that poor data observability and reusability, combined with an ineffective data governance policies put you into. And that is a place where you really don’t want to find yourself.

What companies have just recently come to realize is that the “data integration” problem is, in general, more likely to be tackled as a social, rather than a technical one. Business units are (or at least should be) responsible for data assets, taking ownership both in the technical and functional sense. In the last decade, on the contrary, Data Warehouse and Data Lake architectures (with all their declinations) took the technical burden away from the data owners, while keeping the knowledge over such data in the hands of the originating Business Units. Unfortunately, a direct consequence has been that the central IT (or data engineering team), once they put in place the first ingestion process, “gained” ownership of such data, thus forcing a Centralized Ownership. Here the integration breaks up: potential consumers who could create value out of data must now go through the data engineering team, who doesn’t have any real business knowledge of the data they provide as ETL outcomes. This eventually ends up in the potential consumers not trusting or not being able to actually leverage the data assets and thus being unable to produce value in the chain.

A brand new solution 

All the above implies that the time has come for not just as another architecture or technology, but rather a completely new paradigm in the data world to addresses and solve all these data integration (and organization) issues. That’s exactly what Data Mesh is and where it comes into play. It is, first and foremost, a new organizational and architectural pattern based on the principle of domain-driven design, a principle proven to be very successful in the field of micro-services, that now gets applied to data assets in order to manage them with business-oriented strategies and domains in mind, rather than being just application/technology-oriented.

In layman’s terms it means data working for you, instead of you working to solve technical complexities (or around them). Moreover, it is both revolutionary, for the results it provides, and evolutionary, as it leverages existing technologies and is not bound to a specific underlying one. Let’s try to understand how it works, from a problem-solving perspective.

Data Mesh is now defined by 4 principles:

  • Domain-oriented decentralized data ownership and architecture 
  • Data as a product 
  • Self-serve data infrastructure as a platform 
  • Federated computational governance. 

We already mentioned that one of the most frequent reasons of data strategies failures is the centralized ownership model, because of its intrinsic “bottleneck shape” and inability to scale up. The adoption of Data Mesh, first of all, breaks down this model transforming it into a decentralized (domain-driven) one. Domains must be the owner of the data they know and that they provide to the company. This ownership must be both from a business-functional and a technical/technological point of view, so to allow domains to move at their own speed, with the technology they are more comfortable with, yet providing valuable and accessible outcomes to any potential data consumer.

Trust and budget

“Data as a product” is apparently a simple, almost trivial, concept. Data is presented as if it were a product, easily discoverable, described in detail (what it is, what it contains, and so on), with public quality metrics and availability guarantee (in lieu of a product warranty). Products are more likely to be sold (reused, if we talk about data) if a trust relationship can be built with the potential consumers, for instance allowing users to write reviews/FAQs in order to help the community share their experience with that asset. The success of a data asset is driven by its accessibility, so data products must provide many and varied access opportunities to meet consumer needs (technical and functional), since the more flexibility consumers find the more likely they are to leverage such data. In the current era, data must have time travel capabilities, and must be accessible via different dynamics such as streams of events (if reasonable considering data characteristics), not just as tables of a database. But let’s not focus too much on the technical side. The truly revolutionary aspect is that now data, as a product, has a pre-determined and well-determined price. This has huge implications on the ability of budgeting and reporting a project. In a traditional system, even a modern one like a Data Lake, all data operations (from ingestion, if the data is not yet available, to retrieval and preparation so you can then have it in the format you need) are demanded to the data engineers. If they have time on their hands and can get to your request right away, that’s great for you, but not so much for the company, as it implies that there is overcapacity. This is not efficient and very, very expensive. If they are at capacity dealing with every-day operational needs and the projects already in the pipeline, any new activity has to wait indefinitely (also because of the uncertainties we discussed before) in a scenario where IT is a fixed cost for the company and therefore capacity can not be expanded on demand. Even if it can be expanded, it is nightmarish to determine how much of the added capacity is directly linked to your single new project, how much of the work can or could be reused in the future, how much would have eventually had to be done and so on. When multiple new projects start at the same time, the entire IT overhead can easily become an indirect cost, and those have a tendency to run out of control in each and every company. With Data Mesh, on the other hand, you have immediate visibility of what data is available, how much (and how well) it’s being used and how much it costs. Time and money are no longer unknown (or worse, unknowable in advance).

A game changer

To understand the meaning of the next two points, “Self-serve data infrastructure as a platform” and “Federated computational governance”, let’s draw a parallel with a platform type that “changed the game” in the past. It is an over-simplification, but bear with us. Imagine you work in Logistics and you need to book a hotel for your company’s sales meeting. You have a few requirements (enough rooms available, price within budget, conference room on premise, easy parking) and a list of “nice-to-have” (half board or restaurant so you don’t have to book a catering for lunch, shuttle to and from the airport for those who fly in, not too far from the airport of from the city center). Your company Data Lake contains all the data you need: it contains all the hotels in the city where the meeting will take place. But so did the “Yellow Pages” of yesteryear. How long does it take to find a suitable hotel using such a directory? It’s basically unknowable: if the Abbey Hotel satisfies all requirements, not too long, but good luck getting to the Zephir Hotel near the end of the list. Not only they are sorted in a pre-determined way (alphabetically), but you have to check each one in sequence by calling on the phone if you want to know availability, price and so on. Moreover, the quantity of information available for each hotel directly in the Yellow Pages is wildly inconsistent. It would be nice to be able to rule out a number of them, so you don’t have to call, but some hotel bought big spaces where they write if they have a conference room or a restaurant, other just list a phone number. If you also need to check the quality of an establishment, to avoid sending the Director of Sales to a squalid dump, the complexity grows exponentially, like the uncertainty of how long it will take to figure it out. Maybe you are lucky and find a colleague who’s already been there and can vouch for it, or you’d have to drive to the place yourself. When you budget the company sale’s meeting how do you figure out how much it will cost, in terms of time and money, just finding the right hotel? And if not having an answer wasn’t bad enough, the Sales Director, who’s paying for the convention, is fuming mad because this problem repeats itself every single year and you can never give an estimate, because the time it took you to find a hotel last time is not at all indicative of how long it will take next time.

If you now change the word “hotel” with “data” in the above example, you’ll see that it might be a tad extreme, but it is not so far-fetched. A Data Mesh is, in this sense, like a booking platform (think hotels.com or booking.com) where every data producer becomes a data owner and, just like a hotel owner, wants to be found by those who use the platform. In order to be listed, the federated governance, imposes some rules. The hotel owner must list a price (cost) and availability (uptime), as well as a structured description of the hotel (dataset) that includes the address, category and so forth, and all the required metadata (is there parking? a restaurant? a pool? breakfast is included? and so on). Each of these features becomes easily visible (it is always present and shown in the same place), searchable and can act as a filter. People who stay at the hotel (use the dataset) also leave reviews, which help both other customers in choosing and the hotel owner in improving the quality of its offer or at least the accuracy of the description.

The “self-serve” aspect is two-fold. From a user perspective it means that with such a platform the Sales department can choose and book the hotel directly, without needing (and paying for) the help of Logistics (Data Lake engineers). From an owner perspective (hotel or data owner) it means that they can independently choose and advertise what services to offer (rooms with air conditioning, Jacuzzis, butler service and so on) in order to meet and even exceed the customers’ wishes and demands. In the data world this second aspect relates to the freedom of data producers to autonomously choose their technology path, in accordance with the federated governance approved standards.

Last, but definitely not least, the Data Mesh architecture brings to the table the ease of scalability (once you have all the Hotels/datasets in one city, the system can grow to accommodate those of other cities, as well) and reuse. Reuse means that the effort you spent in creating a solution can, at least in part, be employed (reused) to create another. Let’s stick to the hotel analogy. If you created it last year and now want to do something similar for B&Bs, there is a lot you don’t have to redo from scratch. Of course, the “metadata” will be different (Bed and Breakfast don’t have conference rooms), but you can still use the same system of user feedback, the same technology to gather information on prices and availability, that once again will be up to the B&B’s owner to keep up to date, and so on.

A no brainer?

Put like that, it seems that “going with the Data Mesh” is a no-brainer. And that can be true for large corporations, but do keep in mind that building a Data Mesh is a mammoth task. If you only have three or four hotels, it goes without saying, it doesn’t make sense to build a booking platform. What’s important to keep in mind, though, is that a Data Mesh architecture, to express its full potential, requires a deep organizational change in the company. To cite the most obvious aspect, the data engineers need to “migrate” from the center (the Data Lake) to the data producers, to guide them in the process of properly preparing the data, conforming to the federated governance rules and exposing it correctly so that it can be found and utilized (thus also generating revenue through internal sales for the data owner). It also requires a change of mentality, so that the whole company can start to visualize data as a product, data producers as data owners and break free from the limitations and bottlenecks of a Data Lake, reaping the benefits of a truly distributed architecture and the new paradigm.

Luca Maestri, Chief Financial Officer of Apple, famously said that people tend to attribute the success of huge companies, the likes of Apple, Amazon, Google, or Facebook, to their being creative laboratories, where a great number of innovative ideas can emerge, but his experience taught him that this is not the case. These companies succeed because they are “execution machines”. In other words, a great idea has no value if you can not execute upon it effectively and quickly: on time and on budget. But, first, you need the right tools to be able to determine time and budget constraints. Creating a Data Mesh is a huge undertaking, but it means building the solid foundations that will support the evolution of your data-driven business. You can have all the data in the world into your Data Lake, but if you can’t leverage it, effectively and sustainably, you won’t move forward. Because in today’s world standing still means going backwards, the only way to stay competitive is to create new products, services, and solutions for your customers. In order to be an “execution machine”, you need to be able to spend your time looking for opportunities, instead of, searching for your data, analyzing the marketplace and chasing new clients and upset propositions, instead of rummaging through your Data Lake.
If you can do all of that, you definitely deserve a relaxing, rewarding week-end, to look at the placid lake from your house and remember the time when your Data Lake was equally still, unmoving and hard to see under the surface. Come Monday, it will be time for a new week of effective, efficient, data-driven new business.

 

On this topic, you might be interested in the How and why Data Mesh is shaping the data management’s evolution, or you can learn more about how Data Mesh boost can get your Data Mesh implementation started quickly.

STAY TUNED!

If you made it this far and you’re interested in other articles on the Data Mesh topics, sign up for our newsletter to stay tuned. Also, get in touch if you’d like us to help with your Data Mesh journey.

The rise of Big Data testing

What is big data testing?

Big data testing is the process of testing data for data and processing integrity; and to validate the quality of big data and ensure data is flawless to help business derive the right insights, according to Hitesh Khodani, Test Transformation Manager.

Marco Scalzo, Big Data Engineer at Agile Lab, adds that when we talk about big data, we refer to a large collection of data, which is potentially unbounded if we are in a real-time streaming model, on which we execute a pipeline of operations in order to process data and store them in an external sink such as a database.

Read the full article on The Software Testing News, the premium online destination for software testing news, reports, white papers, research and more.

Covering all aspects of software testing in all main verticals, you can be sure that Software Testing News will keep you informed and up to date.
The Software Testing News website gives you different possibilities to market directly to your customers. 

Read now!

 

 
 

Extending Flink functions

For the most part, frameworks provide all kinds of built-in functions, but it would be cool to have the chance to extend their functionalities transparently.

In particular, in this article, I want to show you how we solved an issue that emerged while working with Flink: how to add some custom logic to the built-in functions already available in the framework in a transparent way. I will present our solution with a simple example: adding a log that tells how long each element took to be published by a sink, but this solution can be extended to add your custom logic to any Flink component.

In this article, after a short introduction to Flink, I will analyze a little bit more its functions (sinks in particular): the interfaces provided by Flink, the methods they expose, how they are implemented, and a basic interface hierarchy. Then I will proceed to show you how regular solutions (extending the basic interfaces or wrapping them) are not enough to solve the problem. In the end, I am going to show you the actual solution using Java dynamic proxy, which will allow us to extend the functions by adding our custom code. Finally, the last paragraph will try to explain why our solution could be improved even more by handling the Object methods in a specific way: this will allow us to make our dynamic proxy with custom logic completely indistinguishable from the function it enriches.

Flink in a nutshell

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

As reported in the documentation, data can be processed as unbounded or bounded streams:

  1. Bounded streams have a defined start and end. When processing bounded streams you have full control over data since you already know all of it. This allows you to ingest all elements before performing any computation on them (e.g. you can perform a consistent sort of all the events before processing them). Whenever a bounded stream is elaborated we talk about “batch processing”.
  2. Unbounded streams have a known start but no defined end. They are an actual flow of elements continually generated (so they should be continually processed as they arrive). It is not possible to reason about the whole data set since it is continually generated (we cannot sort the input elements), but we can only reason about the time at which each event occurred. The elaboration of unbounded streams is known as “stream processing”.

Even if Apache Flink excels at processing both unbounded and bounded data sets, we will focus on the feature for which it is best known: stream processing. A detailed time and state management enable Flink’s runtime to run different kinds of applications on unbounded streams.

Flink applications rely on the concept of streaming dataflow: each streaming dataflow is a directed graph composed of one or more operators. Each operator receives its input events from one or more sources and sends its output elements to one or more sinks. Flink provides a lot of already defined sources and sinks for the most common external storages (message queues such as Kafka or Kinesis but also other endpoints like JDBC or HDFS) while operators are usually user-defined since they contain all the business logic.

It is very simple to define a streaming flow in Flink: given a Flink environment simply add a source to it, transform the data stream with your logic and publish the results by adding a sink to it:

The data stream code is mapped to a direct logical graph and then executed by the framework. Usually, when the user defines its transformation they will be mapped to a single operator in the dataflow, but sometimes complex transformations may consist of multiple operators.

Flink logic dataflow (image from the official documentation)

All Flink streams are parallel and distributed: each stream is partitioned and each logical operator is mapped to one or more physical operator subtasks. Each operator has associated parallelism that tells Flink how many subtasks it will require (in the same stream different operators can have different parallelisms); each subtask is decoupled from all the others and is executed in a different thread (and maybe on different nodes of the cluster).

Meet the Flink functions

The invoke() method is invoked for each input element, and each sink implementation will handle it by publishing it to the chosen destination.

Since some sinks require additional functionalities to grant data consistency and/or state management, Flink defines extensions of the SinkFunction interface that include other utility methods. The first example could be the abstract class RichSinkFunction that extends not only SinkFunction but also AbstractRichFunction to include initializationtear-down, and context management functionalities:

Each concrete sink will implement only the sink interface they need to perform their logic, some examples could be:

  • DiscardingSink, which is an implementation that will ignore all the input elements, extends only SinkFunction since it does not require state management either initialization;
  • PrintSinkFunction, which writes all the input elements to standard output/error, extends RichSinkFunction because it will initialize its internal writer in the open() method using the runtime context;
  • FlinkKafkaProducer, the official Kafka sink, extends TwoPhaseCommitSinkFunction because it requires some additional functionalities to handle checkpoints and transactions.

Even if they differ in behavior, all these implementations, and other functions that add other functionalities, extend the base interface SinkFunction. Given this structure of the sinks (that is similar for sources and operators too), we can now take a step further to analyze how we can add functionalities to an existing sink.

One does not simply wrap a Flink function

Obviously one could extend each existing sink with a custom class overriding the default behavior; this approach is not sustainable since we would encounter multiple problems: we would define various classes all with the same code replicated, we would need to keep this set of extensions updated at each new implementation released, and we would be stuck in case the implementation overrides the abstract method we need to extend with the final keyword.

So a more reliable first approach would be to define a wrapper around the sink (the wrapper would implement SinkFunction itself) and redirect its invoke method to the inner sink after applying the additional logics:

Now we can add our wrapper to any data stream and it will transparently perform the publishing defined inside the inner sink along with our custom logic:

The problem with this approach is that if we use our SinkWrapper we lose the actual specific interface implemented by the inner sink: e.g. even if the inner sink is a RichSinkFunction, our wrapper would be a simple SinkFunction. This should not be an issue, since as shown above the signature of the addSink() method of DataStream takes a SinkFunction:

Actually, it happens that this is a problem because even if there are no compilation errors, Flink handles differently the various kinds of SinkFunction by checking their type at runtime. In fact, if you take a look at the Flink classes FunctionUtils or StreamingFunctionUtils, you will notice how their methods check the actual type of the function to invoke specific methods on it, e.g.:

So our wrapper solution cannot work for all the sinks that are not only SinkFunction but extend other interfaces too (which are the vast majority of the sinks). Our wrapper would need to implement also these interfaces and redirect all the abstract methods to the wrapped element, but in this way, we are back to a similar problem of defining a specific class for each implementation!

It is dangerous to go alone, take a Java dynamic proxy

An invocation handler is simply an instance responsible for defining a logic associated with the invoke() method: this method will be invoked for all the methods that the proxy exposes. The invoke() method of the handler will be called with three parameters: the proxy object itself (we can ignore it), the method invoked on the proxy and the parameters passed to that method.

As a first example we can define a simple generic InvocationHandler that logs how many times a method was invoked:

Note that in this example the method invoked on the proxy will only log the number of times the method was invoked without doing anything else!

Now that we have a handler we can define an actual proxy that implements the desired interfaces:

As shown in the snippet the proxy creation requires three parameters:

  1. the class loader to define the proxy class
  2. the list of interfaces for the proxy class to implement
  3. the invocation handler to dispatch method invocations to

The advantage of this solution is that the resulting proxy implements all the interfaces that we chose while adding the logic of the handler to all methods invocations.

Almost there, just another step

First of all, we define our wrapper that will contain an existing sink:

Then we need to define an InvocationHandler that contains our logic, and to keep the solution simple the wrapper itself can implement it:

For the moment we can leave the invoke method unimplemented, we are going back to it soon, and let’s focus on the creation of our dynamic proxy. Since we want the users to use our wrapper class transparently, we can make the constructor private and provide a static utility method that will wrap the sink directly with the proxy:

Notice that the wrap() method will be the only access point to our wrapper, and it will always return the dynamic proxy built on top of our wrapper. The ClassUtils.getAllInterfaces() method, which returns all the interfaces implemented by a class, is defined inside Apache’s commons-lang3 which is imported as a dependency by Flink itself.

So whenever a sink is wrapped using the wrap() method, the resulting sink will implement all the interfaces of the original sink but all the methods invoked on it will pass through our handler. Now we can implement the invoke() method to add our logic to the wrapped sink.

Since we need to add our custom logic only to the sink invoke() method, we need to check which method was called: if it is the sink invoke() method we add our logic around the invocation (e.g. logging how much the method took to process the input element) while if it is another one we can invoke it directly on the wrapped sink:

Pretty simple, right? We check if the method intercepted is the SinkFunction invoke() and then we add our logic before and/or after calling it on the wrapped sink. For all the other methods, they are invoked directly on the inner sink.

Our wrapper is ready and we can use it in our data stream by simply wrapping our sink:

And Another Thing…

This would be true for all methods apart from the Java “Object methods”: equals()hashCode(), and toString(); we need to handle these Object methods with specific logics since they are invoked on the proxy and not on the wrapped sink, so we could have inconsistencies if we use the proxy instead of the wrapped element. To solve this issue we can:

  • handle the equals() method to check if the compared object is a proxy too: in this case, we can compare the two inner sinks. If it is not a proxy we can call the underlying equals method directly, avoiding using reflection.
  • redirect the hashCode() and toString() methods directly onto the wrapped sink.

The interesting part of the solution presented is that we can also handle operations specific to particular kinds of sinks since we implement all their interfaces. As an example, in our InvocationHandler we could intercept also the open() and close() methods of RichSinkFunction, knowing that these methods will be invoked only if the wrapped sink is a RichSinkFunction.

If you made it this far, you may be interested in other Big Data articles that you can find on our Knowledge Base page. Stay tuned because new articles are coming!

Written by Lorenzo Pirazzini – Agile Lab Big Data Engineer
 
Remember to follow us on our social channels and on our Medium Publication, Agile Lab Engineering.

 

 

Scala “fun” error handling — Part 2

As the title suggests, this is the second post of a series about error handling in a functional way in Scala.

Last time we saw how we can encode our results into Either or Option and then combine these results in a very simple and effective way. Our code looked like imperative, sequential code, but we always handled the possibility of failure that most functions have.

In this article, we will see that life is not sequential (not always at least) and how we can take care of non-sequential situations in a functional way.

Let’s recall the last exercise we solved:

As it is (kind of) obvious, there aren’t any dependencies between the computation of rlatand rLong. Even if they are not dependant, in our last “solution” we don’t even bother checking the outcome of f1(long)if f1(lat) produces a Left, “thanks” to the short-circuiting property of Either.

In most cases, that would be a nice feature, but it’s not always the case. Sometimes you want to compute all the not inter-dependant paths and return all the errors (for example when you are validating an input form).

We can do it “by-hand” and provide a solution like the following:

So, this works, but… It’s very cumbersome and doesn’t scale at all (if you had 3 results to compute and combine in parallel, you will have to write 8 (2³) cases instead of 4 (2²). There must be something easier and with better scaling properties!

Validated[E, A]

Validated is very similar to Either but with different “rules”. It also has very “nice” type aliases (we will see how useful they are later), and a couple of very handful methods and functions to create and convert them to/from Either:

Ok, enough with the syntax, how do we use this Validated?

Let’s try to write the application of f1 to both lat and long parameters using Validated, we are going to do it step by step.

First of all, let’s define the signature

def applyF1ToBoth(lat: Int, long: Int): Either[E, (Double, Double)]

As we can see from the signature, we want to either fail with an E, or succeed with a pair of doubles.

Then, we apply the function to both latand long and place the results into a tuple, but not as Either, let’s transform them into ValidatedNonEmptyChain.

🤔 Why a ValidatedNonEmptyChain[E, Double]instead of a “simple” Validated[E, Double] (like the “original” return type, which is Either[E, Double])?

That is because we want to be able to combine a non-empty collection of errors (in our case it might be an error or no error at all, but the Chain data structure helps us to combine multiple errors).

As we can see from the code snippet, we have now aTuple2[ValidatedNec[E, Double], ValidatedNec[E, Double]], quite different from the expected result, but don’t worry! 😀

Now we need to take care of the results if they are both successful, and we do so using mapN:

The solution now starts to take form. We can see it because the second generic argument of the ValidatedNec now resembles the one of the return type (Double, Double). Next step is to take care of the (possible) list of errors and reduce it to a single E (which in our case is String), to do so we are only going to concatenate the error strings, placing a \n in between:

Ok, so far so good, we have a Validated instead of an Either, but this conversion is dead simple:
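One method call and we are back to the world of Either; putting all the steps together, the whole function might look like this:

// Step 4: back to Either, matching the declared signature.
val result: Either[String, (Double, Double)] = step3.toEither

// All together:
def applyF1ToBoth(lat: Int, long: Int): Either[String, (Double, Double)] =
  (f1(lat).toValidatedNec, f1(long).toValidatedNec)
    .mapN((rLat, rLong) => (rLat, rLong))
    .leftMap(_.toChain.toList.mkString("\n"))
    .toEither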

Done! 🎉

Now we can compose this new function with the “old” functions to obtain the final result:
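The Part 1 pipeline isn’t repeated here, so f2 below is just a made-up follow-up step; the point is that applyF1ToBoth composes like any other Either-returning function:

def f2(coords: (Double, Double)): Either[String, String] =   // hypothetical next step
  Right(s"processed ${coords._1}, ${coords._2}")

def pipeline(lat: Int, long: Int): Either[String, String] =
  for {
    coords <- applyF1ToBoth(lat, long)
    output <- f2(coords)
  } yield output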

So we’ve just seen an approach that is:

– Less verbose than the match approach
– “Scales” up to 22 elements (the tuple arity limit in Scala 2) of any combination of types

Intuitions

Validated[E, A] and Either[E, A] are “cousins”: they both express a computation that can fail or succeed in some way.

Disclaimer: in this paragraph I will swear a bit, because math is hard, but logical reasoning isn’t.

The main difference is in how they compose:

– with Either you stop at the first Left
– with Validated you go on aggregating Invalid

Unfortunately this aggregation ain’t free! 😬

In fact, you cannot perform the mapN trick in all cases:
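A quick illustration (String has a way to be combined, Throwable does not):

import cats.data.Validated
import cats.implicits._

// Error type String: mapN compiles, and two Invalids would be concatenated.
val ok = (Validated.valid[String, Int](1), Validated.valid[String, Int](2)).mapN(_ + _)

// Error type Throwable: the same call does not compile, because the compiler
// has no way to merge two Throwables into one.
// val ko = (Validated.valid[Throwable, Int](1), Validated.valid[Throwable, Int](2)).mapN(_ + _)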

What’s a semigroupal?

From cats documentation:

Combine an `F[A]` and an `F[B]` into an `F[(A, B)]` that maintains the effects of both `fa` and `fb`
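In code, the typeclass is essentially this (simplified):

trait Semigroupal[F[_]] {
  // combine an F[A] and an F[B] into an F[(A, B)], keeping the effects of both
  def product[A, B](fa: F[A], fb: F[B]): F[(A, B)]
}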

So, why does Either always have a Semigroupal while Validated doesn’t?

  • With Either you stop at the first “error”, you don’t need any logic to aggregate errors because you just don’t aggregate
  • With Validated you have to “aggregate” the error part

The compiler is effectively asking: “how do I do Throwable |+| Throwable?”

There is no “universal” meaning of combining two Throwables, while such a meaning naturally exists for types like String (concatenation), List (concatenation) or Int (addition).

Semigroup

This concept of “merge” or “combine” has a name, and it’s semigroup:
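Stripped down to its essence, it looks like this:

trait Semigroup[A] {
  // combine two values of A into one
  def combine(x: A, y: A): A
}

// The one law it must respect is associativity:
//   combine(a, combine(b, c)) == combine(combine(a, b), c)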

A “type” has a semigroup if its combine operation respects a rule (also known as a law): associativity.

So whenever you define a custom semigroup, be sure to check that it really is a semigroup (i.e. that it’s associative). Just remember: aggregating things ⇨ Semigroup. In cats it is encoded as a typeclass: Semigroup[A]

cats already has semigroups for a lot of types:
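For example (|+| is the combine operator from cats syntax):

import cats.implicits._   // instances + the |+| syntax

"foo" |+| "bar"                   // "foobar"        (String: concatenation)
1 |+| 2                           // 3               (Int: addition)
List(1, 2) |+| List(3)            // List(1, 2, 3)   (List: concatenation)
Map("a" -> 1) |+| Map("a" -> 2)   // Map("a" -> 3)   (Map: merge, combining the values)
Option(1) |+| None                // Some(1)         (Option: combine the inner values, if any)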

TIP: If you are tempted to code a Semigroup instance for a “primitive” type because cats doesn’t have one, think twice: it’s very likely that you are creating an unlawful (or was it chaotic?) Semigroup.

For the most adventurous ones, the concepts that we have just seen correspond to the names of:

  • Semigroup
  • Monoid
  • Applicative
  • Functor
  • Monad
  • MonadError
  • ApplicativeError

Yeah, I know the names are scary… I personally go with intuitions and then in the end figure out the theory.

You might already have figured out that 🐱 namespaces follow this convention:

  • cats.syntax.${thing}._: contains all the extension methods for that “thing” (not only data structures but also typeclasses, like apply for mapN)
  • cats.instances.${thing}._: contains all the “universal” instances for that “thing” (cats.instances.list._ contains Semigroup[List[A]])

Bonus: Traverse

There is a running joke/non-joke in the FP community, which is:

It’s always traverse

So I think it is very useful to spend a few words explaining what traverse is and how it can be exploited.

Traverse has the ability to “turn inside out” things that are “traversable” which contain things that can be “mapped over”.

For example, have you ever needed to turn a List[Option[A]] into an Option[List[A]], or a List[Either[E, A]] into an Either[E, List[A]]?

Start easy with Option:
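For instance (lookup is a made-up function that may “fail” with None):

import cats.implicits._

def lookup(id: Int): Option[String] =
  if (id > 0) Some(s"user-$id") else None

List(1, 2, 3).traverse(lookup)      // Some(List("user-1", "user-2", "user-3"))
List(1, -2, 3).traverse(lookup)     // None: a single missing element makes the whole result None

List(Option(1), Option(2)).sequence // Some(List(1, 2)), sequence is just traverse(identity)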

How does this work?

That’s possible because:

  • An instance of Traverse[List[_]] exists
    — The content does not matter, hence _
  • An instance of Applicative[Option[_]] exists
    — The content does not matter, hence _

So, is it possible to do the same with Either?
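For example, with a made-up parse function (Scala 2.13’s toIntOption keeps it short):

import cats.implicits._

def parse(s: String): Either[String, Int] =
  s.toIntOption.toRight(s"'$s' is not a number")

List("1", "2", "3").traverse(parse)        // Right(List(1, 2, 3))
List("1", "boom", "nope").traverse(parse)  // Left("'boom' is not a number"): stops at the first error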

That’s possible because:

  • An instance of Traverse[List[_]] exists
    — The content does not matter, hence _
  • An instance of Applicative[Either[E, _]] exists
    — E must be fixed; there must be just one “hole”

I guess we can do kind of the same with Validated too…

Sure!
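A sketch along the same lines, with a made-up parseV that accumulates every error:

import cats.data.{ NonEmptyChain, Validated, ValidatedNec }
import cats.implicits._

def parseV(s: String): ValidatedNec[String, Int] =
  Validated.fromOption(s.toIntOption, NonEmptyChain.one(s"'$s' is not a number"))

List("1", "2", "3").traverse(parseV)       // Valid(List(1, 2, 3))
List("1", "boom", "nope").traverse(parseV) // Invalid, carrying both the "boom" and the "nope" errors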

That’s possible because:

  • An instance of Traverse[List[_]] exists
    — The content does not matter, hence _
  • There is no (unconditional) instance of Applicative[Validated[E, _]]
  • But an instance of ApplicativeError[Validated[E, _], E] exists, and
  • An instance of Semigroup[E] exists

So:
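Roughly, this is the chain of implicits the compiler walks for us (kind-projector’s * syntax used for brevity):

import cats.{ Applicative, ApplicativeError, Semigroup }
import cats.data.{ NonEmptyChain, ValidatedNec }
import cats.implicits._

Semigroup[NonEmptyChain[String]]                                  // 1. provided by cats out of the box
ApplicativeError[ValidatedNec[String, *], NonEmptyChain[String]]  // 2. Validated's instance, built on 1.
Applicative[ValidatedNec[String, *]]                              // 3. found because ApplicativeError extends Applicative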

And that is automatically solved by Scala implicits ❤️

Take away

  • These “things” are real
  • You end up dealing with these “things” every day
  • You just don’t model them this way
    — or don’t model them at all
  • If you do this, your signature will speak to the world (and the compiler!)

Q&A

But I usually obtain the same things “manually”

This means you need them, but you have your own custom “rules” that only your team knows; cats, instead, is “universally accepted”.

So now every function will return an Either or a Validated, and exceptions are never thrown?

No. My personal advice is:

  • use only Either in signatures
    — transform them to Validated when you need to mapN
  • catch only business errors, and let exceptions “bubble up”
    — exceptions should be unexpected and, 99% of the time, transient (a broken DB, a connection failure, and so on)

I didn’t; we will see how to handle REAL exceptions in a principled and typesafe manner, but not today.

The thing is:

  • what we have seen so far should not deal with “side-effects”
  • it’s good for business logic, not for interacting with the outer world

For that, there might be other articles in the Future[_].

Finally, I’ll leave you with a couple of @impurepics pics, because if you got this far, you deserve it.

What our business logic should look like:

If you made it this far, you may be interested in Part 1 or in other tech articles that you can find on our Knowledge Base page.

Stay tuned because other articles are coming!

Written by Antonio Murgia – Agile Lab Big Data Architect
 
Images by @impurepics
 
Remember to follow us on our social channels and on our Medium Publication, Agile Lab Engineering.