Data Product Specification: power up your metadata layer and automate your Data Mesh with this practical reference

As every informative piece of content in the Data Mesh space should, I’d like to start by quoting the reference article on the topic, by Zhamak Dehghani:

[…] For data to be usable there is an associated set of metadata including data computational documentation, semantic and syntax declaration, quality metrics, etc; metadata that is intrinsic to the data e.g. its semantic definition, and metadata that communicates the traits used by computational governance to implement the expected behavior e.g. access control policies.

So, a “set of metadata” is to be associated with our Data Products. I’ve read a lot of articles (I’ll report some references along the way) about why it’s important and what can theoretically be achieved by leveraging it… they all make a lot of sense. But — as frequently happens when someone talks about Data Mesh related aspects — there’s often a lack of “SO WHAT”.

For this reason, after a brief introduction, I’d like to share a practical approach we are proposing at Agile Lab. It’s open-source, it’s certainly not perfect for every use case, but it’s evolving along with our real-world experience of driving enterprise customers’ Data Mesh journeys. Most importantly, it’s currently used in production (not just in articles) at some important customers of ours, to enable the automation their Data Mesh platforms need.

Why metadata is so important

I believe that automation is the only way to survive in a jungle of domains willing to create Data Products. Automation with respect to:

  1. easy and fast creation of Data Products leveraging pre-built reusable templates that guarantee out-of-the-box compliance to the Data-as-a-Product principle;
  2. implementation and execution of the Federated Governance policies (which therefore become “computational”);
  3. enablement of Self-Service Infrastructure, as a fundamental pillar to provide Decentralized Domains the autonomy and ownership they need.

These pillars are all mandatory to reach scale. They are complex to set up and spread, both as cultural and as technical requirements, but they are foundational if we want to build a Data Mesh and harvest that so-widely-promised value from our data.

But how can we get to this level of automation?
By leveraging a metadata model representing/associated with a Data Product.

Such a model must be general, technology agnostic, and imagined as the key enabler for the automation of a Data Mesh platform, intended as an ecosystem of components and tools that take actions based on specific sections of this metadata specification. The specification must be standardized to allow interoperability across the platform’s components and services that need to interact with each other.

State of the art

No clear sight due to an Atlantic perturbation — 2014, Madeira Island.
Photo by Roberto Coluccio — all rights reserved.

In order to provide a broader view on the topic, I think it’s important to report some references to understand what brought us to do what we did (OK, now I’m creating hype on purpose 😁).

  • Feed the Control Ports: Arne Roßmann in this article describes how control ports of Data Products should serve foundational APIs by leveraging metadata collected at different levels.
  • Active Metadata: Prukalpa Sankar in this article mentions a commercial product tackling the active metadata space as a way to autonomously collect precious information: from parsing SQL plans to understand data lineage, to monitoring Data Products’ update rates, and other very cool stuff.
  • Metadata types: Zeena in this post introduces their Data Catalog solution after reporting a very interesting differentiation of metadata into DESCRIPTIVE, STRUCTURAL, ADMINISTRATIVE.
  • Standards for metadata in general: OpenMetadata is doing excellent work in creating standards and a framework for metadata management, including a specific vocabulary and syntax (even if in some cases it becomes quite complex IMHO).
  • General Data Product definition: what gets closest to our vision is OpenDataProducts, good work that factors out some important information around the general Data Product concept (e.g. Billing and SLA). However, generalizing too much is risky and, in this case, I believe they strayed a little too far from the way Data Mesh requires a Data Product to be described and built. For example, there’s no mention of Domains, Schema, Output Ports, or Observability.

According to these references, there’s still no clear metadata-based representation of a Data Product addressing — specifically — the Data Mesh and its automation requirements.

How we envision a Data Product Specification

We believe in a declarative, idempotent, and versioned Infrastructure-as-Code approach. The goal is to have a standardized yet customizable specification providing a holistic representation of the Data Mesh’s architectural quantum: the Data Product.

Principles involved:

  • Data Product components (input ports, output ports, workloads, staging areas)
  • self-documenting Data Product (from schema to dependencies, to description and access patterns)
  • clear ownership (from Domain to Data Product team)
  • semantic linking with other Data Products (like polysemes to help improve interoperability and references to build a semantic layer or Knowledge Graph on top of the Data Mesh)
  • data observability
  • SLA/SLO and Data Contracts (expectations, agreements, objectives)
  • Access and usage control
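
To make this concrete, here is a sketch of what a YAML descriptor touching these aspects could look like. All field names below are illustrative, not the actual vocabulary, which is documented field by field in the specification repository:

```yaml
# Hypothetical Data Product descriptor sketch (field names are illustrative)
id: finance.customer-invoices.1          # unique identifier of the Data Product
name: Customer Invoices
domain: finance                          # clear ownership: the owning Domain...
dataProductOwner: jane.doe@example.com   # ...and the accountable owner
description: Curated invoices issued to customers, refreshed daily
outputPorts:
  - name: invoices_view
    schema:
      - { name: invoice_id, type: string }
      - { name: customer_id, type: string }  # polyseme shared with other Data Products
observability:
  endpoints:
    - https://observability.example.com/finance/customer-invoices
dataContract:
  freshness: 24h                         # SLO agreed with consumers
tags:
  - PII                                  # traits consumed by computational governance
```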

The Data Product Specification aims to gather together crucial pieces of information related to all these aspects, under the strict ownership of the Data Product Owner.

What we did so far

In order to fill this gap, we tried to create a Data Product Specification by starting this open-source initiative:

https://github.com/agile-lab-dev/Data-Product-Specification

The repo contains detailed field-by-field documentation; however, I’d like to point out here some features I believe to be important:

  • One point of extension and, at the same time, standardization is the expectation that some of the fields (like tags) are expressed following the OpenMetadata specification.
  • Depending on the context, several of those fields can be considered optional: whether they are populated is meant to be checked by specific federated governance policies.
  • Although there’s an aim for standardization, the model is open to future add-ons: your Data Mesh (and platform) will probably keep evolving for a long time, and the Data Product Specification is supposed to do the same along the way. You can start simple, then add pieces one by one (together with the related governance policies).
  • Being in YAML format, it’s easy to parse and validate, e.g. via a JSON parsing library like circe.
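
As a sketch of that parse-and-validate step: the bullet above mentions circe (a Scala JSON library), while this illustration is in Python and assumes the YAML has already been loaded into a dict (e.g. with PyYAML); the required-field list is a hypothetical policy, not the official one.

```python
# Validate a parsed Data Product descriptor (already loaded from YAML into a dict).
# The required-field list below is a hypothetical policy, not the official spec.
REQUIRED_FIELDS = ["id", "domain", "dataProductOwner", "outputPorts"]

def validate_descriptor(descriptor: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the descriptor passes."""
    errors = [f"missing required field: {field}"
              for field in REQUIRED_FIELDS if field not in descriptor]
    # Presence is not enough: at least one output port must actually be declared.
    if not descriptor.get("outputPorts"):
        errors.append("at least one output port is required")
    return errors

descriptor = {
    "id": "finance.customer-invoices.1",
    "domain": "finance",
    "dataProductOwner": "jane.doe@example.com",
    "outputPorts": [{"name": "invoices_view"}],
}
print(validate_descriptor(descriptor))  # → []
```

A platform would typically run such checks in a CI pipeline, rejecting a Data Product release whose descriptor does not validate.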

The Data Product Specification itself covers the main components of a Data Product:

  • general metadata about the bounded context and Domain the Data Product belongs to (to support descriptive APIs on control ports)
  • clear ownership references
  • input ports, workloads, internal storage/staging areas, output ports, and logical dependencies between them
  • data observability
  • data documentation, semantic linking info, schema, usage guidelines (from serialization format and access patterns to billing models)

NOTE: what is presented here as a “conceptual” quantum can be (and, in some of our real-world implementations, is) split into its main component parts, which are then version-controlled with Git in their own repositories (belonging to a common group, the Data Product’s one).

Key benefits

The Data Product Specification is intended to be heavily exploited by a central Data Mesh-enabling platform (as a set of components and tools) for:

  • Data Product creation boost, since templates can be created out of this specification (both for single components and for whole Data Products). They can then be evolved, also hiding some complexity from end users.
  • Federated Governance policies execution (e.g. a policy could state “the DP description must be at least 300 chars long, the owner must be a valid ActiveDirectory user/group, there must be at least 1 endpoint for Data Observability, the region for the S3 bucket must be US-East-1”, etc.), thus making “approved templates” compliant by design with the Data Mesh principles and policies.
  • Change Management improvement, since specifications are version controlled and well structured, making it possible to detect breaking changes between different releases.
  • Self-service Infrastructure enablement, since everything a DP needs in order to be provisioned and deployed is included in a specific and well-known version of the DP specification, thus allowing a platform to break it down and trigger the automated provisioning and deployment of the necessary storage/compute/other resources.
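
The policy-execution benefit above can be sketched as a set of small, composable checks run by the platform against a descriptor; the field names and thresholds here are hypothetical, taken from the example policy statement:

```python
# Computational governance sketch: each policy inspects a Data Product descriptor
# (a dict) and returns a list of violations. Field names/thresholds are hypothetical.

def check_description(dp: dict) -> list[str]:
    # "the DP description must be at least 300 chars long"
    if len(dp.get("description", "")) < 300:
        return ["description must be at least 300 characters long"]
    return []

def check_observability(dp: dict) -> list[str]:
    # "there must be at least 1 endpoint for Data Observability"
    if not dp.get("observability", {}).get("endpoints"):
        return ["at least one data observability endpoint is required"]
    return []

def check_bucket_region(dp: dict) -> list[str]:
    # "the region for the S3 bucket must be US-East-1"
    return [f"bucket {b['name']} must be in us-east-1"
            for b in dp.get("storage", [])
            if b.get("region") != "us-east-1"]

POLICIES = [check_description, check_observability, check_bucket_region]

def run_policies(dp: dict) -> list[str]:
    """Aggregate the violations reported by every registered policy."""
    return [violation for policy in POLICIES for violation in policy(dp)]

dp = {
    "description": "short",
    "observability": {"endpoints": ["https://obs.example.com/dp"]},
    "storage": [{"name": "dp-bucket", "region": "eu-west-1"}],
}
for violation in run_policies(dp):
    print(violation)  # prints 2 violations: description too short, wrong bucket region
```

Because every policy has the same shape, new federated governance rules can be added over time without touching the platform’s execution engine.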

As I said, this specification is evolving and can surely be improved. Being open-source, every contribution is welcome.

Wrapping up

How can a Data Mesh journey reach scale? Only with automation that guarantees out-of-the-box compliance of Data Products with the Data Mesh principles.

If you’re interested in more articles on Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Thank you for reading.

Author: Roberto Coluccio, Agile Lab Data Architect

Meet us at the Politecnico di Milano Career Day 2022

We’re excited to be at the next Career Day by Politecnico di Milano, on 9th and 10th May 2022! It will take place both in person, at the Bovisa Campus, and online through an ad hoc platform!

During the two days, you will have the chance to meet us, directly or virtually, at our booth, to get to know our team and learn more about our jobs, activities, and life at Agile Lab.

Are you ready? Registrations close on May 5th.

Reserve your seat now!

…98, 99, and 100!!

We’re proud to announce that we officially reached the goal of 100 employees in the team!
We are immensely proud of our ever-growing team, which brought in years of experience and expertise to grow Agile Lab. Their commitment and dedication drive our success, year after year.

To celebrate this achievement, let’s look back at some milestones of our history:

  • 2013: we opened our first offices, in Turin and Milan
  • 2017: we started our growth, with a team of 17 people
  • 2018: we doubled the team in a year and, with 30 employees, opened our office in Bologna
  • 2019: we started following the Holacracy methodology, with our team of 40 members, and opened a new office in Catania
  • 2020: we introduced the Certification Coach to increase the technical skills of our team (50 members)
  • 2021: we continued our growth (85 members), opened offices in Padova and Bari, and introduced the Smartworking Plus experience in Fuerteventura
  • 2022: we welcomed the 100th member of the team, but we are always in search of new talent. If you want to be a part of it, take a look at our open positions.

Meet Agile Lab at the Data Innovation Summit 2022

We’re thrilled to announce that Agile Lab will be one of the sponsors of the 7th edition of the Data Innovation Summit in Stockholm. On the 5th and 6th of May, the team will be happy to meet you at our booth, A03, where you will have the chance to learn how to accelerate your journey to scalable Data Mesh in production.

On the first day:

– at 13:30, Paolo Platter, our CTO, and Gaetano Ruggiero, Enterprise Information Architect at Poste Italiane, will have a joint talk, “How Poste Italiane leads its journey to Data Mesh leveraging Data Mesh Boost to avoid pitfalls and gain faster time to market”, on the M4 Data Management Stage

– at the same time, Lorenzo Pirazzini, Big Data Engineer at Agile Lab, is going to hold a workshop session, “Your first Data Products in 100 minutes”, in room W4, where he’s going to show how you can easily create and deploy Data Products using Data Mesh Boost.

The Data Innovation Summit is the largest and most influential annual Data and AI event in the Nordics and beyond, bringing together the most innovative minds, enterprise practitioners, technology providers, start-up innovators and academics, working with Data Science, Big Data, ML, AI, Data Management, Data Engineering, IoT and Analytics, in one place to discuss ways to accelerate AI-driven Transformation throughout companies, industries and public organisations.

The Agile Lab team looks forward to meeting you. See you there!

March 2022

What’s in store – March ’22

An inspiring article by Roberto Coluccio, Agile Lab Data Architect, with 10 practical tips that can help companies reduce Data Mesh adoption roadblocks and improve domains’ engagement.

The transition towards Data Mesh can be tough if you don’t have engaged and motivated teams. It’s perfectly normal to be scared of the unknown, to be worried about getting out of that comfort zone, to question why and how this big change will transform an organization.

Read the article and find out how to address your journey towards Data Mesh.

READ NOW

Data Innovation Summit 2022 is coming!
On the 5th and 6th of May, we will be in Stockholm for the 7th edition of the most influential Data, AI and Advanced Analytics event in the Nordic region.
It will be an extraordinary opportunity to see Data Mesh Boost in action.

LEARN MORE

How does the Customer 360 view fit into the Data Mesh paradigm?
To answer the question, read the article by Roberto Coluccio, Data Architect at Agile Lab, who suggests some practical approaches to deal with this issue.

READ NOW

For the second year in a row, the Financial Times and Statista ranked Agile Lab in the FT 1000, the list of Europe’s Fastest-Growing Companies.

LEARN MORE

We are always looking for the best talent on the market!
Check out our open positions and find the one that’s right for you.

Mentoring, coaching, or sponsoring?
Mentoring and coaching activities look similar, but the impetus is different.
In mentoring relationships, the mentee usually sets the agenda. In a coaching relationship, the coach usually sets the agenda. 
Sponsorship is very different. A sponsor uses their authority and influence to create opportunities and recognition to advance someone’s career.
Have you ever thought about this?

LEARN MORE

Data Engineering is a practice, not (just) a role…
If you’re interested in finding out more about Elite Data Engineering, read the article by Paolo Platter, Agile Lab CTO.

READ NOW

Join us at the Recruiting Day Verona 2022

Talents meet opportunities!

From 2 to 8 May 2022, don’t miss the opportunity to meet us at RECRUITING DAY VERONA.

Organised by the University of Verona and the Chamber of Commerce of Verona, this is the fourth completely digital edition!

We will be there, join us!

To learn more, visit the website: https://www.recruitingverona.it/