July 2022 – Agile Lab News

What’s in store  

July 2022

Welcome to the summer issue of the Agile Lab’s newsletter!

Given these hot days, we will keep it short, but you will still stay up to date with what we are up to ☀

These are exciting days for us since we are getting ready for Big Data London 2022! On September 21-22, Agile Lab is going to sponsor the UK’s leading data & analytics conference and exhibition, which brings together top data and analytics experts. We are going to give a talk in the Data Mesh Theatre and we hope to meet you at stand 559! If you’d like to know more, go to the event website and reserve your ticket. It’s free!

🧳 Are you ready for your summer trip too? While packing, you might want to hear about some of our success stories.

Data Mesh is not just a matter of technology or a data platform: it is a process that involves the whole company, working in cross-functional teams across different domains. If you want to learn more about the new paradigm and how data-driven organizations lead their Data Mesh journey, take a look at our website and hear directly from top companies how they succeeded in their Data Mesh evolution.

We are always looking for the best talent on the market!
Check out our open positions and find the one that’s right for you.

Stay tuned!

Unlock the potential of Big Data!

Since 2014, Agile Lab has been creating value for its customers in data-intensive environments, harnessing the power of highly innovative technologies and enabling services.

In the article, the two founders talk about the growth milestones achieved in recent years and the collaboration with prestigious companies, such as Vodafone Automotive.
Francesco Fabrizi, CTO at Vodafone Automotive, explains how the company has accelerated its transformation by leveraging the potential of Big Data with Agile Lab.

 

That’s why we need to stop talking technology – and start focusing on the real value of data

Few topics get as much lip service in today’s organizations as how to best take care of their data. Business leaders are often good at talking about how the organization should make better use of its data to meet the problems it faces, but too often it is just playing to the gallery. The result is frustrated IT departments that lack mandates and tools, and a company management that is eventually forced to realize that it is not running a modern operation.

By Paolo Platter – Agile Lab CTO & Co-founder

 

May 2022 – Agile Lab News

What’s in store  

May 2022

Agile Lab has developed Data Mesh Boost, a ready-to-use solution to manage the entire life cycle of your Data Products, from creation to consumption, speeding up time to market and lowering implementation costs.
Discover more!

Data Product Specification
Roberto Coluccio, Big Data Architect & Project Lead at Agile Lab, talks about a practical approach we are using with some important customers of ours to enable the automation needed by their platform supporting their Data Mesh. Read the article!

Learn how Enel Group worked with Agile Lab to implement Dremio as a data mesh solution for providing broad access to a unified view of their data, and how they use that architecture to enable a multitude of use cases.
Register soon!

Big Data & AI Torino Group is back with a new HYBRID MeetUp! 🎉 On June 14th, with Emanuele Maffeo, Senior Big Data Engineer at Agile Lab, and Mattia Zeni, Solutions Architect Data & AI at Databricks, we will talk about how to implement data pipelines using Azure Databricks and about the Data Lakehouse Platform, with a focus on Delta.
Reserve your seat or join remotely!

Last month we officially reached the milestone of our 💯th employee!
We are immensely proud of our ever-growing team, which brought in years of experience and expertise to grow Agile Lab. Their commitment and dedication drive our success!

But we do not want to stop!
We are always looking for the best talent on the market, and we give you the Top 10 reasons to choose us!
Check out our open positions and find the one that’s right for you.

Stay tuned!

Meetup – Data Quality with Azure Databricks & Databricks Data Lakehouse & Delta – 14th June 2022 | 6:30 PM CET

Big Data & AI Torino Group is back with a new HYBRID meetup!

On June 14th, with Emanuele Maffeo, Senior Big Data Engineer at Agile Lab, and Mattia Zeni, Solutions Architect Data & AI at Databricks, we will talk about how to implement data pipelines using Azure Databricks and about the Data Lakehouse Platform, with a focus on Delta.

Learn more and reserve your seat now!

How Enel Group built a data mesh architecture with Dremio and Agile Lab – June 9, 2022 | 9 AM CEST

Data teams are facing greater demand for data to satisfy a wide range of analytic use cases. But providing access to all of their data across business units, regions, and cloud environments is a major challenge.

In this webinar, with:

Nicolò Bidotti, Big Data Architect at Agile Lab

Jeremiah Morrow, Partner Solution Marketing Director at Dremio

Achille Barbieri, Senior Project Manager at ENEL

we will talk about how Enel Group, one of the world’s largest electric utility companies, partnered with Agile Lab to implement Dremio as a data mesh solution, and provided data consumers with a unified view of their data.

Don’t miss it!

Tech&More – Azure Cloud Architectures for Deployment of Data Science Models – May 5, 2022 | 6:30 PM CEST

On 5th May 2022, at 6:30 PM CEST, we are going to talk about Azure Cloud architectures for the deployment of Data Science models.

With Pierangelo Calanna, Senior AI Engineer | Solutions Architect at Avanade, and Vincenzo Cassaro, Big Data Engineer at Agile Lab, we will have a presentation about running data science models on the cloud and the process of designing a secure, scalable, and cost-effective solution on Azure Cloud.

Learn more and reserve your seat now!

Data Product Specification: power up your metadata layer and automate your Data Mesh with this practical reference

As any informative piece of content about Data Mesh should, I’d like to start by quoting the reference article on the topic, by Zhamak Dehghani:

[…] For data to be usable there is an associated set of metadata including data computational documentation, semantic and syntax declaration, quality metrics, etc; metadata that is intrinsic to the data e.g. its semantic definition, and metadata that communicates the traits used by computational governance to implement the expected behavior e.g. access control policies.

A “set of metadata” is therefore to be associated with our Data Products. I’ve read a lot of articles (I’m gonna report some references along the way) about why it’s important and what can theoretically be achieved by leveraging it… they all make a lot of sense. But, as frequently happens when someone talks about Data Mesh-related aspects, there’s often a lack of “SO WHAT”.

For this reason, I’d like to share, after a brief introduction, a practical approach we are proposing at Agile Lab. It’s open source, it’s absolutely not perfect for every use case, but it’s evolving along with our real-world experience of driving enterprise customers’ Data Mesh journeys. Most importantly, it’s currently used, in production and not just in articles, at some important customers of ours to enable the automation needed by the platform supporting their Data Mesh.

Why metadata are so important

I believe that automation is the only way to survive in a jungle of domains willing to create Data Products. Automation with respect to:

  1. easy and fast creation of Data Products leveraging pre-built reusable templates that guarantee out-of-the-box compliance with the Data-as-a-Product principle;
  2. implementation and execution of the Federated Governance policies (which therefore become “computational”);
  3. enablement of Self-Service Infrastructure, as a fundamental pillar to provide Decentralized Domains the autonomy and ownership they need.

Together, these pillars are all mandatory to reach scale. They are complex to set up and spread, both as cultural and technical requirements, but they are foundational if we want to build a Data Mesh and harvest that so-widely-promised value from our data.

But how can we get to this level of automation?
By leveraging a metadata model representing/associated with a Data Product.

Such a model must be general, technology agnostic, and imagined as the key enabler for the automation of a Data Mesh platform, intended as an ecosystem of components and tools taking actions based on specific sections of this metadata specification. The specification must therefore be standardized to allow interoperability across the platform’s components and services that need to interact with each other.

State of the art

Photo: no clear sight due to an Atlantic perturbation, 2014, Madeira Island (by Roberto Coluccio, all rights reserved).

In order to provide a broader view on the topic, I think it’s important to report some references to understand what brought us to do what we did (OK, now I’m creating hype on purpose 😁).

  • Feed the Control Ports: Arne Roßmann in this article describes how control ports of Data Products should serve foundational APIs by leveraging metadata collected at different levels.
  • Active Metadata: Prukalpa Sankar in this article mentions a commercial product tackling the active metadata space as a way to autonomously collect precious pieces of information: from SQL plans used to understand data lineage, to Data Product update-rate monitoring, and other very cool stuff.
  • Metadata types: Zeenea in this post introduces their Data Catalog solution after reporting a very interesting differentiation of metadata into DESCRIPTIVE, STRUCTURAL, and ADMINISTRATIVE.
  • Standards for metadata in general: OpenMetadata is doing excellent work in creating standards and a framework for metadata management, including specific vocabulary and syntax (even if in some cases it becomes quite complex IMHO).
  • General Data Product definition: the closest to our vision is OpenDataProducts, a good effort to factor out some important information around the general Data Product concept (e.g. Billing and SLA). However, generalizing too much is risky and, in this case, I believe they strayed a little too far from the way Data Mesh requires a Data Product to be described and built. For example, there’s no mention of Domains, Schema, Output Ports, or Observability.

According to these references, there’s still no clear metadata-based representation of a Data Product addressing — specifically — the Data Mesh and its automation requirements.

How we intend a data product specification

We believe in a declarative, idempotent, and versioned Infrastructure-as-Code approach. The goal is to have a standardized yet customizable specification addressing the holistic representation of the Data Mesh’s architectural quantum: the Data Product.

Principles involved:

  • Data Product components (input ports, output ports, workloads, staging areas)
  • self-documenting Data Product (from schema to dependencies, to description and access patterns)
  • clear ownership (from Domain to Data Product team)
  • semantic linking with other Data Products (like polysemes to help improve interoperability and references to build a semantic layer or Knowledge Graph on top of the Data Mesh)
  • data observability
  • SLA/SLO and Data Contracts (expectations, agreements, objectives)
  • Access and usage control

The Data Product Specification aims to gather together crucial pieces of information related to all these aspects, under the strict ownership of the Data Product Owner.

What we did so far

In order to fill this gap, we tried to create a Data Product Specification by starting this open-source initiative:

https://github.com/agile-lab-dev/Data-Product-Specification

The repo contains detailed field-by-field documentation; however, I’d like to point out here some features I believe to be important:

  • One point of extension and — at the same time — standardization is the expectation that some of the fields (like tags) are expressed in the OpenMetadata specification.
  • Depending on the context, several of those fields can be considered optional: whether (and how) they are populated is supposed to be checked by specific federated governance policies.
  • Although there’s an aim for standardization, the model is open to future add-ons: your Data Mesh (and platform) will probably keep evolving for a long time, and the Data Product Specification is supposed to do the same along the way. You can start simple, then add pieces (and the related governance policies) one by one.
  • Being in YAML format, it’s easy to parse and validate, e.g. via a JSON parsing library like circe (a minimal sketch follows this list).
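
For illustration, here is a minimal sketch, assuming a simplified, hypothetical descriptor with made-up field names (the actual schema lives in the repository above), of how such a YAML descriptor could be decoded into a typed model with circe and circe-yaml:

```scala
// Minimal sketch: decoding a hypothetical, simplified Data Product descriptor
// written in YAML into a typed model, using circe-yaml + circe-generic.
// Field names are illustrative; refer to the Data-Product-Specification repo
// for the actual schema.
import io.circe.generic.auto._
import io.circe.yaml.parser

final case class OutputPort(id: String, description: String)
final case class DataProductDescriptor(
    id: String,
    domain: String,
    description: String,
    dataProductOwner: String,
    outputPorts: List[OutputPort]
)

val descriptorYaml: String =
  """id: finance.invoices.1
    |domain: finance
    |description: Curated, validated invoices exposed to downstream consumers
    |dataProductOwner: jane.doe@example.com
    |outputPorts:
    |  - id: invoices-view
    |    description: SQL view over the validated invoices table
    |""".stripMargin

// parser.parse returns Either[ParsingFailure, Json]; .as[...] then applies the derived decoder
val decoded: Either[io.circe.Error, DataProductDescriptor] =
  parser.parse(descriptorYaml).flatMap(_.as[DataProductDescriptor])

decoded.fold(
  err => println(s"Invalid descriptor: $err"),
  dp  => println(s"Parsed Data Product '${dp.id}' owned by ${dp.dataProductOwner}")
)
```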

The Data Product Specification itself covers the main components of a Data Product:

  • general bounded context and domain metadata the Data Product belongs to (to support descriptive API on control ports)
  • clear ownership references
  • input ports, workloads, internal storage/staging areas, output ports, and logical dependencies between them
  • data observability
  • data documentation, semantic linking info, schema, usage guidelines (from serialization format and access patterns to billing models)

NOTE: what is presented here as a “conceptual” quantum can be (and, in some of our real-world implementations, is) split into its main constituent parts, which are then git-version-controlled in their own repositories (belonging to a common group, which is the Data Product one).

Key benefits

The Data Product Specification is intended to be heavily exploited by a central Data Mesh-enabling platform (as a set of components and tools) for:

  • Data Products creation boost, since templates can be created out of this specification (for the single components and for whole Data Products). They can then be evolved, also hiding some complexity from end users.
  • Federated Governance policies execution (e.g. a policy statement could be “the DP description must be at least 300 chars long, the owner must be a valid ActiveDirectory user/group, there must be at least 1 endpoint for Data Observability, the region for the S3 bucket must be US-East-1”, etc.), thus making “approved templates” compliant by design with the Data Mesh principles and policies. A toy sketch of such checks follows this list.
  • Change Management improvement, since specifications are version controlled and well structured, making it possible to detect breaking changes between releases.
  • Self-service Infrastructure enablement, since everything a DP needs in order to be provisioned and deployed is included in a specific, well-known version of the DP specification, allowing a platform to break it down and trigger the automated provisioning and deployment of the necessary storage/compute/other resources.
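
To make the “computational” part tangible, here is a toy sketch, building on the hypothetical descriptor model from the previous snippet and not reflecting Agile Lab’s actual policy engine, of governance policies expressed as plain functions that a platform could run automatically at deploy time:

```scala
// Toy sketch: federated governance policies as plain functions over the
// hypothetical DataProductDescriptor defined in the previous snippet.
// A real platform would plug in actual identity lookups, catalogs, etc.
final case class PolicyViolation(policy: String, detail: String)

def checkDescriptionLength(dp: DataProductDescriptor, minChars: Int = 300): Option[PolicyViolation] =
  if (dp.description.length >= minChars) None
  else Some(PolicyViolation(
    "description-min-length",
    s"description is ${dp.description.length} chars, expected at least $minChars"))

def checkOwnerFormat(dp: DataProductDescriptor): Option[PolicyViolation] =
  // Stand-in for a real ActiveDirectory user/group lookup: require an email-like owner
  if (dp.dataProductOwner.matches(""".+@.+\..+""")) None
  else Some(PolicyViolation("owner-format", s"'${dp.dataProductOwner}' is not a recognizable user reference"))

def checkOutputPorts(dp: DataProductDescriptor): Option[PolicyViolation] =
  // Simplified stand-in for "there must be at least 1 endpoint": require at least one output port
  if (dp.outputPorts.nonEmpty) None
  else Some(PolicyViolation("output-ports", "at least one output port is required"))

// A deployment pipeline could block the release whenever this list is non-empty.
def evaluate(dp: DataProductDescriptor): List[PolicyViolation] =
  List(checkDescriptionLength(dp), checkOwnerFormat(dp), checkOutputPorts(dp)).flatten
```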

As I said, this specification is evolving and can surely be improved. Being open source, every contribution is welcome.

Wrapping up

How can a Data Mesh journey reach scale? With automation that guarantees out-of-the-box compliance of Data Products with the Data Mesh principles.

If you’re interested in other articles on Data Engineering topics, sign up for our newsletter to stay tuned. And don’t hesitate to contact us if you need to address specific issues.

Thank you for reading.

Author: Roberto Coluccio, Agile Lab Data Architect

Meet us at the Politecnico di Milano Career Day 2022

We’re excited to be at the next Career Day by Politecnico di Milano, on 9th and 10th May 2022! It will take place both in person, at the Campus Bovisa, and online through an ad hoc platform!

During the two days, you will have the chance to meet us, in person or virtually, at our booth, to get to know our team and learn more about our jobs, activities, and life at Agile Lab.

Are you ready? Registration closes on May 5th.

Reserve your seat now!