Blog - Agile Lab

Data Catalogs: The Foundation of Data Platform Success — Part 1

Written by Pietro Sottile | Feb 12, 2025 8:51:53 AM

A good data catalog implementation is the keystone of any successful data platform project. Do it wrong and live to regret it. Unfortunately, often enough, this becomes apparent only in the later stages of platform development, when adoption struggles to take off because users cannot find (or trust) the data they need.

The pain is real. So much so that the market proposes new products every year, each promising to be the fix for all your issues, which makes choosing the right one increasingly difficult.

In this article series, I’ll walk you through the main concepts around working with catalogs, build some foundational knowledge on the topic, and later provide some methodology to make an informed choice when it comes to tool selection.

In this first episode of the series, I’ll go through some of the foundational concepts that will help you navigate your first steps in working with catalogs.

 

1. Why do data platforms need catalogs?

The core capability of a catalog is data discovery: finding out whether and where the data you need exists. This is made possible by collecting metadata that describes each asset in your enterprise datascape. In its purest form, a catalog is fundamentally a metadata repository with a search engine on top.
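Stripped to its essence, that idea can be sketched in a few lines of Python. All names and fields below are purely illustrative, not from any real product:

```python
from dataclasses import dataclass, field

@dataclass
class AssetMetadata:
    """A minimal metadata record describing one data asset."""
    name: str
    source_system: str          # e.g. the database or platform hosting the asset
    description: str = ""
    tags: list[str] = field(default_factory=list)

class Catalog:
    """A metadata repository with a naive keyword search on top."""
    def __init__(self):
        self._assets: list[AssetMetadata] = []

    def register(self, asset: AssetMetadata) -> None:
        self._assets.append(asset)

    def search(self, keyword: str) -> list[AssetMetadata]:
        kw = keyword.lower()
        return [
            a for a in self._assets
            if kw in a.name.lower()
            or kw in a.description.lower()
            or any(kw in t.lower() for t in a.tags)
        ]

catalog = Catalog()
catalog.register(AssetMetadata("sales_orders", "postgres-prod",
                               "Daily order snapshots", ["sales", "orders"]))
catalog.register(AssetMetadata("web_clicks", "kafka", "Clickstream events"))

# Finding data = searching the metadata, not the data itself
assert [a.name for a in catalog.search("sales")] == ["sales_orders"]
```

Real catalogs replace the linear scan with a proper search index, but the shape of the problem is the same: collect descriptive records, then make them findable.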

Sounds simple enough, right? Kinda.

An enterprise datascape without a catalog is a lot like a territory without a map: it might be full of precious resources, but they will be borderline impossible to find.

Typically, in an enterprise context, the data assets live on different databases that span many different technologies. Metadata collection can be very challenging because you need to deal with all these data sources with no guarantee that the information extracted from them is homogeneous.

Collecting metadata by itself is not enough, though: the effectiveness of finding data heavily depends on how well the metadata within the catalog is structured. While there is no universal way to structure metadata in a catalog (every organization is different and has different goals), there are a few universal concepts that can help navigate these waters.

Metadata is structured by means of relationships. Relationships can be of different natures, but they fall into three categories:

Vertical

In a domain-oriented platform (such as a data mesh), vertical relationships exist between a domain, its subdomains, and every asset they contain. In a more classic context, there could be a vertical relationship between a department (or a project) and a set of assets.

Horizontal

This is what is more commonly called lineage. It tracks where an asset originates from and which other assets it contributes to generating. Horizontal relationships also provide details about the nature of the transformations that happen as data flows across different enterprise systems and domains. Lineage can be tracked at different levels of granularity, depending on the platform’s capabilities and needs (e.g. lineage between Data Products, or lineage between columns within the output ports of Data Products).

Semantic

It bridges technical (often obscure) names with meaningful, ubiquitous business concepts, enabling end users to exploit technical assets autonomously. This overlay of relationships is often the most overlooked because it does not stem directly from the technical components and requires a concerted effort from both business and technical personnel.
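The three categories can be captured with the same simple structure: typed edges between catalog items. Here is a minimal sketch (relationship names and assets are illustrative), including a traversal that answers the classic lineage question "what is downstream of this asset?":

```python
from collections import defaultdict

# Each relationship is a (source, kind, target) triple. The `kind` encodes
# the category: "contains" = vertical, "feeds" = horizontal (lineage),
# "means" = semantic. All names below are illustrative.
relationships = [
    ("sales_domain", "contains", "raw_orders"),     # vertical
    ("sales_domain", "contains", "daily_revenue"),  # vertical
    ("raw_orders",   "feeds",    "daily_revenue"),  # horizontal (lineage)
    ("cust_tbl_v2",  "means",    "Customer"),       # semantic: tech name -> concept
]

def downstream(asset: str) -> set[str]:
    """Follow horizontal (lineage) edges to find every derived asset."""
    graph = defaultdict(list)
    for src, kind, dst in relationships:
        if kind == "feeds":
            graph[src].append(dst)
    seen, stack = set(), [asset]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

assert downstream("raw_orders") == {"daily_revenue"}
```

The point is not the implementation but the uniformity: once all three overlays live in one structure, questions that cross categories ("which business concepts are fed by this domain?") become graph traversals.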

To be effective and scalable, the platform must be able to build and evolve this structure automatically based on metadata collection.

While the main goal of a catalog is to find data, in modern platforms the catalog is also an enabler for Data Governance and Data Observability, which are the main drivers in obtaining the Holy Grail of our field: data quality.

The actual reason why catalogs are of utmost importance

 

2. The catalog model

Now that we understand how important it is for metadata to be well-organized and interrelated, we need to know how to define the structure into which we are going to fit the metadata.

This structure is what I call the catalog model, and it is essentially a graph of all the items (nodes) represented in the catalog and the relationships between them (edges).

A very simple example of a catalog model is the one represented in the picture above. Here, the Sales and Marketing domains each contain one data asset (in green) and one data process (in purple).

 

Notice also that the items are connected by relationships with specific meanings, and that a relationship is not meaningful if applied between the wrong items (e.g. a process can consume from a data asset but not from another process).

All catalogs must internally represent the enterprise datascape in a similar fashion, but this can only be achieved when two other prerequisites are met:

  1. The ability to segregate processes and assets into their respective domains
  2. A metamodel that fits the purpose and necessities of the company

 

2.1 Domain Organization

Many companies make the mistake of mirroring their internal organization (e.g. departments and teams instead of domains and subdomains) into the catalog model, but this falls short every time the organization changes. Spoiler alert: it happens often!

As of 2025, Domain-Driven Design is the de facto standard for optimal modelling, but your company might have special requirements or constraints that force you to follow other approaches. Whatever the case, having a vertical structure that clusters the items of the catalog into business macro-areas is an unavoidable necessity and a prerequisite for most of the concepts that follow. For simplicity, I will always refer to the units composing the vertical structure underlying a data platform as “domains”, whether domain-driven design principles have been applied or not.

An example of domain organization

 

2.2 The Catalog Metamodel

If the catalog model is like a map of all the enterprise assets, the metamodel is the rulebook according to which that map is drawn.

The catalog model is composed of specific instances of assets, processes and relationships. All of these constitute a snapshot of the enterprise datascape at a certain moment in time.

The catalog metamodel is composed of the classes of objects that can be represented within a model and the ways they can relate to each other.

 

A metamodel defines:

  • The different kinds of items that can be represented in the catalog. In our example these are Domain, DataProcess and DataAsset, but there could be many more

  • The relationships that can exist between items

A metamodel (sometimes also called an ontology) can have any level of complexity, and it tends to grow extremely fast as item types are added. A well-thought-out, complex metamodel is more expressive than a well-thought-out but simple one; this added expressiveness comes at the price of increased cognitive load for platform users and maintainers. Determining the right trade-off between the value brought by the metamodel’s expressiveness and the incurred cost is a delicate job that falls to the Platform Data Architect.
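In code, a metamodel boils down to two whitelists: the item types, and the legal (source type, relationship, target type) combinations. Here is a minimal sketch using the three item types from our example; the relationship names are illustrative, not from any vendor:

```python
# The metamodel: which item types exist, and which relationships are
# legal between which types.
ITEM_TYPES = {"Domain", "DataProcess", "DataAsset"}

ALLOWED_RELATIONS = {
    ("Domain",      "contains", "DataAsset"),
    ("Domain",      "contains", "DataProcess"),
    ("DataProcess", "consumes", "DataAsset"),
    ("DataProcess", "produces", "DataAsset"),
}

def validate_edge(src_type: str, relation: str, dst_type: str) -> bool:
    """Reject relationships the metamodel does not allow."""
    return (src_type, relation, dst_type) in ALLOWED_RELATIONS

# A process may consume from a data asset...
assert validate_edge("DataProcess", "consumes", "DataAsset")
# ...but not from another process: the metamodel forbids it.
assert not validate_edge("DataProcess", "consumes", "DataProcess")
```

Every new item type multiplies the combinations the architect must decide on, which is exactly why expressiveness and cognitive load grow together.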

Catalog metamodeling is an activity that is usually either completely ignored or misunderstood. This is unfortunate, because a good (and flexible) metamodel can make all the difference in building an effective catalog. While ideally the platform and catalog ontologies should match, it can make sense (or be unavoidable) to use a different, simplified catalog metamodel. One possible reason is that not all catalog vendors offer metamodel flexibility; in that case, you should check whether your enterprise metamodel can fit into the vendor’s.

 

2.2.1 Knowledge graphs

When exploring the market for a catalog tool, you are going to encounter products claiming to be built on top of a Knowledge Graph (KG).

While this is sometimes true, other times it is just the result of a genuine (or not so genuine) misunderstanding of what a KG-based catalog should be able to do.

KG-based catalogs have the undisputed advantage of allowing limitless flexibility in their metamodel definition. On top of that, any item modelled as a node is searchable, and the whole structure can be visualized (they are graphs, after all).

It is the union of:

  • Total modelling flexibility

  • Total searchability

  • The ability to plot both the catalog model and metamodel

that makes a catalog truly KG-based.
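The distinguishing trick of a KG is that the model and the metamodel live in the same store and are queried with the same machinery. A toy sketch (all names illustrative) using plain triples makes this concrete:

```python
# In a KG-based catalog, metamodel-level and model-level facts sit in the
# same triple store, so both can be searched and plotted uniformly.
triples = [
    # metamodel level: what kinds of edges are allowed
    ("DataProcess",   "canConsume", "DataAsset"),
    # model level: the actual instances
    ("ingest_orders", "isA",        "DataProcess"),
    ("raw_orders",    "isA",        "DataAsset"),
    ("ingest_orders", "consumes",   "raw_orders"),
]

def query(s=None, p=None, o=None):
    """Pattern-match triples; None acts as a wildcard (a mini SPARQL)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# One query language answers both "what processes exist?" (model)...
assert query(p="isA", o="DataProcess") == [("ingest_orders", "isA", "DataProcess")]
# ...and "what may a process consume?" (metamodel).
assert query(s="DataProcess", p="canConsume") == [("DataProcess", "canConsume", "DataAsset")]
```

A catalog that only exposes a fixed relational schema behind a graph-looking UI cannot do this, which is a quick way to probe vendor claims.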

 

2.2.2 Open standards

Over the last decades, much work has been done to define common standards supporting metadata and metamodel definitions. Knowing these standards is especially important if your data platform is supposed to share data with a wide audience. In any case, exploring them can help you avoid common pitfalls and gives you a wider perspective on what you actually need and what kind of flexibility you will need in the future.

Here’s an incomplete list of very popular standards:

  • DCAT : particularly useful for open data and FAIR (Findable, Accessible, Interoperable, Reusable) data principles

  • CKAN: Less comprehensive and semantically rich than DCAT, but widely used in open data portals such as Data.gov and the UK Data Service

  • DDI: A domain-specific standard specialized for survey and social science data
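To make the first entry less abstract, here is roughly what a single dataset description looks like when it follows the DCAT vocabulary, serialized as JSON-LD. This is a simplified sketch with illustrative values, not a complete or validated DCAT record:

```python
import json

# A minimal dataset description loosely following the W3C DCAT vocabulary.
# Titles, keywords, and the distribution are illustrative.
dcat_dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct":  "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Daily sales orders",
    "dct:description": "Snapshot of confirmed orders, refreshed daily.",
    "dcat:keyword": ["sales", "orders"],
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:mediaType": "text/csv",
    },
}

print(json.dumps(dcat_dataset, indent=2))
```

Note how the standard separates the dataset (the logical asset) from its distributions (the concrete files or endpoints), a distinction worth preserving in your own metamodel even if you never publish open data.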

3. Conclusion

In this article, we discussed how and why data catalogs are decisive for the success of a data platform initiative, and covered the foundational knowledge necessary to start working with them.

In the next article, we will discuss the most common patterns for bringing metadata into the catalog, along with their advantages and shortcomings.