A good data catalog implementation is the keystone of any successful data platform project. Do it wrong and live to regret it. Unfortunately, often enough, this becomes apparent only in the later stages of platform development, when adoption struggles to take off because users cannot find (or trust) the data they need.
The pain is real. So much so that the market offers new products every year, each promising to be the definitive fix for all your issues, and choosing the right one is becoming increasingly difficult.
In this article series, I’ll walk you through the main concepts around working with catalogs, build some foundational knowledge on the topic, and later provide some methodology to make an informed choice when it comes to tool selection.
In this article specifically, we’ll explore the main strategies that can be adopted to bring metadata into the catalog, with a focus on the advantages and drawbacks of each one. The idea is to provide the reader with all the tools needed to define the best mix of techniques for their platform endeavour.
Here's Part 1 if you haven't seen it or just want a quick refresh.
Once you know how the metadata is going to be organized within the catalog metastore, it’s time to understand how to bring metadata in.
The possible approaches fall into two main categories: Pull and Push. In this article, we will go through some of the many nuances of both paradigms, but this is in no way exhaustive. Almost every combination of these approaches could make sense in the right context, so these architectures are intended as archetypes from which you can derive your own.
Traditionally, catalogs work with a pull approach, which means they establish connections with the different data sources composing the datascape and crawl the metadata of all the assets they find.
Data sources can rely on many different technologies; big enterprises usually have a great deal of them, spanning decades of evolution in the IT industry. When using a pull paradigm, the catalog will need a dedicated connector for each of these technologies.
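To make the crawl mechanics concrete, here is a minimal sketch in Python (using SQLAlchemy) of what a pull connector does against a relational source. The connection URI and the shape of the returned records are assumptions; every catalog product defines its own internal entity model.

```python
# A minimal sketch of a pull-style crawl over a relational source.
# The connection URI and the record shape are assumptions; real connectors
# are provided (or extended) by the catalog product itself.
from sqlalchemy import create_engine, inspect

def crawl_relational_source(connection_uri: str) -> list[dict]:
    """Walk the source's schemas and return asset metadata records."""
    engine = create_engine(connection_uri)
    inspector = inspect(engine)

    assets = []
    for schema in inspector.get_schema_names():
        for table in inspector.get_table_names(schema=schema):
            columns = inspector.get_columns(table, schema=schema)
            assets.append({
                "qualified_name": f"{schema}.{table}",
                "type": "table",
                "columns": [
                    {"name": c["name"], "data_type": str(c["type"])}
                    for c in columns
                ],
            })
    return assets

# The catalog's scheduler would periodically run something equivalent to:
#   assets = crawl_relational_source("postgresql://user:pwd@host/db")
# and reconcile the result against what is already stored in its metastore.
```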
Different catalog vendors support more or fewer connectors, but it’s important to understand that no vendor will ever support every technology out there. Whenever a technology is not covered by a standard connector, there are a couple of alternatives that can be used.
Catalogs that allow for custom plugins can help you deal with very old or niche technologies used by your enterprise, but this does not scale well, since you’ll have to pay for the development and maintenance of the plugin.
If the data source exposes a suitable API layer and the catalog is extensible, you can create a custom connector that relies on the data source APIs to retrieve the asset metadata. The problem with this solution is that the plugin won’t be supported by the vendor, and you will have to go through an expensive process of development and testing to ensure the fidelity of the extracted metadata.
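A connector of this kind often boils down to two steps: extract metadata from the source’s API and translate it into the catalog’s entity model. The sketch below assumes hypothetical endpoints and field names on both sides, purely for illustration.

```python
# A minimal sketch of a custom pull connector that relies on the source's
# own HTTP API. Both endpoints and all field names are hypothetical; an
# actual implementation depends entirely on the two products involved.
import requests

SOURCE_API = "https://legacy-system.internal/api/v1"   # assumed source API
CATALOG_API = "https://catalog.internal/api/entities"  # assumed catalog API

def sync_legacy_assets() -> None:
    # 1. Retrieve asset metadata from the unsupported source.
    assets = requests.get(f"{SOURCE_API}/datasets", timeout=30).json()

    # 2. Map each asset to the catalog's entity model and push it.
    for asset in assets:
        entity = {
            "qualifiedName": asset["id"],
            "name": asset["display_name"],
            "type": "dataset",
            "columns": asset.get("fields", []),
        }
        requests.post(CATALOG_API, json=entity, timeout=30).raise_for_status()
```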
Whenever the catalog goes through a major version upgrade, there is a risk of breaking changes, which means further maintenance costs and/or unexpected disruptions.
If the catalog tool you are using is an open-source project and your custom connector can be incubated into the main project (and benefit from the maintenance of the entire community), this option becomes much more convenient.
A last-resort trick to avoid custom plugins is to mirror an unsupported source into a supported one.
Another possible workaround is to mirror the source into a read-only data store based on a supported technology; the mirrored data store is then used as the source for the catalog. The drawback of this option is that you will need to set up a mirroring infrastructure and an additional data store, and costs might be prohibitive for very large DBs. In general, duplicating data is not a best practice, so this option should be considered a last resort for extreme cases where all other alternatives are off the table.
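As a rough illustration of the mirroring workaround, the sketch below copies a handful of tables from a legacy source into a read-only Postgres mirror that the catalog can then crawl with a standard connector. The legacy driver, connection strings and table list are all assumptions.

```python
# A minimal sketch of the mirroring workaround: copy assets from an
# unsupported source into a read-only store built on a supported technology
# (Postgres in this example), which the catalog crawls with a standard connector.
import pandas as pd
from sqlalchemy import create_engine
from legacy_driver import connect as legacy_connect  # hypothetical driver

mirror = create_engine("postgresql://user:pwd@mirror-host/mirror")  # assumed

with legacy_connect(host="old-system", db="db") as conn:   # assumed API
    for table in ["customers", "orders"]:                  # assumed table list
        df = pd.read_sql(f"SELECT * FROM {table}", conn)
        # Overwrite the mirrored copy; the catalog only ever reads from `mirror`.
        df.to_sql(table, mirror, if_exists="replace", index=False)
```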
More recently, some catalogs have started supporting metadata being pushed from the sources towards them. This approach is very interesting for modern data platforms because it allows for better governance and metadata freshness.
The feeding architecture can be based on a streaming platform, where a metadata change in the data source triggers the generation of an event that is sent into an event log (such as a Kafka topic). The catalog, being subscribed to that event log, consumes each event and updates its metastore accordingly.
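On the producing side, this can be sketched in a few lines, assuming Kafka as the event log; the topic name and the event schema are assumptions and would have to match whatever the catalog expects to consume.

```python
# A minimal sketch of the event-producing side, assuming Kafka and the
# confluent-kafka client. Topic name and event schema are assumptions.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

def publish_metadata_change(asset_id: str, change: dict) -> None:
    event = {
        "asset": asset_id,
        "change_type": change["type"],   # e.g. "schema_updated"
        "payload": change["payload"],    # the new or changed metadata
    }
    # Key by asset so all changes to the same asset stay ordered.
    producer.produce("metadata-changes", key=asset_id,
                     value=json.dumps(event).encode("utf-8"))
    producer.flush()
```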
Setting up such an infrastructure for all data sources can be expensive, so this is usually done only if justified by functional constraints (such as ultra-fresh metadata). Furthermore, the catalog product needs to support metadata updates via subscription to a topic.
In a data platform with strong computational governance, source metadata cannot be changed without going through a pipeline of governance checks. For platforms of this kind, the push mechanism is a very good choice because it allows bringing into the catalog only compliant metadata (and, more generally, only compliant assets into the platform).
For such an architecture to be feasible, the catalog needs to expose an API layer that allows metadata to be pushed. The same architecture can be combined with the streaming approach.
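A minimal sketch of such a governance gate, assuming a generic REST endpoint on the catalog side, could look like this; the required fields and the URL are assumptions, and real checks are usually far richer (naming conventions, classification, quality thresholds, and so on).

```python
# A minimal sketch of a governance gate in front of a push-based catalog.
# The endpoint URL and required fields are assumptions for illustration.
import requests

CATALOG_API = "https://catalog.internal/api/entities"  # hypothetical endpoint
REQUIRED_FIELDS = {"qualifiedName", "owner", "domain", "sensitivity"}

def push_if_compliant(entity: dict) -> None:
    missing = REQUIRED_FIELDS - entity.keys()
    if missing:
        # Non-compliant metadata never reaches the catalog (nor the platform).
        raise ValueError(f"Governance check failed, missing fields: {missing}")
    requests.post(CATALOG_API, json=entity, timeout=30).raise_for_status()
```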
The metadata ingestion patterns we have discussed so far are mainly technical and come directly from the data source. However, catalogs may also contain metadata that is either manually added or derived from what comes from the source. I call this class of metadata enriching metadata.
Common examples are sensitivity and confidentiality classifications, quality scores, ownership info, tags, consumer appreciation scores and many more.
Enriching metadata can be obtained via automatic workflows triggered within the catalog itself (if supported), or externally via webhooks. In other cases, this metadata can be manually added by the data asset owner, but for this to be effective, a process needs to be established and enforced across the entire enterprise. This might prove more challenging than creating automatic routines, so automatic management should be favoured whenever possible.
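As an example of an automatic enrichment routine, the sketch below derives a coarse sensitivity classification from column names and attaches it to the asset as a tag through a hypothetical catalog API; the keyword list and endpoint are assumptions.

```python
# A minimal sketch of an automatic enrichment routine: derive a sensitivity
# classification from column names and attach it as a tag. The keyword list
# and the catalog endpoint are assumptions for illustration.
import requests

CATALOG_API = "https://catalog.internal/api/entities"  # hypothetical endpoint
SENSITIVE_HINTS = ("ssn", "email", "phone", "iban", "birth")

def classify_and_tag(entity: dict) -> None:
    column_names = [c["name"].lower() for c in entity.get("columns", [])]
    sensitive = any(hint in name for name in column_names
                    for hint in SENSITIVE_HINTS)
    tag = "confidential" if sensitive else "internal"
    requests.patch(
        f"{CATALOG_API}/{entity['qualifiedName']}",
        json={"tags": [tag]},
        timeout=30,
    ).raise_for_status()
```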
When it comes to choosing a feeding approach, there is no silver bullet: having a unified feeding mechanism across the entire platform is generally desirable but not always possible. Designing the best solution means taking the following drivers into consideration:
A summary of how PULL vs PUSH compare on the main drivers
In this article, I showed the main approaches that can be used to feed a catalog. As previously said, this overview is by no means exhaustive, but it can serve as guidance for whoever is embarking on the journey of integrating a catalog into their enterprise data ecosystem.
In Part 3, we are going to discuss how a catalog can be leveraged to add a Semantic Layer to your datascape.