A good data catalog implementation is the keystone of any successful platformization project. Do it wrong and live to regret it. Unfortunately, often enough, this becomes apparent only in the later stages of platform development, when adoption struggles to take off because users cannot find (or trust) the data they need.
The pain is real. So much so that the market offers new products every year, each promising to be the right fix for all your issues, and choosing the right one is becoming increasingly difficult.
In this article series, I’ll walk you through the main concepts around working with catalogs, build some foundational knowledge on the topic, and later provide some methodology to make an informed choice when it comes to tool selection.
This article is part of a wider series on data catalogs.
Here's Part 1 if you haven't seen it or just want a quick refresh.
Here's Part 2 if you haven't seen it or just want a quick refresh.
Data assets (and the catalog itself) are created and maintained by technical people who are largely unaware of the business relevance of the data they own.
At the same time, this data is consumed by business people, for whom the technicalities associated with the assets are meaningless and confusing.
Even worse, business people from different areas can (and do) use the same terminology with different meanings, resulting in misunderstandings in the best-case scenario and (unknowingly) unsound analysis in the worst.
Demand and supply meet easily when merchandise can be put on display.
Just like an everyday street market, the platform catalog is designed to match demand and supply.
And just like a market, a platform is frequented by people speaking all kinds of different languages (jargons). Street markets get around language barriers by directly displaying the merchandise up for sale: sight enables communication between buyers and sellers. A data market, though, is different; verbal language is the only possible enabler of communication, so bridging the jargon gap between producer and consumer is an absolute necessity.
In a data market, how is a consumer supposed to ask the producer for the data asset they need? How is the consumer even supposed to know whether a producer has what they need, or whether the asset they need exists at all?
A semantic layer is an abstraction layer sitting between actual data and data consumers that translates data assets into terms that are meaningful to their target audience.
Providing a list of business terms to data producers enables them to map appropriate terms to data assets right when they are produced.
Semantic layers can take different forms; until recently, a common way to implement them was by defining data marts or OLAP cubes. These work well for enabling analysis in use cases you know in advance, but they can hardly be considered enablers of data discovery.
In more recent years, the industry shifted toward defining a network of well-defined terms stored in central repositories (glossaries) and using them as tags for data assets. This seemingly simple approach can be pretty complex to set up and maintain, but, if done right, it is well worth the effort.
A business glossary is a repository of business terms.
Business terms are specific to a business area, or domain, and their meaning can vary considerably across different domains.
This is very important since it tells us that a domain-oriented organization is beneficial for the definition of a business glossary. If you are building a data mesh, this is good news since your datascape will already be organized in domains.
Every domain must map its jargon into a glossary. Terms not specific to a subdomain are inherited from the parent domain. A subdomain can have a more stringent, or orthogonal, definition of the same term.
Terms within a domain must be unique, that is, unambiguous. Term collisions might be a symptom that a domain is either too extensional (its perimeter extends over multiple domains) or too intensional (it includes both generic and very specific definitions of concepts). In the first case, the solution could be to split the domain into two (same-level) domains; in the latter, to create a new (more specific) subdomain.
The federation of all domain glossaries constitutes the global glossary.
In their simplest form, business terms are just a label and a definition. For example:
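As a minimal sketch (the term and its wording are made up for illustration, not taken from any real glossary), such a term can be nothing more than a label/definition pair:

```python
# A minimal business term: just a label and a definition.
# Both the term and its wording here are hypothetical.
term = {
    "label": "Active Customer",
    "definition": "A customer with at least one purchase in the last 12 months.",
}

print(f"{term['label']}: {term['definition']}")
```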
While having this kind of mapping between words and meanings can improve the consumers' understanding of the data, it won’t be enough to improve data discoverability. This is because users cannot be expected to always know in advance the exact term associated with a concept within a domain. Users will often use variant terms (synonyms) or related terms, and an effective catalog must be able to suggest meaningful search results in these cases as well.
For each term, we could even specify a list of broader and narrower terms, adding a hierarchical dimension that goes from general to specific.
Our previous example could then be modelled in a way similar to this:
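One way to sketch such an enriched term (all field names and values are hypothetical) is:

```python
# A business term enriched with synonyms, related terms, and a
# broader/narrower hierarchy. All values here are made up.
term = {
    "label": "Active Customer",
    "definition": "A customer with at least one purchase in the last 12 months.",
    "synonyms": ["Live Customer"],
    "related_terms": ["Churn Rate"],
    "broader_terms": ["Customer"],          # more general concepts
    "narrower_terms": ["Repeat Customer"],  # more specific concepts
}
```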
Or more generally:
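A generic shape for any business term could look like the following (field names are illustrative, not taken from any specific catalog tool):

```python
from dataclasses import dataclass, field

# Generic schema for a business term: a label, a definition, and the
# relationships (synonyms, related, broader, narrower) that connect it
# to the rest of the glossary.
@dataclass
class BusinessTerm:
    label: str
    definition: str
    synonyms: list[str] = field(default_factory=list)
    related_terms: list[str] = field(default_factory=list)
    broader_terms: list[str] = field(default_factory=list)
    narrower_terms: list[str] = field(default_factory=list)
```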
A proper semantic layer is not just a list of business term definitions; it also maps out the existence and nature of the relationships between terms.
A glossary where these relationships are consistently established is the foundation of a knowledge graph. Knowledge graphs are at the heart of most search engines and recommendation systems, and will probably play a crucial role in data catalogs in the future. We can already see a trend in this direction, but not all catalog tools allow for such a complex definition of the glossary.
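To see why these relationships matter for discovery, here is a toy sketch of a search expansion that follows synonym and narrower-term links so that a query for a general term also surfaces assets tagged with related ones (the glossary content is invented):

```python
# A tiny glossary fragment; labels and links are hypothetical.
GLOSSARY = {
    "Customer": {"synonyms": ["Client"], "narrower": ["Active Customer"]},
    "Active Customer": {"synonyms": [], "narrower": []},
}

def expand_query(term: str, glossary: dict) -> set[str]:
    """Return the term plus every term reachable via synonym/narrower links."""
    seen, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        entry = glossary.get(current, {})
        frontier.extend(entry.get("synonyms", []))
        frontier.extend(entry.get("narrower", []))
    return seen

print(expand_query("Customer", GLOSSARY))
```

A search for "Customer" would then match assets tagged with "Client" or "Active Customer" as well, which is exactly the kind of traversal a knowledge graph makes cheap.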
Glossaries can be governed or free.
A free glossary is very similar to the hashtag system on social networks; anyone can create a new one or use an already existing one. They are all indexed for search, but there is no relationship between them. Free glossary items are basically tags.
Free glossaries are an important feature for a semantic layer because they add flexibility to the process of tagging assets with the right terms as soon as possible.
A catalog that supports free glossaries can enable consumers, or a restricted set of user personas, to add tags to data assets on the fly to improve search collaboratively. These tags can later become official business terms, or be used temporarily to mark assets during periods with special needs.
While free glossaries are populated directly by data producers, maintaining domain glossaries requires some work.
Glossary ownership often falls on the shoulders of Domain Owners, who are supposed to be the best positioned to organize a clear view of the domain terminology. Of course, to make this feasible, glossary definition activities need to be distributed among the domain members, with the domain owner providing only central governance.
This is made possible by defining a business term lifecycle (and the related automation).
The lifecycle must allow business terms to be created and modified in a distributed fashion, while restricting the validation and publication of those terms into the glossary to a limited set of people.
It should also be possible to deprecate and eventually dismiss business terms without causing disruption.
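A lifecycle like this can be sketched as a small state machine (the state names, transitions, and the "steward" role are all assumptions for illustration):

```python
# Hypothetical business-term lifecycle: anyone in the domain can draft or
# revise a term, but only designated stewards can publish, deprecate, or
# dismiss it.
ALLOWED_TRANSITIONS = {
    "draft": {"in_review"},
    "in_review": {"draft", "published"},
    "published": {"deprecated"},
    "deprecated": {"dismissed"},
}
STEWARD_ONLY = {"published", "deprecated", "dismissed"}

def transition(current: str, target: str, is_steward: bool) -> str:
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move a term from {current!r} to {target!r}")
    if target in STEWARD_ONLY and not is_steward:
        raise PermissionError(f"only a steward can move a term to {target!r}")
    return target

state = transition("draft", "in_review", is_steward=False)   # any member
state = transition(state, "published", is_steward=True)      # steward only
```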
The semantic layer is only useful if data assets are associated with their respective business terms. This association cannot be done automatically, so a process needs to be established.
Association between business terms and data assets can happen at deployment time, through an automated pipeline, or at any later time, by enriching the assets directly from the catalog UI. Generally, the latter is preferred, since asset deployment is performed by technical people who may not know which terms should be associated with an asset.
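Both association paths can be sketched as follows (the catalog structure, function names, and asset/term names are hypothetical):

```python
# A toy catalog mapping each asset name to its associated business terms.
catalog: dict[str, set[str]] = {}

def register_asset(name: str, declared_terms: list[str] = ()) -> None:
    """Deployment-time path: terms can be declared in the pipeline config."""
    catalog[name] = set(declared_terms)

def enrich_asset(name: str, term: str) -> None:
    """Later path: enrichment from the catalog UI by a domain expert."""
    catalog[name].add(term)

register_asset("sales.orders_fct")                    # deployed untagged
enrich_asset("sales.orders_fct", "Active Customer")   # enriched later
```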
Search won’t work well without a consistent semantic layer applied to the entire datascape. Setting governance policies that require data owners (or domain owners) to enrich assets with semantic metadata as soon as possible is a good way to keep metadata quality high and the catalog search effective.
As often happens, there is no general approach that can satisfy all cases. The governance team and the platform architect must carefully consider the needs and limitations of their platform and users and design this process around it.
In this article, I discussed the role of a semantic layer in a modern data platform and how a catalog is an enabler for that.
In the next article, we’ll discuss other benefits brought by catalog tools, such as data lineage, data quality, and many more.
Stay tuned!