Modeling a Data Platform with Data Platform Shaper

Written by David Greco | Sep 26, 2024 3:43:21 PM

Modeling a data platform

In a previous post, we presented our research project, Data Platform Shaper (DPS), which we used as a laboratory to investigate new ideas around flexible metadata catalogs. We tried to make DPS as flexible and expressive as possible during our investigation. A sound data platform design process that aims to introduce good engineering practices drives DPS development; this process could be quickly summarized in the following phases, as outlined in the previous post:

Gathering user requirements.
User requirements analysis and logical model definition.
Defining the logical assets inside the catalog.
Defining the compilation/transformation rules that produce the corresponding physical assets inside the catalog.
Defining the provisioning tasks for deploying physical assets on the target technologies.

This post shows a new DPS feature that allows us to better adhere to the above process.

Let’s suppose we want to build a data mesh. After going through the phases above, we would come up with the following overall model:

The figure above is a pictorial representation of a data mesh across different models.

The conceptual model defines the kind of assets a data mesh contains and their possible relationships. So, in the figure, the conceptual model includes the following assets:

DataProduct is the fundamental asset in a data mesh; the mesh itself can be considered a collection of connected data products. For the sake of simplicity, in our example, a DataProduct contains a list of OutputPort assets.
OutputPort with its variants: FileBasedOutputPort and TableBasedOutputPort.
FileBasedOutputPort conceptually models a data collection stored as a set of files in some storage location.
TableBasedOutputPort conceptually models a data collection exposed as an SQL table.

In DPS, the conceptual model can be expressed in terms of traits and their relationships. So, we can create a corresponding trait for each asset above.

Once the assets/traits are defined, we must determine their relationships.

A DataProduct has a hasPart (with its opposite partOf) relationship with OutputPort, which models that a data product contains a list of output ports.

A TableBasedOutputPort has a dependsOn relationship with a FileBasedOutputPort. This relationship models the typical situation where the computational layer, i.e., a SQL engine, is separated from the storage layer, i.e., object storage. In this context, we model it in such a way that a table is mapped on top of a collection of stored files using a specialized SQL engine.
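
To make the conceptual model more tangible, here is a minimal Python sketch of traits and their relationships as a small labeled graph. It is not the DPS API: the TraitGraph class, the INVERSE table, and the idea of recording partOf automatically whenever hasPart is declared are illustrative assumptions; only the trait and relationship names come from the model above.

```python
from collections import defaultdict

# Relationship names come from the post. Declaring partOf as the opposite
# of hasPart in a lookup table is an illustrative choice, not DPS code.
INVERSE = {"hasPart": "partOf", "partOf": "hasPart"}


class TraitGraph:
    """Directed, labeled links between trait names (hypothetical helper)."""

    def __init__(self):
        # (subject trait, relationship) -> set of object traits
        self._links = defaultdict(set)

    def relate(self, subject: str, relationship: str, obj: str) -> None:
        self._links[(subject, relationship)].add(obj)
        inverse = INVERSE.get(relationship)
        if inverse is not None:
            # Record the opposite direction so both can be navigated.
            self._links[(obj, inverse)].add(subject)

    def related(self, subject: str, relationship: str) -> set:
        return set(self._links.get((subject, relationship), set()))


conceptual = TraitGraph()
conceptual.relate("DataProduct", "hasPart", "OutputPort")
conceptual.relate("TableBasedOutputPort", "dependsOn", "FileBasedOutputPort")
```

With these two declarations, conceptual.related("OutputPort", "partOf") yields the DataProduct trait even though partOf was never declared explicitly, while dependsOn stays one-directional.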

Now, we can move to the logical model. We must define the actual types with their attributes, making the abstract concept described in the conceptual model concrete.

In DPS, the logical model is defined in terms of types associated with specific traits to determine their nature and behavior.

In this case, it’s pretty simple; we define the following types:

DataProductType is a type with a hasTrait relationship with the DataProduct trait. It’s a concrete type with the attributes name and domain.
FileBasedOutputPortType is a type with a hasTrait relationship with the trait FileBasedOutputPort. It’s a concrete type with the attribute name.
TableBasedOutputPortType is a type with a hasTrait relationship with the trait TableBasedOutputPort. It’s a concrete type with the attribute name.

In DPS, the relationships are automatically inherited from the traits, so we don’t need to define them explicitly in the logical model.
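
The inheritance of relationships from traits can be illustrated with a short Python sketch. Everything here is a simplified stand-in for DPS: EntityType, TRAIT_PARENT, and inherited_relationships are hypothetical names, and treating the two output port variants as sub-traits of OutputPort is an assumption about how "variants" are modeled.

```python
from dataclasses import dataclass

# Conceptual model: relationships are declared between traits only.
TRAIT_RELATIONSHIPS = [
    ("DataProduct", "hasPart", "OutputPort"),
    ("OutputPort", "partOf", "DataProduct"),
    ("TableBasedOutputPort", "dependsOn", "FileBasedOutputPort"),
]

# Assumption: each OutputPort variant is a sub-trait of OutputPort.
TRAIT_PARENT = {
    "FileBasedOutputPort": "OutputPort",
    "TableBasedOutputPort": "OutputPort",
}


@dataclass(frozen=True)
class EntityType:
    name: str
    trait: str          # target of the hasTrait relationship
    attributes: tuple   # attribute names only; value types are omitted


DATA_PRODUCT_TYPE = EntityType("DataProductType", "DataProduct", ("name", "domain"))
FILE_PORT_TYPE = EntityType("FileBasedOutputPortType", "FileBasedOutputPort", ("name",))
TABLE_PORT_TYPE = EntityType("TableBasedOutputPortType", "TableBasedOutputPort", ("name",))


def trait_lineage(trait: str) -> list:
    """Return the trait followed by its ancestors."""
    lineage = [trait]
    while lineage[-1] in TRAIT_PARENT:
        lineage.append(TRAIT_PARENT[lineage[-1]])
    return lineage


def inherited_relationships(entity_type: EntityType) -> set:
    """Relationships a type obtains from its trait, with no per-type declaration."""
    lineage = trait_lineage(entity_type.trait)
    return {(rel, obj) for subj, rel, obj in TRAIT_RELATIONSHIPS if subj in lineage}
```

In this sketch, TableBasedOutputPortType ends up with both dependsOn (from its own trait) and partOf (from OutputPort) without the logical model mentioning either.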

Once we define the logical model, we need to determine the physical model. The physical model represents how the data platform assets are implemented with a specific technology. In DPS, we defined a special relationship, mappedTo, to describe this kind of mapping.

For example, a FileBasedOutputPortType needs a place to store its files, so if the target technology is AWS, we could define a type S3FolderType with an attribute path and link FileBasedOutputPortType to S3FolderType through the mappedTo relationship.

Similarly, a TableBasedOutputPortType needs an SQL engine capable of creating a table on top of stored files; in the AWS ecosystem, we could choose Athena and create a specific type AthenaTableType with three attributes: dataPath, database, and table. Then, we can map TableBasedOutputPortType to AthenaTableType with the mappedTo relationship.

The mapping mechanism

In DPS, the mappedTo relationship is associated with the mapping feature. It is a ternary relationship that ties together a SourceType, a TargetType, and a mapper, with the following structure:

A SourceType is mapped to a TargetType, which means that every time we create an instance of the SourceType, an instance of TargetType is automatically created and linked to the source instance by a mappedTo relationship.

The mapper

The mapper is a particular instance of the TargetType whose attributes contain expressions that DPS uses to compute the attribute values of the target instance. During the mapping evaluation, DPS injects a reference to the source instance into the expression engine, so the mapper expressions can be defined in terms of the source instance's attribute values. DPS can also inject references to any instance reachable from the source instance, which makes it possible to compute attribute values that take the surrounding instances into account.
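
The mechanism described above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: Instance, MAPPINGS, and create_instance are hypothetical names, and Python callables stand in for the expression strings that DPS actually evaluates. The mapper shown covers only an attribute that depends on the source instance alone.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Instance:
    type_name: str
    values: Dict[str, str]
    links: List[Tuple[str, "Instance"]] = field(default_factory=list)

    def get(self, attribute: str) -> str:
        return self.values[attribute]


# A mapper holds one expression per target attribute. Callables replace
# the DPS expression strings in this sketch.
Mapper = Dict[str, Callable[[Instance], str]]

# Source type name -> (target type name, mapper): the three parts of mappedTo.
MAPPINGS: Dict[str, Tuple[str, Mapper]] = {
    "TableBasedOutputPortType": (
        "AthenaTableType",
        {"table": lambda source: source.get("name")},
    ),
}


def create_instance(type_name: str, values: Dict[str, str]) -> Instance:
    """Create an instance and, if its type is mapped, the linked target instance."""
    source = Instance(type_name, dict(values))
    mapping = MAPPINGS.get(type_name)
    if mapping is not None:
        target_type, mapper = mapping
        # Each target attribute is computed by evaluating its expression
        # with the source instance injected.
        target_values = {attr: expr(source) for attr, expr in mapper.items()}
        source.links.append(("mappedTo", Instance(target_type, target_values)))
    return source


# "orders" is a made-up output port name.
port = create_instance("TableBasedOutputPortType", {"name": "orders"})
relationship, athena_table = port.links[0]
```

Creating the output port is the only explicit step; the AthenaTableType instance and the mappedTo link appear as a side effect, which is the behavior the mappedTo relationship is meant to capture.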

In our example, the mapper associated with the mappedTo relationship between FileBasedOutputPortType and S3FolderType contains only one expression attribute:

path: "dataproduct.get('domain') += '/' += dataproduct.get('name')"

In the expression above, dataproduct is an alias for the instance of DataProductType that can be reached through this path:

source/partOf/DataProductType

where source is the source instance in the mapping (an instance of FileBasedOutputPortType). This expression simply computes the folder path as <dataproduct_domain>/<data_product_name>.
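
To show what evaluating such an expression amounts to, here is a toy evaluator in Python. It is not the DPS expression engine: it only understands single-quoted literals and alias.get('attribute') terms joined by +=, which it treats as string concatenation, consistent with the result described above. The evaluate function, the context dictionary, and the sample values finance and invoices are all made up for illustration.

```python
import re

# A term is either a single-quoted literal or alias.get('attribute').
_TERM = re.compile(
    r"^(?:'(?P<literal>[^']*)'|(?P<alias>\w+)\.get\('(?P<attr>\w+)'\))$"
)


def evaluate(expression: str, context: dict) -> str:
    """Concatenate the values of terms joined by '+='."""
    parts = []
    for term in expression.split("+="):
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError(f"unsupported term: {term!r}")
        if match.group("literal") is not None:
            parts.append(match.group("literal"))
        else:
            # Look up the aliased instance, then the requested attribute.
            instance = context[match.group("alias")]
            parts.append(str(instance[match.group("attr")]))
    return "".join(parts)


# Hypothetical attribute values for the DataProductType instance.
context = {"dataproduct": {"domain": "finance", "name": "invoices"}}
path = evaluate(
    "dataproduct.get('domain') += '/' += dataproduct.get('name')", context
)
print(path)  # finance/invoices
```

The context dictionary plays the role of the injected references: each alias used in an expression must be bound to an instance before evaluation.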

The other mapper associated with the mappedTo relationship between TableBasedOutputPortType and AthenaTableType contains three expression attributes:

table: "source.get('name')"

The table name is simply the name of the TableBasedOutputPortType.

database: "dataproduct.get('domain') += '.' += dataproduct.get('name')"

The database name is defined as <dataproduct_domain>.<data_product_name>.

dataPath: "s3folder.get('path') += '/' += 'outputport' += '/' += source.get('name')"

The data path, where the table files are stored, is defined as <s3_folder_path>/outputport/<outputport_name>, where s3folder is an alias for the instance at the path:

source/dependsOn/FileBasedOutputPortType/mappedTo/S3FolderType

That path shows how the dependsOn relationship is used. In our example, an instance of TableBasedOutputPortType depends on an instance of FileBasedOutputPortType, since some attributes of its mapped target, such as dataPath, can only be derived from values reachable through the instance it depends on.
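
The alias path can be read as a walk over the instance graph, alternating relationship names and type names. The following Python sketch resolves such a path and then computes dataPath by hand. The Instance class, the resolve function, the rule that each step must match exactly one instance, and the sample values are assumptions made for this example rather than DPS behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Instance:
    type_name: str
    values: Dict[str, str]
    links: List[Tuple[str, "Instance"]] = field(default_factory=list)


def resolve(source: Instance, path: str) -> Instance:
    """Follow 'source/<relationship>/<TypeName>/...' and return the instance reached."""
    steps = path.split("/")
    if steps[0] != "source" or len(steps) % 2 == 0:
        raise ValueError(f"malformed path: {path}")
    current = source
    # After 'source', steps alternate between a relationship and a type name.
    for relationship, type_name in zip(steps[1::2], steps[2::2]):
        matches = [
            target
            for rel, target in current.links
            if rel == relationship and target.type_name == type_name
        ]
        if len(matches) != 1:
            raise LookupError(
                f"expected one {type_name} via {relationship}, found {len(matches)}"
            )
        current = matches[0]
    return current


# Hypothetical instances: a table port that depends on a file port,
# which is in turn mapped to an S3 folder.
s3 = Instance("S3FolderType", {"path": "finance/invoices"})
file_port = Instance("FileBasedOutputPortType", {"name": "raw"}, [("mappedTo", s3)])
table_port = Instance(
    "TableBasedOutputPortType", {"name": "orders"}, [("dependsOn", file_port)]
)

s3folder = resolve(
    table_port, "source/dependsOn/FileBasedOutputPortType/mappedTo/S3FolderType"
)
data_path = s3folder.values["path"] + "/outputport/" + table_port.values["name"]
print(data_path)  # finance/invoices/outputport/orders
```

Without the dependsOn link there would be no route from the table port to the S3 folder, and the dataPath expression could not be evaluated.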

The full example, particularly the file DataPlatformShape.yaml, contains all the DPS definitions for creating the model above, and it can be found in the project directory https://github.com/agile-lab-dev/data-platform-shaper/tree/main/examples/datamesh/simple.

Conclusion

Data Platform Shaper has been a fantastic project in which we had the opportunity to experiment with new ideas. Most of those ideas have been gradually incorporated into my company's flagship product, Witboost.

We got excellent feedback from the scientific community with our paper and from the Witboost development team; for us, this is an essential confirmation that a modern data platform can be effectively managed only by adopting a flexible metadata catalog capable of modeling the complexity of all the assets it’s based upon.
