
FinOps Architecture for Data Products


This article provides a practical reference architecture for managing costs in data management platforms built on the concept of data products.

FinOps is a discipline for controlling, monitoring, and regulating cost consumption in the cloud. The holistic vision of this paradigm is particularly complex to materialise in practice, although the market is starting to respond to the need for cost management with tools of varying maturity. As usual, practice matters more than technology: having tools alone does not build a foundation for cost management on a data platform.

That’s why I am committed to analyzing FinOps in the analytical plane to provide effective guidance for data platform architecture design.

Platform and Cost Allocation Unit

A platform is a tool(kit) that solves a class of problems through the minimum effort for a class of users.

In the data space, an Enterprise Data Platform implements a data management paradigm like DWH, Data Mesh, Lakehouse, etc.

Those paradigms solve a class of similar problems by providing capabilities to deploy elements like star schemas, data products, data sets, etc.

Those elements are the minimum deployable unit (architecture quantum) that can be composed to serve a use case. Data modeling plays a crucial role in reusability in any of those cases.

Cost Allocation Unit

Although the cost of an architecture quantum should be uniquely attributed to an owner, that is not always the case. In the context of FinOps, let’s name Cost Allocation Unit the minimum set of cloud resources whose cloud spending must be assigned to an owner. This set is not necessarily bounded by an architecture quantum.

Bounded Ownership Design Principle

The Bounded Ownership Design Principle (BODP) accounts for an architecture where data, metadata, software and infrastructure are bounded by ownership, including their cost consumption. Specifically in the context of FinOps, this means we want to aim for an architecture that makes it easy to control and monitor cost consumption with the right cost attribution.

Owned and shared resources

A Cost Allocation Unit can contain:

  • Owned resources: resources instantiated exclusively for this architecture quantum. Cost consumption resulting from those resources can be attributed directly to the owner.
  • Shared resources: resources shared among several architecture quanta. Cost consumption is not known explicitly, it must be derived by parceling the cost of the shared resources.
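To make the distinction concrete, here is a minimal sketch of a Cost Allocation Unit holding owned and shared resources. All names (`Resource`, `CostAllocationUnit`, the example resources and figures) are illustrative, not a prescribed data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    name: str
    monthly_cost: float      # cost reported by the cloud provider
    shared: bool = False     # True if used by several architecture quanta

@dataclass
class CostAllocationUnit:
    owner: str
    resources: List[Resource] = field(default_factory=list)

    def direct_cost(self) -> float:
        # Owned resources: spending is attributed to the owner as-is.
        return sum(r.monthly_cost for r in self.resources if not r.shared)

    def shared_resources(self) -> List[Resource]:
        # Shared resources: their cost must be parceled out separately.
        return [r for r in self.resources if r.shared]

cau = CostAllocationUnit(owner="sales-data-team", resources=[
    Resource("warehouse-cluster", 1200.0),
    Resource("kafka-cluster", 800.0, shared=True),
])
print(cau.direct_cost())  # 1200.0
```

Only the owned resource contributes to the direct cost; the shared Kafka cluster is left for a chargeback mechanism to handle.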

Cost attribution models

The architecture quantum does not necessarily match a Cost Allocation Unit. In fact, several models found in real situations relate an architecture quantum to a Cost Allocation Unit:

Figure 1: Cost Attribution Model

  • Broken cost attribution: the cost consumption of a single architecture quantum is split across different personas or organizational units. For instance, DAMA-based data governance requires a certain amount of stewardship in charge of curating the metadata of someone else's data. This means a single architecture quantum is typically split at least into metadata, data, infrastructure, and software (systems). The cost of metadata curation (the Enterprise Data Catalog) is not attributed to the producing data team but is most likely charged to IT. Similar considerations hold for data, software, and infrastructure.

  • Ownership bounded cost attribution: a single, clear ownership is attributed to a single architecture quantum. This is rare, but it should be the rule, since it provides maximum flexibility and clear ownership. A well-done data product design should favor this cost attribution model.

  • Blurred ownership cost attribution: a single Cost Allocation Unit is responsible for the cost of multiple architecture quanta. With this model, costs cannot be parceled out per architecture quantum.

Conflicts of ownership

The ownership of data, of systems (infrastructure + software), and of costs does not necessarily coincide.
On the contrary, costs are usually attributed to IT departments, while curation of data quality, data semantics, and lineage is assigned to stewardship roles implementing the DAMA framework. This configuration does not attribute costs directly to data stewardship roles, so they are perceived as customers.

If you consider a Data Mesh implementation, you would probably want to bind data product teams to full ownership of systems, data, and metadata, also including operational expenses. Without such binding, you end up in the typical broken cost attribution model. In general, an ownership model should account for data, metadata, systems, and their TCO.

Figure 2: Traditional Data Management vs Data Mesh ownership models (high level)

This means that we must specify which kind of ownership we are talking about: ownership of things or ownership to run things.

Infrastructure breakpoint

The ownership-bounded cost attribution model should aim at exclusive cloud resources for an architecture quantum. However, this is not always possible. One of the strongest limitations is given by the infrastructure breakpoint. As the number of architecture quanta grows, we should always check which hard limits we are hitting in the cloud infrastructure because they are going to determine the maximum population of architecture quanta sustained by the platform itself. Consequently, the platform design might opt for shared rather than owned resources.

Figure 3: Infrastructure Break Point
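The breakpoint check described above is simple arithmetic: the tightest quota bounds the population of architecture quanta. The quotas and per-quantum footprints below are hypothetical, not real cloud limits.

```python
# Hypothetical quotas and per-quantum footprints; real limits vary
# by provider, account tier, and requested quota increases.
HARD_LIMITS = {"vpcs": 100, "iam_roles": 1000, "storage_buckets": 250}
PER_QUANTUM = {"vpcs": 1, "iam_roles": 6, "storage_buckets": 3}

def infrastructure_breakpoint(limits, per_quantum):
    """Maximum population of architecture quanta with exclusive resources:
    the tightest quota is the binding constraint."""
    return min(limits[k] // per_quantum[k] for k in per_quantum)

print(infrastructure_breakpoint(HARD_LIMITS, PER_QUANTUM))  # 83 (storage buckets bind)
```

When the binding constraint is reached well before the expected number of quanta, the platform design should switch that resource from owned to shared.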

Cost consumption model

On average, an architecture quantum contains both owned and shared resources. Cost consumption for a Cost Allocation Unit can be determined by direct costs of owned resources and cost chargeback of shared resources.

Figure 4: Owned vs shared resources

Reference Architecture

A reference architecture to apply FinOps to a data management platform requires the following set of functions:

Figure 5: FinOps Reference Architecture for Architecture Quanta

  • Application Telemetry

    This is a foundational capability of every platform. It enables every piece of the platform (including architecture quanta) to ingest system-related measures. Measures are any live data associated with the behavior of data, metadata, and systems: generated costs, consumed vCPUs and vMemory, bandwidth, storage, IOPS, number of objects in the catalog, etc.

  • Cost Ingestion

    It refers to the ingestion of cost measures and their aggregation on the basis of cost governance rules, for instance monthly aggregation or grouping costs by specific source-based criteria.

    Cost ingestion is a critical operation. Existing cost management tools are oriented toward integrating cloud provider services first, and it is not common to find native or easy integration with cost-related measures from enterprise-level tools in the data space. For instance, Dremio, Starburst, Snowflake, Databricks, and Cloudera (among many others) are popular platforms with poor integration into the cost management landscape. This limits the potential of FinOps practices in the data space. In some cases (see Databricks with Azure Cost Management) a direct integration shows all its benefits.
Figure 6: Reference Cost Management APIs for Popular Data Platforms
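A cost governance rule such as monthly aggregation by source tag can be sketched in a few lines. The record shape (day, source tag, cost) and the figures are illustrative, not a real billing export format.

```python
from collections import defaultdict
from datetime import date

# Raw cost measures as they might arrive from a billing export:
# (day, source tag, cost). Shapes and names are illustrative.
measures = [
    (date(2024, 1, 3),  "warehouse", 40.0),
    (date(2024, 1, 15), "warehouse", 42.5),
    (date(2024, 1, 20), "kafka",     18.0),
    (date(2024, 2, 2),  "warehouse", 39.0),
]

def aggregate_monthly(measures):
    """A simple cost governance rule: group ingested
    measures by (month, source tag)."""
    buckets = defaultdict(float)
    for day, source, cost in measures:
        buckets[(day.strftime("%Y-%m"), source)] += cost
    return dict(buckets)

print(aggregate_monthly(measures))
```

The same structure accommodates other source-based criteria by changing the grouping key.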

  • Cost Engine

    It converts resource consumption to cost consumption. There are two main goals for the Cost Engine: 
    1) measuring costs
    2) estimating costs

    A Cost Engine can retrieve cost measures from the Application Telemetry layer of the platform or directly from the cloud provider service or technology.
    The Cost Engine is responsible for cost computation on the basis of source-aligned rules. This means that the Cost Engine does not necessarily know how costs must be aggregated to be attributed to the right granularity of the organization. It focuses on how to determine the cost of single resources aggregated at most by the Cost Allocation Unit.

    Since cost management tools are usually well integrated with cloud providers, they natively support cloud-native resource tagging. Depending on the tool, however, an abstraction layer for tag governance may be missing. For instance, Confluent, AWS MSK, and other Kafka-based services are billed on a per-cluster basis. This requires the Application Telemetry to enrich resource consumption data with tracking information about the Cost Allocation Unit, so that the Cost Engine can parcel the total cost of shared resources through this application-specific tagging. Any other data technology used as a shared service among multiple Cost Allocation Units needs the same ability to track consumption through application-specific tagging.
Figure 7: Kafka-based services — Cost Management APIs
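Parceling a per-cluster bill through application-specific tagging amounts to a proportional split over consumption measures. The function and the consumption figures below are a hedged sketch, not a vendor API.

```python
def parcel_shared_cost(cluster_cost, consumption_by_cau):
    """Split a per-cluster bill across Cost Allocation Units in
    proportion to the consumption reported by application telemetry."""
    total = sum(consumption_by_cau.values())
    return {cau: cluster_cost * used / total
            for cau, used in consumption_by_cau.items()}

# GB produced per CAU, as enriched via application-specific tagging.
consumption = {"orders-dp": 600, "payments-dp": 300, "audit-dp": 100}
print(parcel_shared_cost(1000.0, consumption))
# {'orders-dp': 600.0, 'payments-dp': 300.0, 'audit-dp': 100.0}
```

Any consumption measure exposed by telemetry (bytes, partitions, requests) can drive the split; the choice of measure is itself a governance decision.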

  • Price Ingestion

    It has concerns similar to Cost Ingestion, due to the diversity of technologies whose prices a data platform may need to integrate. In addition, prices are not always public or available through APIs or tooling. The Price Ingestion module therefore needs the capability to configure price lists and SKUs and link them to cloud or application-specific resources through source-aligned tagging.

  • Cost Attribution 

    It aggregates costs based on organizational criteria (target-aligned tagging). The Cost Allocation Unit is the starting point for the data management organization: the Cost Attribution module holds organizational knowledge and must be able to map and aggregate costs according to organizational needs.
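Mapping CAU-level costs onto organizational units through target-aligned tags can be sketched as a simple aggregation. The mapping and figures below are illustrative assumptions.

```python
# Target-aligned tagging: each Cost Allocation Unit maps to an
# organizational unit (names are illustrative).
ORG_MAPPING = {"orders-dp": "sales", "payments-dp": "finance", "audit-dp": "finance"}

def attribute_costs(cau_costs, org_mapping):
    """Aggregate CAU-level costs by organizational unit."""
    by_unit = {}
    for cau, cost in cau_costs.items():
        unit = org_mapping[cau]
        by_unit[unit] = by_unit.get(unit, 0.0) + cost
    return by_unit

costs = {"orders-dp": 600.0, "payments-dp": 300.0, "audit-dp": 100.0}
print(attribute_costs(costs, ORG_MAPPING))
# {'sales': 600.0, 'finance': 400.0}
```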

Granularity

We thus have several levels of granularity, moving from source-aligned to target-aligned tagging.

Figure 8: Cost attribution granularity

Depending on the definition of a Cost Allocation Unit for a specific company, tagging can follow different assignment policies:

  • Direct assignment: cloud or application resources can be assigned directly to a Cost Allocation Unit.
  • Indirect assignment: shared resources cannot be associated directly with a Cost Allocation Unit, but we have the elements to derive this association given the total cost of the shared resources (or their price), instrumentation, and resource consumption measures.
  • Rule-based assignment: there is no way to attribute the cost of a shared resource to a Cost Allocation Unit. In this case, we must define a rule to split the cost, and it cannot be based on source-aligned tagging.
    For instance, if we are not able to parcel the cost of a specific shared resource across different Cost Allocation Units, we may adopt one of the following rules:

Equal cost attribution:

RuleBasedCost(CAU_i) = Cost(SharedResource) / N_CAU

Cost-weighted attribution:

RuleBasedCost(CAU_i) = Cost(SharedResource) * [DirectCost(CAU_i) + IndirectCost(CAU_i)] / SUM_j [DirectCost(CAU_j) + IndirectCost(CAU_j)]

Resource-weighted attribution:

RuleBasedCost(CAU_i) = Cost(SharedResource) * Resource_X(CAU_i) / SUM_j [Resource_X(CAU_j)]

As the reader can notice, this approach requires the Cost Engine to compute those artificial costs of CAUs based on heuristics. In fact, we are not measuring costs, we are formulating a cost attribution ourselves. The Cost Attribution module shall aggregate those costs through target-aligned tagging.
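The three rule-based attribution heuristics can be sketched directly from their formulas; names and figures are illustrative.

```python
def equal_attribution(shared_cost, caus):
    """Equal cost attribution: every CAU gets the same share."""
    share = shared_cost / len(caus)
    return {cau: share for cau in caus}

def cost_weighted_attribution(shared_cost, known_costs):
    """Cost-weighted: share proportional to each CAU's known
    direct + indirect costs."""
    total = sum(known_costs.values())
    return {cau: shared_cost * c / total for cau, c in known_costs.items()}

def resource_weighted_attribution(shared_cost, resource_x):
    """Resource-weighted: share proportional to a chosen resource measure X."""
    total = sum(resource_x.values())
    return {cau: shared_cost * r / total for cau, r in resource_x.items()}

print(equal_attribution(900.0, ["cau_a", "cau_b", "cau_c"]))
# {'cau_a': 300.0, 'cau_b': 300.0, 'cau_c': 300.0}
print(cost_weighted_attribution(900.0, {"cau_a": 100.0, "cau_b": 200.0}))
# {'cau_a': 300.0, 'cau_b': 600.0}
```

These are artificial costs, not measured ones, so the chosen rule should be documented as part of cost governance.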

Moving from architecture quantum to data product

An architecture quantum is a data product that delivers a data model that makes sense to the business. This excludes temporary or technical tables, intermediate aggregations, or any piece of data that cannot be directly used as a product. Technical reusability is not a good reason to build a data product, and this sets data products apart from regular data modeling for data lakes. A data lakehouse architecture usually has an ingestion layer hosting raw data. Raw data mirrors tables from source databases, and such a representation does not necessarily work for the business; a certain degree of aggregation and denormalization should be applied.

Figure 9: Data product guaranteeing a full certification process

Product thinking requires a data product to guarantee a full certification process for the data delivered. This certification process is provided by data governance; whenever it is automated, we apply computational governance to data products. A data product provides the bounded context for a data model that can be delivered through as many output ports (aka data contracts) as there are data access patterns suitable for consumption (streaming, batch, interactive workloads). The relevance of data modeling to cost management is strongly underestimated, yet it is key to optimizing data reusability, whereas software reusability is usually the major concern that leads to technical data modeling.

Data Products establish a clear ownership principle around a data model and everything necessary to make it consumable. This is strictly related to how we can implement cost attribution to build solid cost centers and charge-back policies.

Conclusion

This article proposes a reference architecture layout for FinOps applied to Data Product-based platforms. I’ve established a terminology to clearly define resource ownership and cost attribution models.

I’ve proposed the Bounded Ownership Design Principle (BODP) as an ownership principle that links the cost of data with that of the metadata, software, and infrastructure necessary to make that data consumable. The Cost Allocation Unit, in turn, represents the minimum unit of cost recognised by a company, and it is specific to each company. Depending on how the BODP and the Cost Allocation Unit overlap, I argue there are concerns about whether cost attribution can be granular enough to meet a company’s goals.

This article addresses the consumption model between the platform and data products. I’ve decided to keep the consumption model between data products out of scope, although most of the arguments can be repeated in that case as well.
