Master Data Management: challenges and basics

A Master Data Management (MDM) system is the single source of truth for all data company-wide. The problem it addresses is unifying and harmonizing ambiguous and discordant information coming from multiple sources within the same organization. Every single piece of information can be replicated across several systems and referenced by different identifiers.
Data can be labeled in many ways and take separate routes from acquisition to utilization.

We will proceed by walking through a sample scenario drawn from real cases.

A reference scenario: banking and insurance companies

A banking institution typically deals with myriad data flowing in from many different directions.

Such organizations typically include many divisions; to mention just a few: Private Banking, Insurance Services, Legal & Compliance, Asset Management, ID Services, and Real Estate.

Those areas need to communicate with subsidiaries, external services, and government agencies for regulatory purposes.

Each of those items corresponds to a separate data management system that gathers information about a set of business entities and turns it into contracts, services, transactions, and whatever else is needed to run the business.

In reality, each division within the organization has a life of its own. Their systems are developed by different suppliers, evolve over time, rely on different technologies, and are managed by people with divergent backgrounds; usually they don't even share the same vocabulary. Yet they all serve the same business.

A banking/insurance system: different technology stacks, separate data flows, different languages.

How information diverges

From a system perspective, the issue may look merely technical, with no consequences for the business. Having distinct technologies is something that can likely be resolved through system integration, using custom development or specialized tools.

Nevertheless, the mismatch is deeper than it appears, running from the technical to the cultural level across all divisions of the organization.

The business relies on business objects that have a specific representation within each subsystem they live in.

The representation of the same business object is heterogeneous across the subsystems of the organization.

For instance, a single person may be identified by an email address within a marketing campaign, by a unique registration ID within the Real Estate division, or by the fiscal code in the Legal & Compliance department. Each profile related to that person may have a different registered email, phone number, and even document ID. That is, the same attribute "email" has different values for the same person, creating ambiguity among the data held by the organization's various subsystems.
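To make the ambiguity concrete, here is a minimal sketch, with purely hypothetical class names and values, of how the same person might be represented by three subsystems:

```scala
// Hypothetical, simplified records: the same physical person as seen by three subsystems.
// Names, fields, and values are illustrative, not taken from any real banking data model.
case class MarketingContact(email: String, optIn: Boolean)
case class RealEstateCustomer(registrationId: String, email: String, phone: String)
case class ComplianceSubject(fiscalCode: String, documentId: String, email: String)

val marketing  = MarketingContact("j.doe@promo.example", optIn = true)
val realEstate = RealEstateCustomer("RE-48213", "john.doe@home.example", "+39 333 1234567")
val compliance = ComplianceSubject("DOEJHN80A01H501X", "CA1234567", "jdoe@pec.example")

// Three different identifiers (email, registrationId, fiscalCode) and three different
// email values for what the business considers a single person.
```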

Business models with many ambiguities.

The same ambiguity arises for all the other business objects, leading to models that lack clarity, invite misunderstandings, and contain data that is hard to reuse without making many mistakes.

Many events can occur in the different subsystems that change those data: for instance, a change of email, a change of residence address, or an ID renewal.

Events that change business object attributes

How should we consider the attributes coming from different sources when they are related to the same business objects? This is one of the important questions to answer in the context of Master Data Management.

How data quality affects ambiguity

Data quality matters if an organization wants to keep ambiguity away. Degradation and missing quality gates increase the chances of introducing mistakes into the data model of the business objects. Consider an unvalidated fiscal code that contains a single wrong character: it can easily end up matching another person. A mobile number can be reassigned by the telecom provider, a degradation that multifactor authentication can mitigate, and a fake email or phone number can be validated before being accepted by the system. Different product categorizations across departments without cross-system validation are another issue. There are plenty of such examples.
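As an illustration, below is a minimal sketch of format-level quality gates, assuming simple pattern checks (the 16-character layout of the Italian fiscal code without its checksum, and a loose email pattern). Real gates would go further, with checksum validation, verification loops, and multifactor confirmation.

```scala
// A minimal sketch of format-level quality gates. These only catch gross format errors;
// real validation (fiscal code checksum, email/phone verification, multifactor
// authentication) would be stricter.
def isPlausibleFiscalCode(fc: String): Boolean =
  fc.toUpperCase.matches("[A-Z]{6}\\d{2}[A-Z]\\d{2}[A-Z]\\d{3}[A-Z]")

def isPlausibleEmail(email: String): Boolean =
  email.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+")

// Reject the update (or route it to a data steward) before it pollutes the model.
def acceptUpdate(fiscalCode: String, email: String): Boolean =
  isPlausibleFiscalCode(fiscalCode) && isPlausibleEmail(email)
```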

Golden record

The main objective of a Master Data Management system is to keep an up-to-date version of the Golden Record: a clear data model of a business object that integrates attributes coming from all subsystems.

Unification and harmonization of business object data models

Matching phase

Updates coming from any of the subsystems have to be matched against the Golden Records; that is, the MDM system must detect whether this business object instance has already entered the organization in some shape.

Examples of matching records

The matching phase explores a set of rules of several types to find a matching Golden Record. Those rules can be for exact or fuzzy matching, meaning that sometimes it is necessary to relax the algorithm to account for poor data quality, typos, and misspellings (like Johnny and Jonny). The same rule can aggregate checks on different attributes at the same time to reinforce a test against the Golden Record.

Given a new business object instance entering the organization, the matching phase checks whether there is at least one Golden Record that satisfies any of the rules, evaluated in a given order.

This may seem controversial, because many Golden Records could correspond to a given business object instance update. Nevertheless, we have to implement a strategy that reduces mistakes, and this is why the order in which the rules are executed and the type of match applied to a new record matter so much.

Where no match is found, the business object instance is promoted to a new Golden Record.
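Below is a minimal sketch of an ordered matching strategy. The GoldenRecord and Incoming models, the rules, and the edit-distance threshold are illustrative assumptions, not the implementation of any specific MDM product.

```scala
// A simplified Golden Record and an incoming business object instance (hypothetical model).
case class GoldenRecord(id: String, fiscalCode: Option[String], email: Option[String],
                        firstName: String, lastName: String)
case class Incoming(fiscalCode: Option[String], email: Option[String],
                    firstName: String, lastName: String)

type Rule = (Incoming, GoldenRecord) => Boolean

// Levenshtein edit distance, used for fuzzy name matching.
def editDistance(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (i == 0) j else if (j == 0) i else 0)
  for (i <- 1 to a.length; j <- 1 to b.length)
    d(i)(j) = Seq(d(i - 1)(j) + 1, d(i)(j - 1) + 1,
                  d(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)).min
  d(a.length)(b.length)
}

// Rules in decreasing order of strictness: the first rule that finds a match wins.
val rules: Seq[Rule] = Seq(
  // Exact match on the strongest identifier.
  (in, gr) => in.fiscalCode.isDefined && in.fiscalCode == gr.fiscalCode,
  // Exact match on email, reinforced by the last name.
  (in, gr) => in.email.isDefined && in.email == gr.email && in.lastName.equalsIgnoreCase(gr.lastName),
  // Fuzzy match on names, tolerating typos and misspellings like "Johnny" vs "Jonny".
  (in, gr) => in.lastName.equalsIgnoreCase(gr.lastName) &&
              editDistance(in.firstName.toLowerCase, gr.firstName.toLowerCase) <= 1
)

def findMatch(in: Incoming, golden: Seq[GoldenRecord]): Option[GoldenRecord] =
  rules.view.flatMap(rule => golden.find(gr => rule(in, gr))).headOption
// If findMatch returns None, the incoming instance becomes a new Golden Record.
```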

Merging phase

The second step in constructing a Golden Record is the merging phase. After a match is found for the incoming business object instance, the new record must be merged with the attributes of the Golden Record. Here are some of the criteria that can be applied.

Source Reliability
As we have mentioned, data quality is relevant to the disambiguation process. In this respect, we can associate a score with each attribute of each business object and rank attributes by reliability.
For instance, we can give an unvalidated email from the marketing subsystem a score of 1 out of 5, while an email coming from the real estate subsystem gets a score of 5 because it is backed by multifactor authentication.

Last attributes come first
The most recent data are usually considered more reliable because they are more likely to be up to date.

Group attributes by category
For example, avoid taking the last name from the new business object instance while keeping the first name from the Golden Record: they are essentially a combined attribute, and it does not make sense to treat them separately.
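The sketch below combines these criteria, assuming hypothetical source names, reliability scores, and a timestamp on every attribute. It is one possible merge policy, not the only one.

```scala
import java.time.Instant

// An attribute value together with its provenance (hypothetical model).
case class AttributeValue(value: String, source: String, updatedAt: Instant)

// Illustrative reliability scores (1 = weak, 5 = strong): e.g. marketing emails are
// unverified, while real estate emails are confirmed via multifactor authentication.
val sourceReliability: Map[String, Int] = Map("marketing" -> 1, "legal" -> 4, "realEstate" -> 5)

def pickBest(current: AttributeValue, incoming: AttributeValue): AttributeValue = {
  val currentScore  = sourceReliability.getOrElse(current.source, 0)
  val incomingScore = sourceReliability.getOrElse(incoming.source, 0)
  if (incomingScore > currentScore) incoming                        // source reliability first
  else if (incomingScore == currentScore &&
           incoming.updatedAt.isAfter(current.updatedAt)) incoming  // then recency
  else current
}

// First and last name travel together: replace both or neither.
case class FullName(firstName: AttributeValue, lastName: AttributeValue)
def mergeName(current: FullName, incoming: FullName): FullName =
  if (pickBest(current.lastName, incoming.lastName) eq incoming.lastName) incoming else current
```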

Challenges and opportunities

There are many challenges that a Master Data Management system is expected to solve.

Data Privacy

One of the most relevant is data privacy. Managing consent policies across multiple channels (divisions of the same organization) is very hard to deal with. Consider a customer involved in a marketing campaign who allows the organization to use her data only for the submitted survey, while the same customer provides contradictory consents through another channel (branches or agencies). What should we do in this case? Worse, think about a group of companies exchanging or sharing customers' information while having to respect regulations like GDPR and HIPAA. How do we deal with different or contradictory consents? How do we propagate privacy consents bottom-up and top-down across all subsystems? How do we segregate duties among divisions with respect to regulations? None of this is simple or clear-cut, and MDM can really help sort these things out. Nevertheless, designing such a system is a complex task.

Data access control

Any organization working with a large amount of data has to deal with controlled levels of data access. Data stewards, chief data officers, data/ML engineers, CRM staff, IT operations, and data governance all access the data, but with distinct privileges. Thus, data masking and segregation based on roles and duties must be provided at every level, and MDM can drive this complex logic.
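As a simple illustration, the sketch below masks a single attribute based on the caller's role. The roles and masking rules are hypothetical; a real MDM deployment would drive them from a centrally managed policy.

```scala
// Hypothetical roles and masking rules for a single attribute.
sealed trait Role
case object DataSteward   extends Role
case object MarketingUser extends Role
case object MlEngineer    extends Role

def maskEmail(email: String): String = email.split("@") match {
  case Array(local, domain) => local.take(1) + "***@" + domain
  case _                    => "***"
}

def emailFor(role: Role, email: String): String = role match {
  case DataSteward   => email             // full access for data stewardship
  case MarketingUser => maskEmail(email)  // partially masked
  case MlEngineer    => "<redacted>"      // fully masked, e.g. for model-training extracts
}
```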

Batch vs real-time

Customers are hungry for technology. They want to buy an insurance policy or get a loan approved and, at the same time, monitor the status of any action from a smartphone in real time. These apparently simple requirements translate into complex subsystem integrations whose results, in a solution based on batch processes, may take several days to become available to the end user. For instance, purchasing a single loan through internet banking services can potentially involve every division in the organization, such as Real Estate, Legal & Compliance, Insurance Services, etc. Such a requirement may force a shift from batch to real-time architectures, which necessarily involves a Master Data Management system as a central asset of the organization, and this is far from trivial (see A Data Lake new era).

360 Customer View

A 360 Customer View focuses on customer data analytics, providing the organization with a customer-centric view. It is closely tied to the customer business model provided by the MDM and makes it possible to inspect customers' needs from their perspective, giving the business one more chance to do better. MDM supplies a comprehensive collection of information about individuals, whether customers or prospects, and thus represents the principal source of trusted information about an individual.

What’s next

This is just an introductory article meant to give you an intuition of what a Master Data Management system has to deal with and of the basic principles behind it. Next time, we will go deeper into the technical and functional details of Master Data Management and dig into some of the challenges mentioned here and their solutions. Please share your experience and send us your feedback! Follow us on AgileLab.

Written by Ugo Ciracì – Agile Lab Project Lead

 

If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!

Data Quality for Big Data

In today's data-intensive society, Big Data applications are becoming more and more common. Their success stems from the ability to analyze huge collections of data, opening up new business prospects. Devising a novel, clever, and non-trivial use case for a given collection of data is not enough to guarantee the success of a new enterprise. Data is the main actor in a big data application: it is therefore of paramount importance that the right data is available and that its quality meets certain requirements.
In this scenario we started the development of DataQuality for Big Data, a new open-source project that automatically and efficiently monitors the quality of data in a big data environment.

Why Data Quality?

One of the main goals of a big data application is to extract valuable business knowledge from the input data. This process is not a trivial computation of summary statistics over the raw data, and traditional tools and techniques cannot be applied efficiently to the huge collections of data that are becoming commonplace. To achieve the intended objective, a common approach is to build a Data Pipeline that transforms the input data through multiple steps in order to calculate business KPIs, build or apply machine learning models, or make the data usable by other applications.

 Data Pipeline workflow example

In this scenario it is quite common to have pre-defined input data specifications and output data requirements. The pipeline steps are then defined, relying on those specifications, to achieve the desired output.
However, in most cases we do not have full control over the input data, or we may have specifications and assumptions based on past data analysis. Unfortunately (and fortunately at the same time) data changes over time, and we would like to keep track of changes that could violate our application's assumptions. In other cases data can get corrupted (there are a lot of moving parts in complex environments), and we don't want dirty data leading to unusable results. We would like, instead, to intercept and prevent this situation, restore the input data, and run our data pipeline in a clean scenario.
Another problem is related to the data pipeline itself. Even after development tests and massive UATs there can always be some kind of bug, due to wrong programming, unforeseen corner cases in the input data, or a poorly executed testing/UAT phase. Constantly checking our core application properties over time is a good way to catch these bugs as soon as possible, even when the application is already in production.
A continuous Data Quality check on input, intermediate, and output data is therefore strongly advisable.

Approaches to Data Quality

Manual checking of some properties and constraints of the data is of course possible, but not affordable in the long run: an automatic approach is needed.
A possible solution is to implement project-specific controls and checks on the data. This has the obvious advantage of integrating quality checks into different parts of the application. There are however several drawbacks:

  • Development time consumption: developers have to spend time designing, developing, and testing the controls.
  • Code readability: both the application-specific code and the data quality code will reside in the same code base.
  • Hard and costly maintainability: additional checks will require a new release of the software.
  • Application-specific checks: every application will have its own controls, possibly redundant on some data inputs. Data quality checks can't be executed in isolation without the related application (or only with additional overhead in the development phase).

A better solution is to have a generic Data Quality framework, able to read many data sources, easily configurable, and flexible enough to accommodate different needs. That's where our framework shines!

DataQuality project

DQ is an open-source framework for building parallel and distributed quality checks in big data environments. It can be used to analyze structured or unstructured data, calculating metrics (numerical indicators on the data) and performing checks on those metrics to assure data quality. It relies entirely on Spark to perform distributed computation.
Compared to typical data quality products, this framework performs quality checks at the raw level. It does not leverage any SQL-like abstraction such as Hive or Impala, because those perform type checks at runtime and hide badly formatted data. Hadoop offers structured data representations (e.g. through Hive) but mainly stores unstructured data (files), so we think quality checks must be performed at the raw level, without typed abstractions.
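The snippet below is a minimal sketch, not the framework's actual code, of what a raw-level check with Spark looks like: the file is read as plain text, so a badly formatted amount is counted instead of being silently hidden by a typed abstraction. The path and the field layout are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder.appName("raw-quality-check").getOrCreate()

// Read the file as plain text: no SQL-like schema is imposed on it.
val lines = spark.read.textFile("hdfs:///data/transactions/2018-01-01.csv") // hypothetical path

val rowCount = lines.count()
// Count rows whose fourth comma-separated field is not a positive number.
val malformedAmounts = lines.filter { line =>
  val fields = line.split(",", -1)
  fields.length < 4 || Try(BigDecimal(fields(3))).filter(_ > 0).isFailure
}.count()
// A check could then require malformedAmounts == 0 before the rest of the pipeline runs.
```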
With DQ it is possible to:

  • Load heterogeneous data from different sources (HDFS, etc.) and various formats (Avro, Parquet, CSV, etc.)
  • Select, define, and compute metrics with different granularity (i.e. at file or column level)
  • Compose metrics and perform checks on them (e.g. boundary checks, metric comparison requirements, etc.)
  • Evaluate the quality and consistency of data, defining constraints and properties, both technical and domain-dependent
  • Save check results and historical metrics to multiple destinations (HDFS, MySQL, etc.)

The DQ engine is even more valuable when inserted into an architecture like the one shown below:

Data Quality solution overview

 

 

In this kind of architecture, our Data Quality engine takes a configuration as input describing which data has to be checked; it then fetches the data, calculates the metrics, and performs the configured checks on them. The results are saved in a data store, MongoDB in the example. In this scenario a configuration engine (the green box) can be very helpful to configure sources/metrics/checks through a simple web application, or to automatically create standard checks for every input listed in an enterprise's Data Governance. Other components can be implemented on top of this architecture to send reports, build graphical dashboards, or apply recovery strategies in case of corrupted data.

DQ Engine: State of the art

To understand the current state of the project, we first have to clarify the main abstractions of the engine (a minimal sketch of these abstractions in code follows the list):

  • Source: As the name suggests, it is the definition of some data residing in a data storage system. It could be a table, a text or Parquet file on HDFS, a table in a MySQL DB, or any other kind of data.
  • Metric: A measure, represented as a number, over the data. It can refer to a whole file (e.g. the number of rows in a text file) or to a specific column (e.g. the number of null values in a given column).
  • Check: A property that has to be verified on one or more metrics. For example, we can define a check on a metric w.r.t. a given threshold (e.g. the number of null values in a column must equal 0), or w.r.t. another metric (e.g. the number of distinct values in a column of data A must be greater than the number of distinct values in a column of data B). Checks give the user the ability to verify technical properties, such as the integrity of the data, or functional properties, such as the consistency of a join condition over multiple datasets.
  • Target: The definition of a data store in which we want to save our results, i.e. the computed metrics and check results.
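
The following is a minimal sketch of how these four abstractions could be modeled; the actual types in the DataQuality code base may differ.

```scala
// Illustrative model of the engine's abstractions (not the framework's real classes).
case class Source(id: String, path: String, format: String)

sealed trait Metric { def id: String; def source: String }
case class FileMetric(id: String, source: String, name: String) extends Metric
case class ColumnMetric(id: String, source: String, column: String, name: String) extends Metric

sealed trait Check
case class ThresholdCheck(metricId: String, comparison: String, threshold: Double) extends Check
case class MetricComparisonCheck(leftMetricId: String, comparison: String,
                                 rightMetricId: String) extends Check

case class Target(id: String, storage: String, path: String)
```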

The schema below describes the high-level pipeline of the DataQuality engine.

Pipeline of the Data Quality Engine
The engine is powerful and flexible enough to be usable in different contexts and scenarios. New sources, metrics, checks and output formats can be plugged in and can be reused without further implementation.

Below is a quick recap of what is already available. As already said, any other component can be plugged in.

  • Sources: HDFS files (parquet or text files, with fixed length fields or with a field separator)
  • Metrics:
    • File metrics: row count
    • Column metrics:
      • Count on distinct values, null values, empty values
      • Min, max, sum, avg on numeric columns
      • Min, max, avg on string length in text columns
      • Count of well formatted dates, numbers or strings
      • Count of column values in a given domain
      • Count of column values out of a given domain
    • Composed metrics
      • It is possible to compose basic metrics with mathematical formulas
  • Checks:
    • Comparison logic
      • Greater than
      • Equal to
      • Less than
    • Comparison terms
      • Metric vs. Threshold
      • Metric vs. Metric
  • Targets:
    • Type:
      • HDFS text files
    • Format:
      • Columnar and file metrics
      • Columnar and file checks

Use case scenario

Let's now look at a possible scenario to understand how DataQuality can be used and how its configuration pattern works.
Suppose we are in a bank's IT department and our application takes as input a daily transaction flow that arrives in CSV format on HDFS. The expected structure of the file is the following:

  • ID: unique id of the transaction
  • SENDER_IBAN: IBAN of the transaction issuer
  • RECEIVER_IBAN: IBAN of the transaction receiver
  • AMOUNT: amount of the transaction
  • CURRENCY: original currency of the transaction
  • DATE: timestamp at which the transaction was issued

On a daily basis we want to check the quality of the received data. Some required high level properties could be:

  1. Uniqueness of the ID in the daily transaction flow
  2. Correctness of the amount (i.e. positive number)
  3. Currency within the set of expected currencies (i.e. USD and EUR)
  4. Correctness of the date format (date encoded in a specific format, e.g. yyyy-MM-dd HH:mm )

Let's see how we can express these requirements in terms of source, metric, and check configuration for the DataQuality framework.
Here you can find a possible configuration file for the above use case:
DataQuality Configuration example
To read the input file, the only thing we have to do is add a source to our sources list, specifying an id, the type of the source, the HDFS path, the format of the file (CSV, fixed length, …), and a few other parameters. The file date in the path definition will be automatically filled in by the framework in the specified format, depending on the run parameters.
Then we have to formalize our high-level requirements in terms of metrics and checks (an illustrative Spark sketch of some of these metrics follows the list):

  1. Uniqueness of the ID in the daily transaction flow
    1. Metric A – File rows count
    2. Metric B – Distinct values count in column ID
    3. Check – Metric A = Metric B
  2. Correctness of the amount (i.e. number with some precision/scale)
    1. Metric A – File rows count
    2. Metric C – Count of rows with well formatted numbers in amount column
    3. Check – Metric A = Metric C
  3. Currency within the set of expected currencies (i.e. USD and EUR)
    1. Metric D – Count of rows with currency column value is outside (‘USD’, ‘EUR’) domain
    2. Check – Metric D = 0
  4. Correctness of the date format (date encoded in a specific format, e.g. yyyy-MM-dd HH:mm )
    1. Metric A – File rows count
    2. Metric E – Count of rows with well formatted timestamps in date column
    3. Check – Metric A = Metric E
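
For illustration only, the sketch below computes Metrics A, B, and D and the corresponding checks directly with Spark; the framework itself expresses the same logic declaratively in its configuration file, and the path shown is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("transactions-quality").getOrCreate()
import spark.implicits._

// CSV layout: ID, SENDER_IBAN, RECEIVER_IBAN, AMOUNT, CURRENCY, DATE
val fields = spark.read
  .textFile("hdfs:///flows/transactions/2018-01-01.csv") // hypothetical path
  .map(_.split(",", -1))

val metricA = fields.count()                       // Metric A: file rows count
val metricB = fields.map(_(0)).distinct().count()  // Metric B: distinct values in the ID column
val metricD = fields.filter(f => f.length <= 4 || !Set("USD", "EUR").contains(f(4))).count()
                                                   // Metric D: currencies outside (USD, EUR)

val idUniquenessCheck = metricA == metricB  // requirement 1: Metric A = Metric B
val currencyCheck     = metricD == 0L       // requirement 3: Metric D = 0
```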

In the configuration file, metrics and checks are defined separately because metrics are independent of the checks in which they are used; in fact, a metric can be reused in different checks (e.g. Metric A).
The list at the bottom contains the output file configuration.
Instructions on how to run the framework with a custom configuration file can be found on the GitHub project page.

Conclusion

It should be clear that assuring data quality over time is a must in a big data environment. This goal must be achieved with an automatic approach: it is neither affordable nor feasible to build ad-hoc solutions for every project and data source in complex big data environments.
An open, flexible, extensible framework such as DataQuality guarantees a high level of control over data quality in different contexts and scenarios, requiring only minor configuration to extend its usage.
The pluggable nature of the project makes it highly reusable and open to continuous extensions. It's also open-source, so you are welcome to contribute to the project and make it even better!
GitHub https://github.com/agile-lab-dev/DataQuality

 

If you found this article useful, take a look at our blog and follow us on our Medium Publication, Agile Lab Engineering!