
The Comprehensive Guide to Knowledge Graphs

In an era dominated by data and AI, building context is the holy grail of any large-scale information system. Simply owning a vast amount of data is no longer enough to empower successful business decisions. To generate insights from multiple, large datasets, one must find a meaningful way to connect different sources and identify patterns across connections.

In this post, we'll break down what knowledge graphs are and why they matter.

 

What are Knowledge Graphs?


Knowledge graphs offer a powerful way to represent not just data, but the relationships and semantics behind it.


 

If the word "graph" brings to mind a chart or a plot, think again. The kind of graph we're talking about is a mathematical structure made up of two elements:

  • Vertices—also known as Nodes,
  • Edges—also known as Relationships.

Picture a set of points connected by lines. The study of such structures dates back to the 18th century, when it was formalized by the Swiss mathematician Leonhard Euler. Graphs are often used to model real-world entities and the ways they interact. An 'entity' can be anything from a person to a place to an object. Nodes represent the state of an entity, while edges represent the way it relates to other entities, or even to itself.

A classic example of a graph is a social network. Each profile (a node) represents a user (the entity in focus), and users connect to other users through a friendship relationship (an edge), "is friends with".

We can take it further and consider a more complex scenario: imagine a seller (another type of entity) creates a profile to sell their products. We could represent products with a new type of node, and the selling relationship with a new type of edge, "sells", that connects the seller to each product they offer.

Representation of a social network

Introducing Property Graphs

By now, you should have a clear picture of what a graph is. There are many different types of graphs, but we'll focus on one specific kind: the property graph.

In our previous example, the friendship relationship was an instance of an undirected edge: if Alice is friends with Bob, then Bob is friends with Alice. The selling relationship, on the other hand, is directional. A shop can sell a product, yet the product cannot sell the shop.

From now on, we will consider only directed graphs. They are more general and can represent the same information as undirected graphs. For instance, we can model the "friendship" relationship with a pair of directed edges: one from Alice to Bob, and one from Bob to Alice.

Let's break down how to describe nodes and relationships. Nodes can carry as many labels as necessary to categorize the entity type. For example, Alice might be labeled both "User" and "Seller." Nodes can also carry additional information as key-value pairs, called properties, such as name, shop_name, or shop_location.

Relationships are stricter. Each must have exactly one type, which acts as its identifier, and must connect two nodes, from a start node to an end node. Like nodes, relationships can store properties as key-value pairs. Most importantly, relationships cannot exist on their own: no dangling relationships are allowed.

 

A simple knowledge graph showing the relationship between Alice, the protagonist of Alice in Wonderland, its author, Lewis Carroll, and a feature in the Vanity Fair magazine after it was published.

 


A node can have zero or more labels and zero or more properties.
A relationship must have exactly one type, can have zero or more properties, and must connect a start node to an end node.
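To make these rules concrete, here is a minimal Python sketch of a property graph. The names `PropertyGraph`, `add_node`, `add_relationship`, and `add_undirected` are illustrative assumptions, not the API of any particular graph database, and the sample data simply reuses the Alice and Bob example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set = field(default_factory=set)        # zero or more labels
    properties: dict = field(default_factory=dict)  # zero or more key-value pairs


@dataclass
class Relationship:
    rel_type: str   # exactly one type
    start: str      # id of the start node
    end: str        # id of the end node
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory property graph, for illustration only."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(set(labels), dict(properties))

    def add_relationship(self, rel_type, start, end, **properties):
        # No dangling relationships: both endpoints must already exist.
        if start not in self.nodes or end not in self.nodes:
            raise ValueError("both endpoints must exist in the graph")
        self.relationships.append(
            Relationship(rel_type, start, end, dict(properties))
        )

    def add_undirected(self, rel_type, a, b, **properties):
        # An undirected link is modelled as a pair of directed edges.
        self.add_relationship(rel_type, a, b, **properties)
        self.add_relationship(rel_type, b, a, **properties)


graph = PropertyGraph()
graph.add_node("alice", labels=["User", "Seller"], name="Alice")
graph.add_node("bob", labels=["User"], name="Bob")
graph.add_node("tent", labels=["Product"], name="Tent")
graph.add_undirected("IS_FRIENDS_WITH", "alice", "bob")
graph.add_relationship("SELLS", "alice", "tent")
```

Rejecting a relationship whose endpoints are missing is one simple way to enforce the "no dangling relationships" rule at insertion time; real graph databases may enforce it differently.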


 

Rethinking Modeling: How Knowledge Graphs Move Beyond Tables

Relational databases and graphs are often used to represent the same entities, but how they structure and retrieve data differs greatly.

Let's begin by looking at what they have in common. In the same way that nodes store properties in key-value pairs, tables store data in rows that associate values with columns, also known as attributes. So far, so good!

When it comes to facts, relational systems use foreign keys and joins to relate tables to one another. A join simply matches rows on a shared key; it does not say which entity acts on the other, so the link carries no semantic direction. Graphs store facts as triples: a subject node, a relationship type, and an object node. This directionality becomes especially important when modelling complex systems, because knowing whether node A influences B or B influences A is often crucial to understanding the implications.
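As a rough illustration of facts stored as (subject, relationship type, object), the Python sketch below keeps a handful of facts in a list and follows edges in either direction. The `Triple` type, the helper names `objects_of` and `subjects_of`, and the sample facts are assumptions made for this example.

```python
from collections import namedtuple

# One fact: a subject node, a relationship type, and an object node.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

facts = [
    Triple("Alice", "SELLS", "Tent"),
    Triple("Alice", "IS_FRIENDS_WITH", "Bob"),
    Triple("Bob", "IS_FRIENDS_WITH", "Alice"),
]


def objects_of(facts, subject, predicate):
    """Follow edges of one type forward, from a subject to its objects."""
    return [t.object for t in facts
            if t.subject == subject and t.predicate == predicate]


def subjects_of(facts, predicate, obj):
    """Follow edges of one type backward, from an object to its subjects."""
    return [t.subject for t in facts
            if t.predicate == predicate and t.object == obj]


sold_by_alice = objects_of(facts, "Alice", "SELLS")     # what Alice sells
sold_by_tent = objects_of(facts, "Tent", "SELLS")       # empty: a product sells nothing
sellers_of_tent = subjects_of(facts, "SELLS", "Tent")   # who sells the tent
```

Because each fact has a direction, asking "what does Alice sell?" and "what does the tent sell?" give different answers, even though both questions touch the same edge.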

While SQL queries return only what is stored explicitly in tables, knowledge graphs let us infer new facts about the entities. In a simplified picture, relational systems store data, while graphs understand it! Hence the name knowledge graphs.

As an example, let's consider a customer who is shopping for a portable camping tent. A knowledge graph can reason as follows:

  • The tent is tagged with a "camping" label.
  • Other users who shopped in the camping category also interacted with water filters.
  • As a result, the graph can recommend a water filter to the customer.

Even if the user has never interacted with such a product, the graph can infer interests and preferences from connected behaviour. This isn't just a data-driven reaction to past behaviour; it's a context-aware interpretation of users' needs.
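The reasoning in the tent example can be sketched as a short traversal in Python. Everything here is hypothetical: the user names, the products, the category tags, and the `recommend` function are made up for illustration, and a production recommender would be considerably more involved.

```python
from collections import Counter

# Hypothetical edges: product -> category tag.
tagged = {
    "tent": "camping",
    "camp_stove": "camping",
    "water_filter": "hydration",
    "laptop": "electronics",
}

# Hypothetical edges: user -> products they interacted with.
interactions = {
    "customer": {"tent"},
    "user_a": {"camp_stove", "water_filter"},
    "user_b": {"tent", "water_filter"},
    "user_c": {"laptop"},
}


def recommend(customer, interactions, tagged):
    """Rank products that users with overlapping categories interacted with."""
    seen = interactions[customer]
    # Step 1: customer -> products -> categories.
    categories = {tagged[p] for p in seen if p in tagged}
    counts = Counter()
    for user, products in interactions.items():
        if user == customer:
            continue
        # Step 2: keep only users who shopped in one of those categories.
        if any(tagged.get(p) in categories for p in products):
            # Step 3: count their products that the customer has not seen yet.
            for product in products - seen:
                counts[product] += 1
    return [product for product, _ in counts.most_common()]


suggestions = recommend("customer", interactions, tagged)
```

With this sample data, the water filter ranks first because two camping shoppers interacted with it, even though it carries a different tag and the customer never touched it.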

Knowledge graph for buyer recommendation. With the use of a knowledge graph, a system can recommend certain products, relying on the knowledge graph of which user bought what.

Beyond Schemas: Building Meaning from Taxonomies and Ontologies

If the previous example seemed too good to be true, it's because we haven't yet discussed how to construct meaning.

In a similar fashion to how relational databases can leverage schemas to enforce consistency in row entries, knowledge graphs can use a taxonomy to organize labels.

A taxonomy is like a conceptual map that organizes entities into categories, and categories into hierarchies. Think:

  • A "Tent" is a kind of "Camping Gear",
  • which is a kind of "Outdoor Equipment",
  • which falls under "Sports and Recreation", and so on.

In a nutshell, taxonomies encode "is-a" relationships as broader-narrower hierarchies.
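A small Python sketch can show how such a hierarchy is walked. It assumes, for simplicity, that each category has at most one parent; the `IS_A` mapping and the helper names are illustrative, and a graph holding several taxonomies could keep one such mapping per taxonomy.

```python
# Child -> parent ("is-a") links, taken from the example above.
IS_A = {
    "Tent": "Camping Gear",
    "Camping Gear": "Outdoor Equipment",
    "Outdoor Equipment": "Sports and Recreation",
}


def ancestors(category, is_a):
    """Return every broader category, from the nearest to the most general."""
    chain = []
    visited = {category}
    while category in is_a:
        category = is_a[category]
        if category in visited:
            raise ValueError("cycle detected in taxonomy")
        visited.add(category)
        chain.append(category)
    return chain


def is_kind_of(category, candidate, is_a):
    """True if `candidate` is a broader category of `category`."""
    return candidate in ancestors(category, is_a)


tent_ancestors = ancestors("Tent", IS_A)
```

Because "is-a" is followed all the way up the chain, a tent is recognized as "Sports and Recreation" without that fact ever being stored directly.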

A given knowledge graph can, however, include multiple taxonomies, even ones with conflicting hierarchies. Different taxonomies can classify users in several ways: one by family and friendship ties, another by professional relationships, and a third by shared interests. Graphs can embrace multiple "truths".

Taxonomies model classification as vertical chains of inheritance. To go beyond them, we also want to model cross-cutting relationships. That is where ontologies come into play. Ontologies describe relationships among the hierarchies, not just within them. An ontology captures knowledge by formalizing the types of relationships allowed (that is, which edges are available), the constraints on the nodes involved, and the logical properties of each relationship, such as whether it is symmetric or transitive.
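As a loose illustration of these three ingredients, the Python sketch below declares which relationship types exist, which node labels each one may connect, and one logical property (symmetry). The `ONTOLOGY` and `SYMMETRIC` structures and the function names are assumptions for this example; real ontologies are usually written in dedicated languages rather than Python dictionaries.

```python
# Allowed relationship types, each with (required start label, required end label).
ONTOLOGY = {
    "SELLS": ("Seller", "Product"),
    "IS_FRIENDS_WITH": ("User", "User"),
}

# Logical property: these relationship types hold in both directions.
SYMMETRIC = {"IS_FRIENDS_WITH"}

node_labels = {
    "alice": {"User", "Seller"},
    "bob": {"User"},
    "tent": {"Product"},
}


def is_valid(edge, node_labels, ontology):
    """Check an edge (start, type, end) against the ontology's constraints."""
    start, rel_type, end = edge
    if rel_type not in ontology:
        return False  # unknown relationship type
    start_label, end_label = ontology[rel_type]
    return (start_label in node_labels.get(start, set())
            and end_label in node_labels.get(end, set()))


def infer_symmetric(edges, symmetric):
    """Add the reverse edge for every relationship declared symmetric."""
    inferred = set(edges)
    for start, rel_type, end in edges:
        if rel_type in symmetric:
            inferred.add((end, rel_type, start))
    return inferred


stored = {("alice", "IS_FRIENDS_WITH", "bob"), ("alice", "SELLS", "tent")}
closed = infer_symmetric(stored, SYMMETRIC)
```

Here the ontology rejects a product "selling" a seller, and it lets the system derive that Bob is friends with Alice from a single stored edge.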

Like taxonomies, ontologies are not intrinsic to the graph. It's up to us to specify them externally. From a performance standpoint, the cost of a graph traversal depends mainly on the nodes and edges it actually visits, rather than on the total size of the graph. In other words, the speed of a query tends to depend less on the amount of data stored and more on how richly connected the data is, which is shaped by the ontologies applied to it. That is one reason why ontology engineering is a distinct discipline, with specialized languages to define ontologies and public repositories to share them.

Put together, taxonomy and ontology give rise to semantics: shared and machine-readable meaning distributed across nodes. By leveraging the topological structure of nodes and edges, as well as similarities between labels, properties, and relationships, we can build a precise representation of the world. This structure is the foundation for powerful features like entity alignment, where two systems using different vocabularies can still refer to the same real-world entity.

A knowledge graph showcasing entity alignment

Entity alignment is the task of recognizing when two nodes from different graphs (or from different taxonomies within the same graph) refer to the same real-world entity, despite surface differences. It is a powerful way to deduplicate or reconcile entities stored in different systems, a common problem when integrating data from different databases or sources.
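A deliberately naive Python sketch of the idea follows. It aligns nodes from two hypothetical systems by normalizing their names; the `normalize` and `align` helpers and the sample records are assumptions for illustration. Practical systems typically also compare properties, labels, and neighbouring nodes, which this sketch ignores.

```python
import re


def normalize(name):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def align(graph_a, graph_b):
    """Pair node ids from two graphs whose normalized names coincide."""
    index = {}
    for node_id, name in graph_b.items():
        index.setdefault(normalize(name), []).append(node_id)
    pairs = []
    for node_id, name in graph_a.items():
        for match in index.get(normalize(name), []):
            pairs.append((node_id, match))
    return pairs


# Two hypothetical systems that describe overlapping entities differently.
catalog = {"a1": "Lewis Carroll", "a2": "Alice"}
library = {"b7": "Carroll, Lewis", "b9": "Vanity Fair"}

aligned = align(catalog, library)
```

The two records for Lewis Carroll are matched despite the different word order and punctuation, while the unrelated entries are left unpaired.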

 

Conclusion

Wow, we unpacked a lot of information here! If you made it this far, congratulations!

This post provides a high-level overview of knowledge graphs, avoiding the details of their numerous languages and implementations.

The main takeaway is that knowledge graphs are a powerful tool to represent complex systems, and can be used to build context-aware applications. If context is what you are after, then consider applying a knowledge graph to your data.

Knowledge graphs have a wide range of applications, the most notable being:

  • Data catalogs, lineage, and governance
  • Fraud detection systems
  • Master data management and deduplication
  • Metadata management between data lakehouses and warehouses
  • Personalized marketing and recommendation
  • Semantic search engines
  • Supply chain optimization
  • Training of AI and machine learning models

 


At AgileLab, our mission is to elevate the Data Engineering game. If you think your organization could benefit from any of the above applications, or if you are interested in learning more about knowledge graphs, get in touch with one of our Data Engineers today!
