Sunday, July 31, 2016

A Graph is Worth a Thousand Words, Part 1

What are the real advantages of a graph database, neo4j in particular, over a relational database? In other words, what can a graph database do that a relational database can't? This is the burning question developers and analysts ask when deciding whether or not to pick up the new technology. Here is neo4j's own sales pitch.
Why does My Enterprise Need a Graph Database? 
Today’s CIOs and CTOs don’t just need to manage larger volumes of data – they need to generate insight from their existing data. In this case, the relationships between data points matter more than the individual points themselves.
In order to leverage data relationships, organizations need a database technology that stores relationship information as a first-class entity. That technology is a graph database.
Ironically, legacy relational database management systems (RDBMS) are poor at handling data relationships. Their rigid schemas make it difficult to add different connections or adapt to new business requirements.
Not only do graph databases effectively store data relationships; they’re also flexible when expanding a data model or conforming to changing business needs.
Obviously, it is targeted at CIOs and CTOs. If you are a developer or analyst from the RDBMS/SQL world and feel somewhat detached from the above sales pitch, below is my translation for you.

A graph database maximizes discoverability of information

A relational database captures the information of a business system as a mixture of data (entities) and metadata (relationships). Metadata is as important as data. It glues the data together and gives it meaning. Yet it is treated as a second-class citizen. You can discover information (through queries) only from data. You cannot ask questions (query) about metadata. Metadata must be obtained by other (usually costly) means BEFORE you write queries. Here is a simple example to illustrate the point.
This is the universe of information (some movies, people and their relationships) to be captured. There are two different designs.

Both designs capture exactly the same amount of information. The difference is that a portion of the information (the relationship type) is captured as data in design #2 and as metadata in design #1. Because more information is captured as data in design #2, more types of questions can be asked. For example, is Tom Hanks related in any way to Finding Dory? To answer the same question in design #1, it has to be broken down into three smaller ones: 1) Is Tom Hanks an actor in the movie? 2) Is Tom Hanks a director of the movie? 3) Did Tom Hanks rate the movie? Breaking down the question requires prior knowledge of the metadata. Information captured as metadata loses its discoverability. The more complex a business system, the more relationships among its entities, the more information is captured as metadata, and the more discoverability is lost.
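
To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layouts are my own assumptions about what the two designs might look like (one table per relationship type for design #1, a single table with a relationship-type column for design #2), and all table names, column names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Design #1 (assumed layout): one table per relationship type.
# The relationship type lives in the table NAME, i.e. in metadata.
DESIGN1_TABLES = ("acted_in", "directed", "rated")
for table in DESIGN1_TABLES:
    cur.execute(f"CREATE TABLE {table} (person TEXT, movie TEXT)")
# Made-up sample row, for illustration only.
cur.execute("INSERT INTO rated VALUES ('Tom Hanks', 'Finding Dory')")

# Design #2 (assumed layout): one table, relationship type stored as data.
cur.execute("CREATE TABLE related (person TEXT, rel_type TEXT, movie TEXT)")
cur.execute("INSERT INTO related VALUES ('Tom Hanks', 'RATED', 'Finding Dory')")


def related_design1(person, movie):
    """Needs prior knowledge of every relationship table (the metadata)."""
    found = []
    for table in DESIGN1_TABLES:  # one sub-question per known table
        cur.execute(
            f"SELECT 1 FROM {table} WHERE person = ? AND movie = ?",
            (person, movie),
        )
        if cur.fetchone():
            found.append(table.upper())
    return found


def related_design2(person, movie):
    """One query; the relationship type comes back as ordinary data."""
    cur.execute(
        "SELECT rel_type FROM related WHERE person = ? AND movie = ?",
        (person, movie),
    )
    return [row[0] for row in cur.fetchall()]


print(related_design1("Tom Hanks", "Finding Dory"))
print(related_design2("Tom Hanks", "Finding Dory"))
```

Note that related_design1 only works because the list of tables is hard-coded. If a new relationship type is added as a new table, the function silently misses it, while related_design2 keeps working unchanged.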

A graph database, on the other hand, captures ALL information as data, including relationships. Therefore, it maximizes a business system's discoverability, no matter how complex it is. This is why a graph (database) is worth a thousand words (of metadata in RDBMS).
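
As a rough illustration of "relationships as data", here is a toy in-memory graph in plain Python. It is not neo4j's API or storage model, only a sketch of the idea; the node keys, property names and the sample relationship are hypothetical.

```python
# Toy in-memory graph: nodes and relationships are both plain data.
# All identifiers and values below are hypothetical sample data.
nodes = {
    "tom": {"label": "Person", "name": "Tom Hanks"},
    "dory": {"label": "Movie", "title": "Finding Dory"},
}

# Each relationship is a record: (start node, type, end node, properties).
edges = [
    ("tom", "RATED", "dory", {"stars": 4}),
]


def relationships_between(a, b):
    """Return every relationship from a to b, whatever its type."""
    return [
        (rel_type, props)
        for start, rel_type, end, props in edges
        if start == a and end == b
    ]


def relationship_types():
    """The relationship types themselves can be queried like any other data."""
    return sorted({rel_type for _, rel_type, _, _ in edges})


print(relationships_between("tom", "dory"))
print(relationship_types())
```

Neither function needs to know in advance which relationship types exist, which is the discoverability argument in miniature.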