The world of databases is always evolving to fit business and technical needs. This world was once ruled by SQL Relational Database Management Systems (RDBMS), but that has been changing quite rapidly over the last decade, driven by the big data revolution of the decade before it. In what follows I will discuss some use cases, trends and a little ‘theory’ around graph databases, one of the many flavors of NoSQL database to come out of this revolution. In short, they are a tool well suited to working in the ever-expanding data universe and to addressing the problems that arise from its ever-growing complexity.
Graph Database Use Cases
The data universe is highly interconnected. All areas of business incorporate data from other businesses: online retail from social media, social media from online retail, online gaming from social media, finance from online retail and social media, and so on. As the number of sources shared between and amongst businesses grows, the interconnectedness of the data universe grows with it. This generates new relations between the data, in addition to the relations already inherent in many of the data sources themselves.
Graph Database for Financial Services
Investment organizations such as hedge funds are always on the lookout for information that will help them develop and fine-tune investment opportunities. As part of their strategy they want to incorporate the relationships observable and discoverable in online retail data and social data to help guide their long-term and short-term strategies. Because the data arrives in real time, they need to adjust their models as fast as possible and provide real-time guidance. By incorporating this alternative data, the investment organization is able not only to predict its own performance but also to give analysts and executives confidence in their decision making.
Graph Database for Online Retailer
Online retailers strive to provide product recommendations to customers in near real-time and to adjust product pricing on the fly. By harnessing the relationships between users, products, social trend information, and their interactions, online retailers aim to drive consumption through on-demand recommendation engines that generate value in near real-time.
Graph Database for Social Network
The social network is accustomed to dealing with connectivity in its user data. After all, that is its business. What it wants to do now is accommodate advertising, integrate a retail solution, and identify key influencers and the products to which they are most likely to gravitate. This will enable it to serve targeted ads to users, offer select retail options, help guide its mergers and acquisitions department towards potential targets, and much more.
Graph Database for Online Gaming
The online gaming company realizes that most games downloaded are free. They need to make money to stay viable as a company. They have in-game purchases, but these do not always pay all of the bills, so they incorporate mandatory in-game advertising. Their repeat user count is dropping, and the drop is related to the ads, but they can’t afford to remove them. By incorporating content from social media, such as a list of friends and friends of friends (graph data), they are able to identify similar users who are likely to use their game. From this they identify the ads that have the least chance of driving a user away.
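The "friends of friends" lookup mentioned above is a two-hop traversal of the friendship graph. A minimal sketch in Python, assuming a hypothetical in-memory adjacency map (the user names and the `within_hops` and `friends_of_friends` helpers are illustrative and not taken from any real system):

```python
from collections import deque

# Hypothetical friendship graph: each user maps to the set of direct friends.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def within_hops(graph, start, max_hops):
    """Return {user: hop distance} for users within max_hops of start (start excluded)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]
    return distance


def friends_of_friends(graph, user):
    """Users exactly two hops away: friends of friends who are not already friends."""
    return {u for u, hops in within_hops(graph, user, 2).items() if hops == 2}
```

Here `friends_of_friends(FRIENDS, "ann")` yields `dan` and `eve`. A graph database would express the same idea as a bounded-depth traversal in its own query language.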
All four of these cases are representative of future business needs around relational data. They are a mix of batch and real-time multi-relationship queries, such as are found in recommenders or clustering analyses. Traditional approaches are unfit for handling the deep relationships required by these queries, but the native structure of graph databases is designed specifically with these use cases in mind.
As a visual aid, take the graph below. The cluster of nodes (circles) on the left, with edges (lines) connecting them, is representative of some relationships typically found in a social network’s graph. The cluster on the right is the same, but for an online retailer. The text inside a node indicates the kind of data represented by that node, and the text above an edge indicates the relationship between two nodes. Between the two clusters we see five edges: three connecting the online retailer’s graph to the social network’s (Has_Social_Profile, Is_Friend_Of_Social, and Has_Social_User) and two connecting the social network’s graph to the online retailer’s (Has_Product, Has_Online_User). Note that these edges would not necessarily give either business full knowledge of the other’s graph. Each would only see the graph to the depth for which it has been given access or permission.
The online retailer could then use this information to derive better product recommendations based on a social profile’s contents and, as its picture of the social network grows, make recommendations based upon what friends of the user like.
In the same way, a social network could improve its click-based ad revenue by serving better-targeted ads, knowing which products on the online retailer are related to the businesses that a Business user has on the retailer’s site. This could be improved further by using information related to the online user account, such as search results.
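To make the retailer’s side of this concrete, here is a minimal sketch of a recommendation that walks from an online user, through the linked social profile, to what that profile’s friends like. Only `Has_Social_Profile` comes from the figure described above. The `Is_Friend_Of` and `Likes` relationship names, the node identifiers, and the `targets` and `recommend` helpers are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical edges as (source, relationship, target) triples.
EDGES = [
    ("online_user:1", "Has_Social_Profile", "social_user:a"),
    ("social_user:a", "Is_Friend_Of", "social_user:b"),
    ("social_user:a", "Is_Friend_Of", "social_user:c"),
    ("social_user:a", "Likes", "product:stove"),
    ("social_user:b", "Likes", "product:tent"),
    ("social_user:c", "Likes", "product:tent"),
    ("social_user:c", "Likes", "product:stove"),
]


def targets(edges, source, relationship):
    """Nodes reached from source by following one edge of the given relationship."""
    return [t for s, r, t in edges if s == source and r == relationship]


def recommend(edges, online_user):
    """Rank products liked by friends of the user's social profile, most liked first."""
    counts = Counter()
    for profile in targets(edges, online_user, "Has_Social_Profile"):
        already_liked = set(targets(edges, profile, "Likes"))
        for friend in targets(edges, profile, "Is_Friend_Of"):
            for product in targets(edges, friend, "Likes"):
                if product not in already_liked:
                    counts[product] += 1
    return [product for product, _ in counts.most_common()]
```

The sketch scans a flat edge list at every step, which is fine for a handful of edges. The point of a graph database is to make each of these hops cheap on much larger data.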
Graph Database Trends
Graph databases are one of the hottest technologies at this time. Although they do not supplant RDBMS in all situations, they are proving useful as the persistence layer of many applications and in answering specific analytics questions. As folks continue to innovate, the number of applications continues to grow. Of the many possibilities (a short Google query away), one of the most exciting areas is data warehousing.
To discuss how graph databases are useful for data warehousing, it helps to begin by introducing the concept of a DAG (Directed Acyclic Graph). A DAG is a graph in which every relationship has a single direction and no path of relationships leads back to the node it started from. Directed relationships are, in fact, part of how many graph databases are implemented internally.
Now, armed with this knowledge, where in the world of data warehousing could graph databases come into play?
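As a small illustration of the "acyclic" part of the definition, the sketch below checks whether a set of directed relationships forms a DAG. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The `is_dag` helper and the example graphs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(successors):
    """Return True if the directed graph has no cycle.

    `successors` maps each node to the nodes its outgoing edges point at.
    TopologicalSorter interprets the mapping as node -> predecessors, but
    reversing every edge does not create or remove cycles, so the check holds.
    """
    try:
        # static_order() raises CycleError when the graph contains a cycle.
        list(TopologicalSorter(successors).static_order())
        return True
    except CycleError:
        return False


chain = {"a": ["b"], "b": ["c"], "c": []}   # a -> b -> c
loop = {"a": ["b"], "b": ["a"]}             # a -> b -> a
```

With these inputs, `is_dag(chain)` is true and `is_dag(loop)` is false.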
In a data catalog, the properties relating one data set to another could be:
- Used By
- Created By
- Has PII
- Derived From
- Has Comments
- Recommended By
- Good For (relationship to types of use – accounting, analytics, sales, …)
All of these relations from the catalog’s DAG can be created and maintained as relationships in a graph database, enabling efficient queries across the catalog. Airbnb has gone down this road and written about it in this blog post.
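A minimal sketch of what such catalog queries might look like, using the relationship names from the list above. The data set names, the flat triple layout, and both helper functions are hypothetical, and the sketch assumes that anything derived from a data set with PII should be treated as potentially carrying it.

```python
# Hypothetical catalog entries as (subject, relationship, object) triples.
CATALOG = [
    ("raw_orders", "Has PII", "email"),
    ("clean_orders", "Derived From", "raw_orders"),
    ("daily_sales", "Derived From", "clean_orders"),
    ("daily_sales", "Used By", "finance_team"),
    ("clean_orders", "Good For", "analytics"),
]


def derived_from(catalog, dataset):
    """All data sets directly or transitively derived from `dataset`."""
    found = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for subject, relationship, obj in catalog:
            if relationship == "Derived From" and obj == current and subject not in found:
                found.add(subject)
                frontier.append(subject)
    return found


def consumers_downstream_of_pii(catalog):
    """Everyone using a data set that has PII or is derived from one that does."""
    consumers = set()
    for source in {s for s, r, _ in catalog if r == "Has PII"}:
        for dataset in derived_from(catalog, source) | {source}:
            consumers |= {o for s, r, o in catalog if s == dataset and r == "Used By"}
    return consumers
```

In a graph database the repeated scans over `CATALOG` would be replaced by a traversal along `Derived From` and `Used By` relationships.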
Three approaches can be taken for managing the workflow for processing data:
- Process completion triggers downstream processes (e.g. Apache Airflow)
- Data change triggers changes in dependent data
- Both of the above
Regardless of the approach taken, the underlying model is a DAG, which maps naturally onto a graph database. With a graph database as the backend, analytics teams can identify hotspots in the processing and work with engineers to optimize performance.
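The sketch below shows, under stated assumptions, how a workflow DAG yields both a valid run order and a "hotspot" in the form of the slowest chain of dependent tasks. The task names and run times are invented, `graphlib` is the Python standard library module (Python 3.9 or later), and nothing here reflects how Apache Airflow or any graph database implements scheduling.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
DEPENDS_ON = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}
# Hypothetical run times in minutes.
MINUTES = {"extract": 5, "clean": 10, "aggregate": 3, "train_model": 40, "report": 2}


def run_order(depends_on):
    """One valid execution order: every task appears after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def critical_path(depends_on, minutes):
    """Return (slowest dependency chain, its total duration)."""
    finish = {}            # earliest finish time of each task
    slowest_upstream = {}  # the dependency that finishes last, or None
    for task in run_order(depends_on):
        upstream = max(depends_on[task], key=finish.__getitem__, default=None)
        slowest_upstream[task] = upstream
        finish[task] = minutes[task] + (finish[upstream] if upstream is not None else 0)
    task = max(finish, key=finish.get)
    path = []
    while task is not None:
        path.append(task)
        task = slowest_upstream[task]
    path.reverse()
    return path, finish[path[-1]]
```

For this example the critical path runs through `train_model` and takes 57 minutes, which points at that task as the first candidate for optimization.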
Data lineage has historically not been supported as native functionality in many graph databases. They are, however, an excellent option for modeling it, for instance with time-based versioned graphs (example, example). With a well-designed graph model, it is easy to track where data is being used, who is using it, the dates on which it was used, and so on. As these relationships are all “from-to”, the graph is a DAG.
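One way to picture a time-based versioned graph is to give every lineage edge a validity interval and to filter on it while traversing. The sketch below assumes that simple model. The `LineageEdge` class, the data set names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LineageEdge:
    source: str                       # upstream data set
    target: str                       # data set derived from source
    valid_from: date                  # first day the edge applies
    valid_to: Optional[date] = None   # first day it no longer applies; None = still current

    def active_on(self, day):
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


LINEAGE = [
    LineageEdge("crm_export", "customers", date(2020, 1, 1), date(2021, 6, 1)),
    LineageEdge("crm_api", "customers", date(2021, 6, 1)),
    LineageEdge("customers", "churn_report", date(2020, 3, 1)),
]


def upstream(edges, dataset, day):
    """All data sets that `dataset` was derived from, as the lineage stood on `day`."""
    found = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for edge in edges:
            if edge.target == current and edge.active_on(day) and edge.source not in found:
                found.add(edge.source)
                frontier.append(edge.source)
    return found
```

Asking for the upstream sources of `churn_report` at the end of 2020 returns `customers` and `crm_export`, while the same question in 2022 returns `customers` and `crm_api`.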
Master data management (MDM) is a very complex and, some might say, contentious topic. But from a high level, think of MDM in a graph database setting as enabling:
- Analysis of completeness
- Avoidance of costly joins for deep entries
- Simplification of rules that require joins
It may not be beneficial or even necessary to use a graph database if the situation is simple, without complex internal dynamics. Otherwise, it is worth considering a graph database.
Types of Graph Databases
Not all graph databases are made the same, and not everything sold as a graph database is one at its core. Let me explain.
Graph databases are generally broken down into three types: Labeled Property Graph (LPG), Resource Description Framework (RDF) and Hypergraph.
LPG databases have the simplest syntax and are the easiest to understand from an analyst’s point of view and to use from a developer’s. Conceptually they are a candidate replacement for RDBMS, thanks to their simple relational nature and their removal of primary key and foreign key pollution (no indexing required to follow a relationship), but only time will tell. Most of the popular graph databases are either LPG native or actively support LPG. For a list of some graph databases and processing engines that are LPG native or have LPG support, see the Gremlin wiki entry or the TinkerPop website.
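To show what "labeled property graph" means in practice, here is a toy in-memory model: nodes carry labels and key-value properties, and edges carry a type and their own properties. The `Node`, `Edge` and `PropertyGraph` classes are hypothetical and do not mirror the API of any particular LPG database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    edge_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.out_edges = {}  # node_id -> list of outgoing Edge objects

    def add_node(self, node_id, *labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)
        self.out_edges.setdefault(node_id, [])

    def add_edge(self, source, edge_type, target, **properties):
        # Both endpoints must already exist as nodes.
        self.out_edges[source].append(Edge(source, target, edge_type, properties))

    def neighbors(self, node_id, edge_type):
        """Nodes reached by following outgoing edges of the given type."""
        return [self.nodes[e.target] for e in self.out_edges[node_id] if e.edge_type == edge_type]


graph = PropertyGraph()
graph.add_node("u1", "User", name="Ann")
graph.add_node("p1", "Product", title="Tent")
graph.add_edge("u1", "PURCHASED", "p1", quantity=2)
```

Note that the relationship is stored on the node it leaves from, so following it does not require a key lookup in a separate table.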
RDF databases are based upon the more complex subject-predicate-object (SPO) triple model and were designed with web usage in mind. They are generally larger than LPG databases for the same amount of data. A list of vendors can be found on this wiki.
The hypergraph concept never really took off. It was a more complex alternative to the LPG and only has JVM support. The main open source database with this technology can be found here, and it hasn’t had a release in over 3 years.
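For comparison, the sketch below stores a tiny graph as subject-predicate-object triples and queries it by pattern matching. The `ex:` identifiers and the `match` helper are made up for illustration, and plain Python tuples stand in for a real RDF store.

```python
# Hypothetical data: every fact, including a simple attribute such as a name,
# is its own (subject, predicate, object) statement.
TRIPLES = {
    ("ex:ann", "rdf:type", "ex:User"),
    ("ex:ann", "ex:name", "Ann"),
    ("ex:ann", "ex:purchased", "ex:tent"),
    ("ex:tent", "rdf:type", "ex:Product"),
    ("ex:tent", "ex:title", "Tent"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern, sorted; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )
```

Because attributes are statements rather than properties attached to a node, the same information takes more entries than in a property graph, which is one plausible reason for the size difference mentioned above.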
Native vs. Non-Native
Some vendors make a point of stating that their solution uses graph-native storage. The argument is that native storage is more performant and that the product was designed as a graph database first. Conversely, when a graph database does not use native storage, its graph capabilities are an extension of an underlying database. This is the case when SQL solutions are extended with graph capabilities at the language level, and likewise for NoSQL solutions such as columnar stores. Non-native solutions also tend to be much less performant than native ones.
The latest trend with graph databases, and NoSQL databases in general, is multi-model support. This is a topic for another post, but suffice it to say that these databases, regardless of their storage, are generally designed to support both graph and document/key-value storage. They are attractive solutions because they serve graph-based needs and more generic NoSQL needs simultaneously. Some of the more popular solutions in this space are OrientDB, ArangoDB, DataStax and SAP HANA. This tutorial is a great introduction to the topic, but be forewarned, it is a technical session.