Has Big Data Killed the Star Schema? — Eric Linden
If you are versed in the concepts of data warehousing, you probably know the star schema quite well. Though not without its faults, the star schema is one of the most effective data mart designs for running simple queries. This is largely due to the simplicity of the model itself. However, as big data continues to boom and evolve, those of us who have been in the industry since before it became “sexy” are left wondering: does the star schema still have a place in the data ecosystem of today?
Before we can answer the million-dollar question, there is one other big question we must ask ourselves: What’s happening out there?
Well, for starters, “big data” has become such a popular buzzphrase that it has almost lost all discernible meaning – almost. No matter what the term means to you, there’s probably a certain attitude that you associate with it: “there’s no such thing as too much.” This particularly applies to storage.
Someone stuck in the big data mindset might think, “Storage is so ridiculously cheap that it doesn’t matter how much data is replicated. So, I’m just going to merge and store data wherever and whenever is necessary.” The trouble is that, over time and with an increasing diversity of data sources, this approach creates a big mess that a data warehouse with data marts built on star schemas just can’t handle! Anyone attempting to mine data this way will inevitably become very confused.
A reasonable objection to this assertion would be that big data rarely has to deal with data that gets updated. We’re talking about massive volumes of transactions, trades, hits, impressions, sensor observations, and the like. Most of this data is static, but some transactional data can still be updated or even deleted, and that presents challenges in a big data environment.
Additionally, transactional data is much more valuable when it can be linked to other related datasets – datasets that may elude the purpose-built star schema. These related datasets are commonly reference data, which changes over time at varying frequencies. Without the right set of techniques, dealing with changing reference data can be a big problem.
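One common technique for changing reference data is to keep every version of a record along with the dates it was in effect, often called a Type 2 slowly changing dimension. The sketch below is only an illustration of that idea, not a prescribed method: the `CustomerVersion` class, the `segment` attribute, and both helper functions are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    """One version of a (hypothetical) customer reference record."""
    customer_id: int
    segment: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[CustomerVersion], customer_id: int,
                 new_segment: str, effective: date) -> None:
    """Close the current version of the record and append a new one."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            row.valid_to = effective
    history.append(CustomerVersion(customer_id, new_segment, effective))


def version_as_of(history: List[CustomerVersion], customer_id: int,
                  when: date) -> Optional[CustomerVersion]:
    """Return the version that was in effect on the given date, if any."""
    for row in history:
        if (row.customer_id == customer_id
                and row.valid_from <= when
                and (row.valid_to is None or when < row.valid_to)):
            return row
    return None


# Assumed sample data: customer 1 moves from "retail" to "enterprise".
history = [CustomerVersion(1, "retail", date(2020, 1, 1))]
apply_change(history, 1, "enterprise", date(2022, 6, 1))

# A transaction is linked to the version in effect on its own date.
old = version_as_of(history, 1, date(2021, 3, 15))
new = version_as_of(history, 1, date(2023, 1, 10))
print(old.segment, new.segment)  # prints: retail enterprise
```

Because old versions are closed rather than overwritten, a transaction from 2021 still links to the reference data as it stood in 2021, which is the property that plain overwrites lose.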
We’ve established that star schemas aren’t optimal for running large-scale analytics and trending. If that’s your endgame, a well-architected data lake is a better option. But not every consumer of data needs to run analytics. Data architects and engineers who are designing data organization solely for presentation purposes will still find the star schema to be a viable model, and the consumers of said data should have no problem making use of it.
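To make the presentation case concrete, here is a minimal sketch of a star schema and the kind of simple query it serves well, using Python’s built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), their columns, and the sample rows are all assumptions invented for illustration.

```python
import sqlite3

# In-memory database holding one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# Every dimension joins directly to the fact table, so a report is
# one join per dimension followed by a GROUP BY.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()

for category, year, total in rows:
    print(category, year, total)
```

The query shape stays the same no matter which dimensions a report needs, which is a large part of why consumers of presentation-layer data find the model easy to use.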
So, has big data killed the star schema? It hasn’t yet, and it is my opinion that it never will. While it may not be the data ecosystem staple that it once was, there’s something about its simplicity that will never go out of style.
Eric Linden is a Principal Consultant at Caserta. He has been in the data industry for over 30 years, with 25+ years of experience in RDBMS/Business Intelligence. He recently transitioned to big data and cloud storage & computing, and now serves as a Solutions Architect for many of our most high-profile clients.