Data Warehousing
Enterprise Data Warehousing is still relevant.
The “big data” era is still very much upon us, ushering in an age of constantly evolving technologies and techniques. Many wonder whether the enterprise data warehouse (EDW) still has relevance in the industry, particularly since many new alternatives exceed the technical capabilities of the traditional EDW at a drastically reduced cost. A year ago, I wrote that the EDW is still a sound concept, albeit one that needs to evolve. That sentiment is still true today; however, the rise of several groundbreaking technologies in the last year makes it clear that meaningful evolution is occurring.
At the core, all EDWs share the central concepts of integration and consolidation of data from disparate sources while governing that data to provide reliability and trust, enabling credible reporting and analytics. But in today’s world of high demand for analytics-driven business decisions, is credible reporting enough?
The Legacy of the Data Warehouse
Over the past few decades, implementing an EDW has proven to be a worthy but formidable undertaking, with a relatively low success rate. In fact, generally accepted survey data indicates that 70% of data warehouses ultimately fail. Of the 30% of “successful” EDWs, many will never achieve ROI or strong user acceptance. EDW failures can largely be attributed to legacy interpretations of the design and to the traditional waterfall software development lifecycle (SDLC) approach. A current trend that is helping EDW projects succeed is the use of more modern, agile techniques. These techniques allow EDW implementations to grow naturally and remain malleable as the central data and business requirements evolve.
Another point of failure is that the traditional EDW does not fulfill all of the data analytics needs of a modern organization. Many organizations—particularly large corporations—view the EDW as the sole solution for all data analytics problems. Consumers of data have been conditioned to believe that if they want analytics support, their only choice is to integrate data and business processes into the EDW program. This often leads to a situation in which extreme IT effort is put into modeling and loading new subject areas into a rigid, governed system before the true requirements and value of the data are known.
In many cases, the core design and technology of the EDW alone simply cannot solve the business problem at hand. New requirements, such as analysis of semi-structured, semi-governed data, streaming data from IoT devices, network analysis, search, and data discovery and exploration, are all ill-served by traditional EDW methods backed by relational database technology.
Years ago, data mostly came from rigid, on-premises transaction systems backed by relational database technology. However, use cases such as those listed above have become more common in the era of big data. The on-premises systems still exist, but many have moved to the cloud as SaaS offerings. Additionally, many no longer run on relational platforms, and we often interact with them via APIs that return JSON or XML.
Not only have the technologies of our data sources changed, but there are now new data sources, such as social media, sensor and machine data, system logs, and even video and audio. These sources are not only producing data at incredibly rapid rates and with an inherent mismatch to the relational model, but frequently there is also no internal ownership of the data, creating difficulties in governance and conformance to a rigid structure.
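To make that mismatch concrete, here is a small, purely illustrative sketch of the kind of nested JSON a SaaS or sensor API might return, and the explicit flattening needed before it fits relational-style rows. The field names and structure are assumptions, not any particular vendor's format.

```python
import json

# Hypothetical JSON response from a SaaS sensor API; the structure is illustrative only.
payload = json.loads("""
{
  "device": {"id": "sensor-42", "site": "plant-7"},
  "readings": [
    {"ts": "2016-09-01T00:00:00Z", "temp_c": 21.4},
    {"ts": "2016-09-01T00:05:00Z", "temp_c": 21.9}
  ]
}
""")

# Flattening nested, repeating groups into relational-style rows takes explicit work,
# which is exactly the friction a traditional EDW load process runs into.
rows = [
    {
        "device_id": payload["device"]["id"],
        "site": payload["device"]["site"],
        "ts": reading["ts"],
        "temp_c": reading["temp_c"],
    }
    for reading in payload["readings"]
]

for row in rows:
    print(row)
```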
As the business evolves from conventional and executive wisdom, through data-supported decision making, to an analytics-driven enterprise, the technical data ecosystem must evolve into a modern engineered platform if the business is to remain competitive.
The Big Data Revolution is Real
In response to new business demands, there has been substantial disruption in IT surrounding the tools and techniques used to store and process data. These innovations were created by relatively new technology companies and continue to evolve as they are embraced and expanded upon by other organizations with their own unique data challenges.
The changes brought on by the big data era are not caused only by access to larger amounts of data; the true catalyst is that departments within organizations—with all types of data—now approach data problems in ways that are tailored to their specific departmental needs. It’s no longer a one-size-fits-all enterprise venture that requires that each need be molded into a traditional monolithic system. IT organizations must support departments to objectively design and build analytics systems based on their specific business and data requirements, not on preconceived design approaches. Of course, a larger variety of options in the technology landscape are required, and so a comprehensive reference architecture for this new data ecosystem is critical for manageability.
Engineer and author Martin Fowler coined the term “polyglot persistence” in reference to this movement. He defines polyglot persistence as the situation “where any decent sized enterprise will have a variety of different data storage technologies for diverse kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we’ll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.”
In other words, it’s natural for an organization to adopt a variety of new storage and data processing technologies based on requirements. The concept is an extension of “polyglot programming” and “microservice architecture,” where languages and platforms are chosen based on the ability to tackle different types of problems.
The Lasting Relevance of the Data Warehouse
Rest assured that the data warehouse is still relevant in this new era—but it is not alone.
Once accepted by the mainstream, big data technologies such as Hadoop were picked up by various organizations to solve the most challenging data problems. Most started out as a proof of concept, and then frequently launched in a production-like capacity. Unfortunately, though, many were built completely in silos, with no regard for enterprise architecture.
“Data quality” and “data stewardship” were treated as dirty words, and concepts from the “old way” were almost completely ignored in design and implementation. Ultimately, many of these siloed systems suffered from service-level issues and interoperability challenges with other systems, and there was a general lack of trust among consumers of the data. The concepts behind data warehousing are now becoming critical again, particularly as they apply to big data systems. Analytic systems still need data governance, data quality, and data stewardship. Conformed master data and interoperability between applications still matter. The traditional EDW continues to have a place in modern data architecture, though its primary function has shifted.
Enter Agile and Collaboration
Historically, when building data solutions, the waterfall approach has been used to plan and manage projects. In this way, as development progresses through the phases of the project, data analysts, data engineers, and data scientists complete their work in separate tasks daisy-chained together. Collaboration between these domains is the next step toward the future.
Using agile methods, scrum teams will become more common in the data space. Collaboration between a newly formed chief data office, the business product owner, IT architects, and data scientists and/or analysts is required to formulate and implement data solutions quickly enough for the business to realize their value. Data initiatives that lack members with experience and expertise in the various required domains are doomed to failure.
Technologies that enable collaboration are already being embraced by startups, and within the next year I suspect more and more established organizations will adopt them. There is evidence that the trend toward data analytics collaboration has come to fruition in the form of Apache Spark. Spark’s unified framework for data engineering and data science has made it the default choice for teams that need a single interface for data discovery, exploration, ETL, and machine learning.
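As a rough sketch of what that unified interface looks like, the PySpark snippet below performs a small data-engineering step and then hands the same DataFrame to MLlib within one session. The column names and inline data are invented purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("unified-notebook-sketch").getOrCreate()

# Data engineering step: build and clean a small DataFrame.
# (Inline data keeps the sketch self-contained; in practice this would be read
# from the data lake.)
df = spark.createDataFrame(
    [(34.0, 1, 0), (120.5, 4, 1), (15.2, 0, 0), (98.7, 3, 1)],
    ["order_total", "support_tickets", "churned"],
).withColumn("churned", F.col("churned").cast("double"))

# Data science step, in the same session and the same notebook interface.
assembler = VectorAssembler(
    inputCols=["order_total", "support_tickets"], outputCol="features"
)
model = LogisticRegression(featuresCol="features", labelCol="churned").fit(
    assembler.transform(df)
)
print(model.coefficients)

spark.stop()
```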
Vendor selection for big data technologies is no longer the land of startups, as Spark is now embraced by some of the largest technology companies in the world. IBM, for example, recently announced the launch of DataWorks, a Spark-oriented platform. DataWorks is designed to cater to five distinct personas: data engineer, data scientist, business analyst, app developer, and chief data officer. The collaboration is run on Jupyter-based notebook technology extended by IBM, and all five personas will be able to share work within the same interface. Notebooks contain data, code, documentation, annotation, and visualization. I predict notebooks will soon become the new BI toolkit.
The Corporate Data Pyramid
As previously noted, a rigid, highly governed, traditional EDW is not the solution for all problems. It now needs to integrate and coexist with other analytic applications. The EDW will remain a critical component, but it will not be the only significant component.
Dimensional data models still prevail in the world of relational databases, but the legacy assumption of the data warehouse—that all things data are done within a relational database—is gone. The new trend is to use a different, more specialized tool for each piece of the job. For example, the ETL does not need to be done within the RDBMS; much of the data preparation work for the data warehouse environment is now done outside of the relational database. In fact, the relational database is now primarily limited to data presentation. This is indicative of a paradigm shift toward the concept of the “corporate data pyramid” (CDP). The CDP comprises four tiers: the data landing, the data lake, the data workbench, and data presentation.
The Data Landing
The data landing area is a quarantined area of the data lake that stores all data in full fidelity, exactly as it is received from its source. Data is stored and partitioned to simplify data management, typically by source and by data management dates, and it can be archived after a predetermined retention period as needed. Data in the landing area has very little data governance; it is essentially locked down from all users except the managed processes that make lightly governed datasets available within the data lake.
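As a minimal sketch of that layout, under assumed paths and column names rather than any prescribed standard, a Spark job might land each feed partitioned by source system and ingest date:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("landing-sketch").getOrCreate()

# Hypothetical raw feed; in practice this would be whatever arrives from the source.
raw = spark.createDataFrame(
    [("crm", '{"account": 1}'), ("crm", '{"account": 2}')],
    ["source_system", "payload"],
)

# Stamp each record with the date it was received, then write it untouched,
# partitioned so retention and archiving can be managed per source and per day.
# The target path is illustrative only.
(raw.withColumn("ingest_date", F.current_date())
    .write.mode("append")
    .partitionBy("source_system", "ingest_date")
    .parquet("/landing/raw"))
```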
The Data Lake
The data lake encompasses the lower tiers of the pyramid. Many companies with some experience in this area are discovering that deeper data analytics and data science can be better done without a structured environment. In fact, to save time and money, businesses are now requiring that this analysis be done before the rigorous process of structuring, formatting, and fully governing the data. The data lake includes feeds of raw events from the landing area, as well as enrichment and reference data that may come from the existing data warehouse or a master data management system. All data is tagged with metadata and can be searched via a data catalog.
The data lake is a great tool for data discovery, exploration, deep analytics, and data science. Typically, data is accessed and manipulated in the data lake via notebooks. Languages such as Python, Scala, SQL, or R are used within a Spark framework.
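As a rough illustration of that workflow, a notebook cell for discovery in the lake might mix DataFrame and SQL access through PySpark, as below; the lake path, table name, and columns are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration-sketch").getOrCreate()

# Load a lightly governed dataset from the lake (the path is hypothetical).
events = spark.read.parquet("/lake/events")
events.createOrReplaceTempView("events")

# A typical discovery pattern: inspect the schema, then profile the data with SQL.
events.printSchema()
spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
    LIMIT 20
""").show()
```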
The Data Workbench
The data workbench is an area of ephemeral workspaces governed by information lifecycle management (ILM) rules that dispose of any stored data (and infrastructure, if in the cloud) after a predetermined time has passed. The data workbench is primarily used by data analysts and data scientists who may be doing data discovery or building predictive models and do not need the resulting data stored in perpetuity. The automated ILM process keeps the workbench area clean and prevents the creation of unsupported shadow IT systems. Any data created in the workbench that is deemed production worthy and made available to other users or applications must go through a rigorous data governance and promotion process before being moved to the data lake or data warehouse.
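One lightweight way to picture the automated ILM rule is a scheduled job that removes workspace directories older than a fixed retention window. The root path and the 30-day window below are illustrative assumptions, not a required policy.

```python
import shutil
import time
from pathlib import Path

WORKBENCH_ROOT = Path("/workbench")   # hypothetical root of ephemeral workspaces
RETENTION_DAYS = 30                   # assumed retention window

def expire_workspaces(root: Path, retention_days: int) -> None:
    """Delete any workspace directory whose last modification is past retention."""
    cutoff = time.time() - retention_days * 24 * 3600
    for workspace in root.iterdir():
        if workspace.is_dir() and workspace.stat().st_mtime < cutoff:
            shutil.rmtree(workspace)

if __name__ == "__main__":
    expire_workspaces(WORKBENCH_ROOT, RETENTION_DAYS)
```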
The Data Presentation
The presentation layer is a fully structured, fully governed environment with its data prepared, formatted, and stored specifically to support arbitrary user queries. This is the area where the average business user accesses data to create and run reports, dashboards, and visualizations. The presentation layer is the “top” tier of the data pyramid, with the assumption that there is much data within the data lake that does not make its way to the fully structured presentation tier. The presentation database can be your existing data warehouse using traditional RDBMS and MPP technologies. However, as SQL-compliant technologies within the big data ecosystem mature, I predict this will become the new standard for presenting data.
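For a sense of the report-style queries this tier is built to serve, here is a simple star-schema aggregation issued through Spark SQL; the fact and dimension tables are tiny, hypothetical stand-ins so the sketch is self-contained.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("presentation-query-sketch").getOrCreate()

# Tiny stand-ins for presentation-layer tables; in practice these would already
# exist as governed, conformed tables.
spark.createDataFrame(
    [(20160901, 150.0), (20160902, 75.0)], ["date_key", "sales_amount"]
).createOrReplaceTempView("fact_sales")
spark.createDataFrame(
    [(20160901, "2016-09"), (20160902, "2016-09")], ["date_key", "calendar_month"]
).createOrReplaceTempView("dim_date")

# A typical user query: total sales by month across a star-schema join.
spark.sql("""
    SELECT d.calendar_month, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date d ON f.date_key = d.date_key
    GROUP BY d.calendar_month
    ORDER BY d.calendar_month
""").show()
```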
The Data Glue
Extract, transform, and load is still required within a modern engineered data ecosystem. The acceptance of Spark by large companies such as IBM is strong evidence that Spark may become the new standard for data transformation and preparation for analytics. Although not required, the ability to use notebooks as a common interface may make Spark especially attractive in industries where people with varying levels of technical understanding need access to the data. New-generation ETL tools are now being built from the ground up with Spark as their core operating platform. Look for more growth in this area in the coming years.
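As a minimal sketch of Spark in that glue role, under assumed paths, column names, and conformance rules, a pipeline might read lightly governed data from the lake, apply basic rules, and publish a governed dataset for the presentation tier:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-glue-sketch").getOrCreate()

# Extract: lightly governed orders from the data lake (hypothetical path).
orders = spark.read.parquet("/lake/orders")

# Transform: basic conformance rules before the data is allowed into presentation.
conformed = (
    orders
    .dropDuplicates(["order_id"])
    .filter(F.col("order_total") >= 0)
    .withColumn("order_date", F.to_date("order_ts"))
    .select("order_id", "customer_id", "order_date", "order_total")
)

# Load: publish to the governed presentation area (hypothetical path).
conformed.write.mode("overwrite").parquet("/presentation/orders")
```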
Future of Enterprise Data Warehousing
With the help of the continuing big data era, the data warehouse is making greater strides toward breaking free of its rigid confines. As we noted last year, a change in mindset is even more essential to catalyzing this shift. In the next year, larger, traditional companies will come to the realization that we are experiencing evolution at breakneck speed and that we need to be agile and collaborative while developing solutions to different data problems. Moving data solutions to the cloud is also a trend that is picking up speed and will see hockey-stick growth in the coming years.
Even as the world of data is turned upside down, from the paradigm of “structure → load → analyze” to “load → analyze → structure,” the fundamental principles of traditional data warehousing and ETL will not be forgotten. The fundamentals of data governance and integration remain crucial, even in the world of the cloud, big data, and modern data engineering.