How to Build & Operate a Data Platform
Quickly deliver high-quality data-as-product widgets at scale
By Daniel Block, Chief Architect at Caserta
It is the managing and solving of these competing priorities that will paralyze or accelerate a data organization’s ability to deliver.
This weekend my family got carry-out from a different pizza place than usual. It was rush hour on Friday afternoon, and our normal pizza place is currently on the far side of a very busy road construction zone.
Phoning in my order, I requested our “usual”: one large combo – no olives – and one medium gluten-free combo – no olives. Their response: “Sorry, sir, gluten-free is only available in a small.”
I immediately felt a camaraderie with hungry internal customers at data-constrained organizations that struggle to deliver on-demand data products. All of the raw ingredients are out there for a medium-sized gluten-free pizza crust. Why couldn’t the kitchen just source them and put them together for me?
Assembling and delivering data-as-product involves the same challenges as assembling and delivering a meal. Both involve gathering ingredients, processing and premixing some ingredients, and then assembling everything into the final product. In fact, building and operating a data platform has so many parallels to building and operating a kitchen platform that the latter can be used to inform the former. So what makes it so hard?
Your data organization must simultaneously possess and operate four distinct and competing(!) capabilities:
- It must respond quickly
- It must deliver quality
- It must never forget
- It must be able to do these three things forever
Note that these capabilities are primarily provided by people and processes – but a little technology will be sprinkled in at the right places. These capabilities are very different from the challenges of “the four Vs of data: variety, velocity, volume, and veracity”. Handling the four Vs of data must be an assumption.
Let’s define these capabilities in detail:
1. Respond Quickly
This can be summarized as time-to-value. When a customer realizes they need a new data product (a widget), the data organization can respond quickly and deliver (at least) an alpha version of that widget in a short period of time. Usually, the customer isn’t quite sure what exactly they want, but an alpha version of the widget is a great conversation starter to help uncover additional requirements.
2. Deliver Quality
This is robustness, resilience, and meeting expectations. The widget must not break a month after it is delivered. If an upstream change causes downstream issues with the widget, the data organization must realize it before the customer does. The widget must not become stale and must adhere to implicit or explicit SLAs on refresh schedules.
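One way to catch staleness before the customer does is an automated freshness check against each widget’s refresh SLA. The sketch below is illustrative, not from the article: the widget names, SLA windows, and the `stale_widgets` helper are all assumptions about how such a check might be wired up.

```python
from datetime import datetime, timedelta, timezone

def stale_widgets(last_refreshed: dict, sla: dict, now: datetime) -> list:
    """Return widgets whose last successful refresh is older than their
    SLA window allows, so the data team can alert before the customer
    notices. Widgets without an explicit SLA default to 24 hours."""
    return sorted(
        name for name, ts in last_refreshed.items()
        if now - ts > sla.get(name, timedelta(hours=24))
    )

# Hypothetical refresh timestamps and SLAs for two widgets.
now = datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc)
refreshed = {
    "sales_report": now - timedelta(hours=2),   # refreshed recently
    "inventory_mart": now - timedelta(days=3),  # well past its SLA
}
slas = {"sales_report": timedelta(hours=6), "inventory_mart": timedelta(days=1)}

print(stale_widgets(refreshed, slas, now))  # ['inventory_mart']
```

Run on a schedule, a check like this turns an implicit SLA into an explicit, monitored one.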
3. Never Forget
This capability has two dimensions. First, don’t forget any part of any data object, ever. Once a particular file or record has been stored in its entirety, it must be retained and accessible forever – in its entirety. For example, in April we will want to see what a file looked like from March. But in July, we may also want to see what the March file looked like in April, as well as what the March file looked like in May, June, and July. In September, we may want to see what we thought the March file looked like in July – in addition to September. This recursion can go on ad infinitum. It could be for audit purposes, for trending purposes, or just for normal reporting and analytics purposes. But we must have this capability of recursive time travel.
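The recursive time travel described above can be modeled as a bitemporal store: each version of a record carries both the period it describes and the moment it was recorded, and a query pins both dates. This is a minimal sketch under those assumptions; the class and function names are illustrative, not from the article.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Version:
    key: str             # which object, e.g. "march_file"
    valid_date: date     # the period the data describes
    recorded_date: date  # when this view of the data was stored
    payload: str         # the contents as of that recording

def as_of(versions, key, valid_date, recorded_date):
    """What did `key` for `valid_date` look like on `recorded_date`?
    Returns the latest version recorded on or before that date."""
    candidates = [
        v for v in versions
        if v.key == key
        and v.valid_date == valid_date
        and v.recorded_date <= recorded_date
    ]
    return max(candidates, key=lambda v: v.recorded_date, default=None)

# March data as first loaded in April, then restated in June.
history = [
    Version("march_file", date(2024, 3, 31), date(2024, 4, 2), "initial load"),
    Version("march_file", date(2024, 3, 31), date(2024, 6, 15), "restated"),
]

# In May we still see the April view; by July we see the June restatement.
print(as_of(history, "march_file", date(2024, 3, 31), date(2024, 5, 1)).payload)
print(as_of(history, "march_file", date(2024, 3, 31), date(2024, 7, 1)).payload)
```

Because old versions are never overwritten, every past view remains answerable, which is exactly the March-file-as-seen-in-July question above.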
The second dimension of the ‘never forgetting’ capability includes the catalog of everything in our data platform (all the data we ever loaded). This catalog requires a thoughtful and robust taxonomy that must be created and maintained to avoid unnecessary (re)work. It is also useful for auditing and other meta-reporting activities.
4. Doing These Things Forever
The last capability of “doing these things forever” speaks to scalability and sustainability. “Scalability” in this sense does not speak to “big data.” Again, being able to handle large volumes of data or heavy processing of data (or any of the Vs) must be an assumed capability. Scalability as we mean it here speaks to the large count of distinct widgets a data organization must deliver and maintain. Some of these widgets may be simple reports, which are easy to produce and may rarely need maintenance. Other widgets may be complex self-service BI marts or entire departmental warehouses involving dozens or hundreds of objects. These may have dedicated teams involved in their creation and maintenance.
The sustainability component is straightforward in theory and quite difficult in practice. It speaks to the three areas of people, process, and technology. We can’t burn out our people. We can’t have onerous processes that themselves require a team to maintain. And, lastly, we must have a modern technology stack that takes advantage of tools and techniques that make development and maintenance of widgets a simple undertaking.
If a data organization can master these four capabilities, it will be able to quickly deliver high-quality data widgets at scale.
Granted, this is a tautology since these capabilities are wrapped into our claim and our title. So besides the four distinct capabilities themselves, another capability is being able to solve for the competing priorities within and between each capability.
In fact, it is the managing and solving of these competing priorities that will paralyze or accelerate a data organization’s ability to deliver.
Back to our search for a medium gluten-free combo pizza, no olives: we had (only) two competing priorities, quality and speed. Granted, there’s no such thing as a free meal, but there were lower-cost, higher-benefit actions the kitchen could have taken to achieve a more positive outcome. And obviously, it is the same within a data organization.
Organizations are overcoming these challenges. It is possible. Call Caserta, we’ve been there, done that.