What does data platform failure look like?
Does data warehouse automation exist?
What does modern data platform success look like?
The road to hell is paved with good intentions, or so I’ve heard. Well-meaning IT departments have spent years (decades of resource hours) attempting to implement data warehouse automation in order to help their companies run better, faster, smoother. The idea is so appealing. The reality? Not so much. Setting a modern focus aside for a minute, over the course of the last 30 years (in my opinion) there have been a number of data platform misses that time has forgotten. Much like well-intentioned flying machines, some gave every early indication that they would not only fly but soar to great heights. A stack of six wings, not four nor two – three times as good?! So it goes with data platform solutions over the years!
First things first: in the history of data warehousing there has never been a data platform failure. You heard me – despite the figure InfoWorld reported years ago, that 80% of all data platform solutions fail, this simply does not happen in the real world. Why not? It has to do with the way success and failure are defined. A data platform will generally give a company access to its data – but not always in the way the business needs it, or in a timeframe that is helpful. Data platform success is about more than access to data. No IT department will fess up to a total failure of a data system, even as the new, shiny data platform successfully fails to manage and serve important data assets to the business in a timely manner. Pay close attention to that last phrase, ‘in a timely manner’. Too often time-to-value, time-to-data, or whatever metric is used to describe how fast the data gets to the businesspeople who need it is left open-ended. There, I said it!
So, back to the road to hell. Most data platform and solution misses over the past decades can be attributed to something called data warehouse automation. This is the longtime promise that all post-go-live data will be magically ingested, stored, cleaned, transformed, managed, presented, and maintained in an automated way. Nowadays you can even throw in additional wording such as DataOps or CI/CD for some extra flair! For decades software vendors have touted the idea of complete data warehouse automation without delivering. There is no tool, or set of tools, that will manage any company’s data from end to end without IT involvement. The attempt to achieve complete automation has led more companies down a rabbit hole than over the rainbow to a pot of gold. The reality is that certain steps within your data solution may be combined or consolidated – that is what is being automated, not the whole thing!
In this post, I’ll discuss two different varieties of what is billed as data warehouse automation. In the first instance (here referred to as “Data Solution #1” or “DS#1”), a Data Analyst uses a tool to configure data schemas and data mappings in a very visual, intuitive, business-friendly way. Then the automation comes in: at the push of a button this work gets translated and instantiated as a technical solution without a Data Engineer getting involved. In the second instance (referred to as “Data Solution #2” or “DS#2”), by following a design pattern provided by a modeling tool, a Data Engineer is able to configure and process data streams without a Data Architect getting involved. While these tools do provide some amount of automation, they almost never provide the level of automation needed to completely bypass these other roles. Attempting to do so can create issues with data accuracy, system load, and versioning, because tools can’t see the bigger picture of an enterprise data warehouse environment.
Data warehouse automation is not a new concept. It’s been twenty years since I first encountered the DS#1 variety of data warehouse automation. At the time, one of the largest system integrators in the industry proposed “automated” solutions to its clients, driven by its existing software partnerships. This was at a time when the data warehouse industry was big on business users having free and easy access to all needed data without any IT involvement. The promise was that companies would be able to move faster and in a more agile way by having decision makers access all the data they needed on a day-to-day basis. This solution did not take anything on the back end into account. Business users, who often had access to substantial budgets, were sold on the idea of being freed from their reliance on the IT department. They could purchase a tool (or set of tools combined with expensive, albeit “short term”, consulting services) that would allow them to visually obtain and manage data without needing to understand the back end or use any programming languages. It was a solid sales model. No great surprise – in real life this does not work.
In one example of DS#1 that I encountered, even though client analysts were meant to configure the data solution, configuring the system resources behind the scenes required wrestling with issue after issue: software version incompatibilities, missing database drivers, inconsistent data character sets, and on and on. Compounding these issues, the lead technical consulting firm was a juggernaut in the industry, so expertise and staffing were assumed to be well under control. In reality, some of the vendor resources were relatively new and were learning both the software and best practices on the job. As the implementation proceeded, it became apparent that the requirements gathering process had been incomplete at best. That would have been a red flag much earlier to an experienced Data Engineer, who could have foreseen these issues, appraised the quality of expertise provided by the vendor’s services team, and saved the client organization a lot of headaches and money by being involved throughout the process. The belief that any consulting partner can implement a cookie-cutter data solution while completely bypassing IT resources is a fantasy – or, as in the case of this client, an expensive endeavor.
Thus, the reason DS#1 started to tilt toward a false-success outcome was only partly system-related. The requirements-building process fell down more than a bit. It turns out that when you get a room full of business stakeholders together and ask them what is required, they will inevitably say ‘We want everything!’ or ‘We want all available data from the application!’ At this point there was a false sense of security that the ingestion and transformation platform could handle ‘all available data’. This first step along the road to hell was a big one, because the analysts then began to configure the metadata layer in earnest. Elements for each source entity were configured so that attribute and measure columns for each subject area were automatically generated in the target presentation-layer database. Cross-subject-area relationships would be brokered by the platform. A gargantuan data set was mapped and fed into the monolithic data black box. Magical things started to happen, but not always good magical things. The data streams began to slow over time, causing downstream dependencies to fail. As business needs evolved, the cookie-cutter solution was unable to keep up. The vendor moved on to the next great activation elsewhere, so the only technical resource available became the vendor ticketing system. Scaling up and out would have to come later in creative, customized (expensive!) ways – it was not long before this company was in the market for a NEW shiny automated data warehouse solution.
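To give a flavor of what this kind of “automation” usually amounts to, here is a minimal sketch of metadata-driven generation in the spirit of DS#1. The metadata dictionary, naming conventions, and function names below are hypothetical illustrations, not the actual vendor tool:

```python
# Illustrative only: a toy metadata-driven generator in the spirit of DS#1.
# The mapping structure and naming conventions here are hypothetical, not a
# real vendor's configuration format.

SUBJECT_AREA_METADATA = {
    "sales": {
        "source_entity": "SRC_ORDERS",
        "attributes": {"order_id": "VARCHAR(40)", "customer_id": "VARCHAR(40)"},
        "measures": {"order_amount": "DECIMAL(18,2)", "quantity": "INTEGER"},
    },
}

def generate_presentation_ddl(metadata: dict) -> list[str]:
    """Turn analyst-entered metadata into CREATE TABLE statements for the
    presentation layer -- the 'push of a button' step the tool automates."""
    statements = []
    for subject_area, cfg in metadata.items():
        columns = {**cfg["attributes"], **cfg["measures"]}
        column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
        statements.append(f"CREATE TABLE pres_{subject_area} (\n  {column_sql}\n);")
    return statements

if __name__ == "__main__":
    for ddl in generate_presentation_ddl(SUBJECT_AREA_METADATA):
        print(ddl)
```

Even in this toy form, notice what the generation step does not handle: load scheduling, cross-subject-area relationships, character sets, driver and version drift – precisely the back-end concerns this project ended up wrestling with.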
Fast forward more than fifteen years past the DS#1 implementation, into the digital transformation era, and you might wonder whether cloud platforms and services had helped realize the dream of data warehouse automation. Unfortunately, no – or at least, not yet. The DS#2 example occurred right after the first data lake bust, when mismanaged data lakes became data swamps and turned off those searching for better data management solutions. At this time the pendulum had just swung back to an interest in modern data warehouses adapted for scaling up and out. Enter Data Vault 2.0 (DV2) and the ETL/ELT solutions that accommodated this method for proper storage, and also business usage, of the data. In all fairness, it’s the companion Extract, Transform, and Load (ETL/ELT) platforms and tools that promised data warehouse automation this time around.
This example of a DS#2 preconfiguration was completed using a visual front-end tool, similar to the one in DS#1, only this time more tailored for upstream, data engineering purposes. Data Engineers would not need to wait for a Data Architect to lead a modeling effort – the tool would handle this. The data warehouse would grow organically, in an agile way. Properly managed data structures would be instantiated and loaded automatically once the proper configuration was provided for new source entities and elements. On the surface, this seemed like a great idea – maintain a proper design and then let the platform/tool handle the rest in an automated way.
Without great surprise, the reality of this implementation varied quite a bit from the initial idea. The DV2 hub, link, and satellite data structures were loosely analogous to dimensional structures such as dimensions, facts, and snowflaked lookups. The bottleneck for this approach, and by extension the automation, was that the number of DV2 structures in the first layer of the data solution grew much faster than the number of structures required in the presentation layer (aka the business layer), because additional data structures commonly proliferate during transformation. This created a “black box” for both the business and technical sides, as orphan data structures are inaccessible from either side. The “automated solution” did not automatically formulate the business layer structures, so a longer stabilization period was needed to dev/test/customize the reporting structures for the presentation layer before steady-state reporting could occur. This caused a substantial delay in access to data on the business side, as well as creating the requirement for manual intervention on the IT side. Had a Data Architect been involved with this project from the outset, some semblance of data governance would have prevented the headaches of orphan data structures and the delays in access to information.
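To make that bottleneck concrete, here is a back-of-the-envelope sketch of how quickly raw vault structures can outnumber presentation-layer structures. The counts and multipliers are hypothetical assumptions for illustration, not figures from the actual project:

```python
# Illustrative back-of-the-envelope math, not DV2 guidance: roughly how the
# raw-vault structure count outpaces the presentation layer. All inputs below
# are hypothetical.

def estimate_dv2_structures(source_entities: int, relationships: int,
                            satellites_per_hub: int = 2) -> int:
    """One hub per business entity, one link per relationship, and one or
    more satellites per hub holding descriptive and historical attributes."""
    hubs = source_entities
    links = relationships
    satellites = source_entities * satellites_per_hub
    return hubs + links + satellites

def estimate_presentation_structures(dimensions: int, facts: int) -> int:
    """The business typically only ever sees a handful of dimensions and facts."""
    return dimensions + facts

if __name__ == "__main__":
    raw_vault = estimate_dv2_structures(source_entities=25, relationships=15)
    presentation = estimate_presentation_structures(dimensions=8, facts=3)
    print(f"Raw vault structures: {raw_vault}")        # 90 with these inputs
    print(f"Presentation structures: {presentation}")  # 11 with these inputs
```

Everything in that gap that never gets mapped forward into the business layer is a candidate orphan structure – which is exactly where the stabilization delay and the manual IT intervention came from.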
Both DS#1 and DS#2 were considered to be successful implementations. But by what metric did they actually succeed? Sure, some processes were automated, but not all. Some people were able to access some data. Some IT roles were avoided for some amount of time. But when measuring the success or failure of an enterprise data warehouse automation solution, is that enough to call it a win? My experience shows that including all IT roles and tempering business side expectations with IT side abilities provides an environment where both sides feel successful.
Lessons Learned
Beware of vendors promising shiny, out-of-the-box data warehouse automation solutions
Business solutions that use a technical back end need participation from both sides
Make sure success and failure are defined on your terms – not your software vendor’s
Consider our advisory services – DO NOT let your project successfully fail!