
AI at Scale isn’t Magic, it’s Data – Hybrid Data


A recent VentureBeat article, “4 AI trends: It’s all about scale in 2022 (so far),” highlighted the importance of scalability. I recommend you read the entire piece, but to me the key takeaway – AI at scale isn’t magic, it’s data – is reminiscent of the 1992 presidential election, when political consultant James Carville succinctly summarized the key to winning: “it’s the economy.” Sometimes the most important issue is hiding in plain view. The article goes on to share insights from experts at Gartner, PwC, John Deere, and Cloudera that shine a light on the critical role that data plays in scaling AI.

This excerpt from the article sums it up: 

Julian Sanchez, director of emerging technology at John Deere, hit the nail on the head: the thing about AI is that it “looks like magic.” There’s a natural leap from the idea of “look what this can do” to “I just want the magic to scale.” But the real reason AI can be used at scale, he emphasized, has nothing to do with magic. It’s because of data.

Let this sink in for a while – AI at scale isn’t magic, it’s data. What these data leaders are saying is that if you can’t do data at scale, you can’t possibly do AI at scale. That means no digital transformation. Innovation stalls. Risk increases. Data and AI projects cost more and take longer. Many fail. This leads to the obvious question – how do you do data at scale?

The answer to that question was eloquently articulated by Hilary Mason a few years ago in the AI pyramid. AI needs machine learning (ML). ML needs data science. Data science needs analytics. And they all need lots of data. Ideally, they should all work together on a common platform.

In the article, Bret Greenstein, data, analytics and AI partner at PwC, observes that “No matter how organizations move toward scaling AI in the coming year, it’s important to understand the significant differences between using AI as a ‘proof of concept’ and scaling those efforts.” He goes on to say, “The key lesson in all of this is to think of AI as a learning-based system.” He’s absolutely right. A proof of concept works from a limited, very incomplete view of an organization’s data. But when that AI system is depended upon to make business-critical decisions, the data set must be complete, accurate, and updated on a real-time (or near-real-time) basis.

The takeaway – businesses need control over all their data in order to achieve AI at scale and digital business transformation. As Julian and Bret say above, a scaled AI solution needs to be fed new data as a pipeline, not just a snapshot of data, and we have to figure out how to collect the right data and put it to work in a way that is not so onerous. The challenge for AI is how to do data in all its complexity – volume, variety, velocity. It’s also about how to use data anywhere to provide the most complete and up-to-date picture for the AI systems as they continue to learn and evolve.

And to do that, you need data, lots of data – think Neo – TB, PB scale. Why? Because that is how models learn. You also need to continually feed models new data to keep them up to date. Most AI apps and ML models need different types of data: real-time data from devices, equipment, and assets, as well as traditional enterprise data such as operational, customer, and service records.

But it isn’t just aggregating data for models. Data needs to be prepared and analyzed. Different data types need different types of analytics – real-time, streaming, operational, data warehouses. As Mason said, all the data management, data analytics, and data science tools should easily work together and run against all this shared data. And that data is likely in clouds, in data centers and at the edge. Summing it up – doing data at scale requires data management, data analytics, data science, TB/PB of data and a variety of data types that can be anywhere. Doing data at scale requires a data platform. 

What type of data platform does data at scale best? First, you need the data analytics, data management, and data science tools. Next, they should be integrated – easy to use and easy to manage. They should all work on shared data of any type – with common metadata management – ideally open. Common security and governance become pretty important if you are going to get to production. And then there is scale – across clouds and on-prem – and across massive volumes of data, without sacrificing performance.

And that means more than just a simple data cloud or cloud data platform. It should have common management, security and governance tools. It should run on any cloud or on-prem. We believe the best path is with a hybrid data platform for modern data architectures with data anywhere. Because with AI at scale – “it’s the data.”

Looking to do AI at scale at your organization? Learn more about Cloudera’s hybrid data platform that can provide the data foundation you need.