This is part seven of a multi-part series to share key insights and tactics with Senior Executives leading data and AI transformation initiatives. You can read part six of the series here.
Now that you’ve completed the hard work in the first six steps outlined in our blog series, it’s time to put the new data ecosystem to use. Organizations must be disciplined in managing and using data to enable use cases that drive business value. They must also establish a clear set of metrics to measure adoption and track the net promoter score (NPS) so that the user experience continues to improve over time.
If you build it, they will come
Keep in mind that your business partners are likely the ones to do the heavy lifting when it comes to data set registration. Without a robust set of relevant, quality data, the data ecosystem will be useless. A high level of automation for the registration process is important because it’s common to see thousands of data sets in large organizations. The business and technical metadata, combined with the data quality rules, help ensure that the data lake is filled with consumable data. The lineage solution should provide a visualization that shows the data movement and verifies that the approved data flow paths are being followed.
Some key metrics to keep an eye on are:
- Volume of data consumed from and written to the data lake
- Percentage of source systems contributing data to the ecosystem
- Number of tables defined and populated with curated data
- Percentage of registered data sets with full business and technical metadata
- Number of models trained with data from the data lake
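As a sketch of how these metrics might be tracked, the snippet below computes a few of them from a hypothetical data set registry. The field names (`registered`, `source_system`, `trained_on_lake`, and so on) are illustrative assumptions, not the schema of any real catalog:

```python
def adoption_metrics(datasets, source_systems, models):
    """Compute a few adoption metrics from a hypothetical registry.

    datasets: list of dicts describing registered/unregistered data sets
    source_systems: list of all source system names in the enterprise
    models: list of dicts describing trained models
    """
    registered = [d for d in datasets if d.get("registered")]
    # Percentage of registered data sets with full business and technical metadata
    with_metadata = [d for d in registered
                     if d.get("business_metadata") and d.get("technical_metadata")]
    # Which source systems are actually contributing data to the ecosystem
    contributing = {d["source_system"] for d in registered}
    return {
        "pct_sources_contributing": len(contributing) / len(source_systems),
        "pct_fully_documented": (len(with_metadata) / len(registered)
                                 if registered else 0.0),
        "models_trained_on_lake": sum(1 for m in models
                                      if m.get("trained_on_lake")),
    }
```

Wiring such a computation into a dashboard gives leadership a recurring, objective read on adoption rather than anecdotes.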
DevOps — software development + IT operations
Mature organizations develop a series of processes and standards for how software and data are developed, managed and delivered. The term “DevOps” comes from the software engineering world and refers to developing and operating large-scale software systems. DevOps defines how an organization, its developers, operations staff and other stakeholders establish the goal of delivering quality software reliably and repeatedly. In short, DevOps is a culture that consists of two practices: continuous integration (CI) and continuous delivery (CD).
The CI portion of the process is the practice of frequently integrating newly written or changed code with the existing code repository. As software is written, it is continuously saved back to the source code repository, merged with other changes, built, integrated and tested — and this should occur frequently enough that the window between a commit and its build stays narrow, so errors surface while the change is still fresh and developers can correct them immediately.
This is particularly important for large, distributed teams to ensure that the software is always in a working state — despite the frequent changes from various developers. Only software that passes the CI steps is deployed — resulting in shortened development cycles, increased deployment velocity and the creation of dependable releases.
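The CI gate can be pictured as a simple rule: a commit is deployed only if every test passes. The Python sketch below illustrates the idea; the function names are hypothetical, and a real pipeline would of course be built on an actual CI service and build tooling rather than hand-rolled code:

```python
def run_tests(test_suite):
    """Run every test in the suite; return True only if all pass."""
    return all(test() for test in test_suite)

def ci_pipeline(commit, test_suite, deploy):
    """Integrate a commit: build and test it, and deploy only on success.

    commit: identifier of the change being integrated
    test_suite: iterable of zero-argument callables returning True/False
    deploy: callable invoked with the commit when the gate passes
    """
    if run_tests(test_suite):
        deploy(commit)
        return "deployed"
    return "rejected"
```

The point of the sketch is the asymmetry: a failing test blocks deployment entirely, which is what keeps the mainline in a working state despite frequent changes from many developers.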
DataOps — data processing + IT operations
DataOps is a relatively new focus area for the data engineering and data science communities. Its goal is to use the well-established processes from DevOps to consistently and reliably improve the quality of data used to power data and AI use cases. DataOps automates and streamlines the lifecycle management tasks needed for large volumes of data — basically, ensuring that the volume, velocity, variety and veracity of the data are taken into account as data flows through the environment. DataOps aims to reduce the end-to-end cycle time of data analytics — from idea, to exploration, to visualizations and to the creation of new data sets, data assets and models that create value.
For DataOps to be effective, it must encourage collaboration, innovation and reuse among the stakeholders, and the data tooling should be designed to support the workflow and make all aspects of data curation and ETL more efficient.
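As an illustration of the kind of automation DataOps brings, the sketch below applies a simple quality gate to a batch of records before it lands in the lake, checking schema completeness (variety) and null rates (veracity). The field names and the 5% threshold are illustrative assumptions, not a prescribed standard:

```python
def validate_batch(rows, required_fields, max_null_rate=0.05):
    """Return (ok, issues) for a batch of records.

    rows: list of dicts, one per record
    required_fields: field names every record must carry
    max_null_rate: maximum tolerated fraction of null values per field
    """
    issues = []
    if not rows:
        return False, ["empty batch"]
    for field in required_fields:
        missing = sum(1 for r in rows if field not in r)
        if missing:
            issues.append(f"{field}: absent from {missing} rows")
            continue
        null_rate = sum(1 for r in rows if r[field] is None) / len(rows)
        if null_rate > max_null_rate:
            issues.append(f"{field}: null rate {null_rate:.0%} exceeds limit")
    return (not issues), issues
```

Running checks like these automatically on every load, rather than manually after the fact, is what lets a DataOps practice keep pace with thousands of data sets.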
MLOps — machine learning + IT operations
Not surprisingly, the term “MLOps” takes the DevOps approach and applies it to the machine learning and deep learning space — automating or streamlining the core workflow for data scientists. MLOps differs from DevOps and DataOps because the approach to deploying effective ML models is far more iterative and requires much more experimentation — data scientists try different features, parameters and models in a tight iteration cycle. Across all these iterations, they must manage the code base, understand the data used to perform the training and create reproducible results. The logging aspect of the ML development lifecycle is critical.
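One way to picture the logging requirement: every training run records its parameters, a fingerprint of the training data and the resulting metrics, so any result can be traced back and reproduced. The hand-rolled sketch below illustrates the idea and is not a substitute for a real experiment-tracking tool:

```python
import hashlib
import json
import time

class RunLogger:
    """Records parameters, a data fingerprint and metrics per training run."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, train_data, metrics):
        # Fingerprint the training data so identical inputs are detectable
        fingerprint = hashlib.sha256(
            json.dumps(train_data, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.runs.append({
            "params": params,
            "data_fingerprint": fingerprint,
            "metrics": metrics,
            "timestamp": time.time(),
        })

    def best_run(self, metric):
        """Return the logged run with the highest value of the given metric."""
        return max(self.runs, key=lambda r: r["metrics"][metric])
```

With every run logged this way, the team can answer the two questions that matter most in the iteration cycle: which configuration performed best, and was it trained on the same data as the runs it is being compared against.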
MLOps aims to manage deployment of machine learning and deep learning models in large-scale production environments while also focusing on business and regulatory requirements. The ideal MLOps environment would include data science tools where models are constructed and analytical engines where computations are performed.
Unlike most software applications, which execute a series of discrete operations, ML platforms are not deterministic and are highly dependent on the statistical profile of the data they use. ML systems can suffer performance degradation as data profiles change. Therefore, the model has to be refreshed even if it currently “works” — leading to more iterations of the ML workflow. The ML platform should natively support this style of iterative data science.
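A minimal sketch of why such refreshes are triggered: compare the statistical profile of incoming data against the training baseline and flag retraining when it drifts. The mean-shift test below is deliberately simple and illustrative; production systems typically use richer drift statistics:

```python
from statistics import mean, stdev

def needs_refresh(baseline, current, z_threshold=3.0):
    """Flag retraining when incoming feature values drift from the baseline.

    baseline: feature values the model was trained on
    current: recent feature values observed in production
    z_threshold: how many standard errors of shift to tolerate
    """
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(current) != mu
    # z-score of the shift in the current sample's mean
    z = abs(mean(current) - mu) / (sigma / len(current) ** 0.5)
    return z > z_threshold
```

A check like this, run on a schedule against production inputs, is what turns “the model has to be refreshed even if it currently works” from a judgment call into an automated trigger.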
Communication is critical throughout the data transformation initiative — however, it is particularly important once you move into production. Time is precious and you want to avoid rework, if at all possible. Organizations often overlook the emotional and cultural toll that a long transformation process takes on the workforce. The seam between the legacy environment and the new data ecosystem is an expensive and exhausting place to be — because your business partners are busy supporting two data worlds. Most users just want to know when the new environment will be ready. They don’t want to work with partially completed features, especially while performing double duty.
Establish a solid communication plan and set expectations for when features will come online. Make sure there is detailed documentation, training and a support/help desk to field users’ questions.
After a decade in which most enterprises took a hybrid approach to their data architecture — and struggled with the complexity, cost and compromise that come with supporting both data warehouses and data lakes — the lakehouse paradigm represents a breakthrough. Choosing the right modern data stack will be critical to future-proofing your investment and enabling data and AI at scale. The simple, open and multi-cloud architecture of the Databricks Lakehouse Platform delivers the simplicity and scalability you need to unleash the power of your data teams to collaborate like never before — in real time, with all their data, for every use case. For more information, please visit Databricks or contact us.
This blog post, part of a multi-part series for senior executives, has been adapted from the Databricks eBook Transform and Scale Your Organization With Data and AI. Access the full content here.