
On Our Mind: MLOps

May 17, 2021

Esube Bekele, Technology Architect / Morgan Mahlock, Senior Associate

MLOps addresses hidden technical debt in ML systems, streamlining the lifecycle from prototyping to deployment through unified management and DevOps integration.

Machine Learning Model Operationalization Management (MLOps) is an emerging category that is growing in response to hidden technical debt in production machine learning systems. While prototyping and experimenting with ML models can be quick thanks to the strength of existing tools, managing the overall lifecycle of the models also involves data preparation (DataOps). Furthermore, deploying and monitoring models at scale in production can be difficult and slow. Many organizations have ML researchers and data scientists develop models in isolation and then slowly release finished models into production. This approach is no longer acceptable, hence the introduction of DevOps principles into the ML development and deployment pipeline. However, integrating ML algorithms into software DevOps systems adds a new layer of complexity that must be managed.

Organizations want to democratize AI/ML to deliver value across the enterprise, yet this often creates a fragmented ecosystem of tools and platforms as the different stakeholders (e.g., ML researchers, data scientists, ML engineers, operations professionals) struggle to manage their infrastructure across the full ML lifecycle (of note, we use ML Platforms and MLOps interchangeably throughout this blog). MLOps provides a unified approach to design, build, and manage ML-powered applications in a reproducible, testable, and evolvable manner. MLOps extends the benefits of continuous integration (CI) and continuous delivery (CD) to ML development.

Figure 1. High-level view of the MLOps pipeline. Source: https://www.incyclesoftware.com/azure-machine-learning-enterprise-accelerator

In this blog we highlight accelerating research in the field, production challenges, and evolving practices for production ML.

Accelerating Research

Developing ML systems is gaining strategic importance within commercial companies and governments alike due to the rise of the deep learning-driven ML revolution. According to a recent report on artificial intelligence (AI), between 1998 and 2018, the volume of peer-reviewed AI papers grew by more than 300%, accounting for 3% of all peer-reviewed journal publications and 9% of all published conference papers.

Based on the number of publications, China publishes as many AI journal and conference papers as Europe, and it surpassed the U.S. in 2006 (see Fig. 2).

Figure 2. Total number of AI papers, 1998-2018. Raymond Perrault, Yoav Shoham, Erik Brynjolfsson, John Etchemendy, Barbara Grosz, Terah Lyons, James Manyika, Juan Carlos Niebles, and Saurabh Mishra, Artificial Intelligence Index Report 2019.

Most of China's publications come from government-affiliated authors, while in the United States, corporate entities supply most AI publications. The U.S. still leads in the impact of its papers compared to China and European countries. The U.S. also has a fairly high level of academic and corporate collaboration on these AI research publications compared to China and European countries (see Fig. 3).

The economic implications of AI-powered applications are increasingly significant: 2019 saw $37B of AI-related startup investment, in contrast with $1.2B in 2010.

Figure 3. AI citation impact vs. total number of academic-corporate AI papers. Raymond Perrault, Yoav Shoham, Erik Brynjolfsson, John Etchemendy, Barbara Grosz, Terah Lyons, James Manyika, Juan Carlos Niebles, and Saurabh Mishra, Artificial Intelligence Index Report 2019.

Production Challenges

The rapid technological advances in ML research eventually hit significant production barriers, which are outlined in a seminal paper on the hidden technical debt of ML. It is often easier to create and prototype complex ML systems quickly than it is to perform the costly task of deploying them in production. Further, these complex systems can incur massive ongoing maintenance costs. This expense is due, in part, to fragmentation: the number and development cost of the infrastructure management, performance monitoring, and analytics tools and platforms needed to support the whole ML product lifecycle. The landscape of tools and platforms in this space is fragmented, revealing the need for an overarching organization that might focus on best practices and on efficiently operationalizing performant systems.

Evolving Practices for Production ML

To make sense of this plethora of supporting tools and platforms, a new practice is emerging that provides communication and collaboration between stakeholders (ML researchers, data scientists, ML engineers, and operations professionals) in the ML lifecycle. This emerging practice enables production-level, end-to-end management of the ML development process and lifecycle. MLOps helps design, build, and manage reproducible, testable, and evolvable ML-powered applications. MLOps must be a language-, framework-, platform-, and infrastructure-agnostic practice.

MLOps aims to:

Unify the release cycle for ML and software application release.
Enable automated testing of ML artifacts (e.g., data validation, ML model testing, and ML model integration testing).
Enable the application of agile principles to ML projects.
Treat ML models, and the datasets used to build them, as first-class citizens within software development lifecycle (SDLC) CI/CD systems (not just the supporting software codebase).
Reduce the technical debt across machine learning models.
Accelerate time-to-value by automating the development and deployment of ML models.
Enable management of infrastructure, optimize productivity, and secure deployment.
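
As a rough illustration of the automated-testing aim above, the following is a minimal Python sketch of a release gate that a CI/CD pipeline might run before promoting a model. It is not taken from any particular MLOps platform: the names `validate_rows`, `accuracy`, `release_gate`, `toy_model`, the schema format, and the 0.9 threshold are all hypothetical, chosen only to show data validation and model testing working as one automated check.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]                   # one record: feature name -> value
Schema = Dict[str, Tuple[float, float]]  # feature name -> (min, max) allowed


def validate_rows(rows: Sequence[Row], schema: Schema) -> List[str]:
    """Data validation: return a list of problems (empty means the data passed)."""
    problems = []
    for i, row in enumerate(rows):
        for name, (lo, hi) in schema.items():
            value = row.get(name)
            if value is None:
                problems.append(f"row {i}: missing '{name}'")
            elif not lo <= value <= hi:
                problems.append(f"row {i}: '{name}'={value} outside [{lo}, {hi}]")
    return problems


def accuracy(predict: Callable[[Row], int], rows: Sequence[Row],
             labels: Sequence[int]) -> float:
    """Model testing: fraction of held-out rows the model labels correctly."""
    correct = sum(predict(row) == label for row, label in zip(rows, labels))
    return correct / len(labels)


def release_gate(predict: Callable[[Row], int], rows: Sequence[Row],
                 labels: Sequence[int], schema: Schema,
                 min_accuracy: float) -> Tuple[bool, List[str]]:
    """Block the release if the data is invalid or the model underperforms."""
    problems = validate_rows(rows, schema)
    if problems:
        return False, problems
    score = accuracy(predict, rows, labels)
    if score < min_accuracy:
        return False, [f"accuracy {score:.2f} below threshold {min_accuracy:.2f}"]
    return True, []


# Hypothetical stand-in for a trained model and its held-out test set.
def toy_model(row: Row) -> int:
    return int(row["age"] >= 40)


SCHEMA: Schema = {"age": (0, 120)}
ROWS = [{"age": 25}, {"age": 52}, {"age": 67}, {"age": 31}]
LABELS = [0, 1, 1, 0]

ok, reasons = release_gate(toy_model, ROWS, LABELS, SCHEMA, min_accuracy=0.9)
print("release approved" if ok else f"release blocked: {reasons}")
```

In a real pipeline the model, test set, and thresholds would come from versioned artifacts rather than constants, so that a failed gate stops the release the same way a failed unit test would.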

What's Next?

As this emerging practice grows, so do the platforms and tools that support it. The startup ecosystem, big tech companies, and the open-source ecosystem all play a part in developing the MLOps discipline. Many will benefit by streamlining their ML pipelines and incorporating best practices derived from DataOps and MLOps to build highly scalable, reproducible, testable, explainable, and secure ML infrastructure toolchains.

Resources

For more discussion on this topic, check out Andrew Ng's recent talk on MLOps, or this GitHub list with hundreds of MLOps resources.
