Applying MLOps practices with SageMaker

Introduction

When you build web and mobile applications on serverless backend architectures, one of the most important considerations is that you will need to process, store, and manage the data these solutions collect.

We would like to share our approach to this topic using Amazon SageMaker Pipelines. We started with public data taken from Kaggle to design and deploy our models in development and production environments. Managing multiple environments was challenging, and iterating quickly on our models was just as interesting.

You are probably here because you are already implementing an ML model or are planning to do so.

Key questions to ask before getting started

Is it really necessary to create an ML model for the problem we want to solve?
What do I need to solve my problem? Take a look at the AI Canvas, a great tool for structuring the process of creating an ML model.
How can we ensure that a model reaches production and avoid wasting our clients' money? Keep in mind that implementing these models is not cheap, and for most companies these projects are risky in budget terms (around 80% of the models companies build never reach the production environment).
Monitoring: how to scale the model as the data evolves? Take a look at some resources for monitoring your models and explore ways to detect model drift.
How can we reduce the technical debt of implementing machine learning models as the project grows?
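
To make the monitoring question more concrete, here is a minimal sketch of one simple way to flag drift in a single feature: compare the mean of recent production values with the mean seen at training time, measured in training standard deviations. The function names, the threshold of 2.0, and the sample values are all hypothetical, and this is not how any SageMaker service implements monitoring.

```python
from statistics import mean, stdev


def mean_shift(reference, live):
    """Distance between the live mean and the reference mean,
    expressed in reference standard deviations."""
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("reference sample has no variation")
    return abs(mean(live) - mean(reference)) / spread


def has_drifted(reference, live, threshold=2.0):
    """Flag drift when the shift exceeds the (assumed) threshold."""
    return mean_shift(reference, live) > threshold


# Hypothetical values of one feature: training sample vs. two production samples.
training_ages = [30, 32, 35, 31, 29, 33, 34, 30]
similar_ages = [31, 30, 34, 33, 32, 29, 35, 31]
older_ages = [52, 55, 49, 58, 51, 54, 50, 56]

print(has_drifted(training_ages, similar_ages))  # production looks like training
print(has_drifted(training_ages, older_ages))    # production has shifted
```

A check like this only looks at one statistic of one feature; in practice you would probably track several features and the model's own predictions.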

...

MLOps Introduction

We decided to go with MLOps, which the MLOps SIG defines as “the extension of the DevOps methodology to include Machine Learning and Data Science assets as first-class citizens within the DevOps ecology”.

Throughout this series of articles we will explain how to deploy and put MLOps into practice using the tools provided by SageMaker. We chose it because AWS has made a big effort to develop and document a complete architecture focused on ML best practices.

Key concepts

Machine Learning Pipeline: a set of operations that are executed to produce new models.

Feature engineering: (we take as reference the excellent article “Feature engineering, importance tools and techniques for machine learning”) This is a fundamental topic in ML, defined as the process of selecting, manipulating, and transforming raw data into features that our model can use to predict values or classify. Some developers overlook this process or treat it as unimportant, but in our view it is the MOST IMPORTANT step in the process because:

It simplifies data transformation and improves model accuracy
A poorly managed feature will have a large negative impact on your model
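
As an illustration of what transforming raw data into features can look like, the sketch below applies three common transformations: min-max scaling of a number, one-hot encoding of a category, and deriving a flag from a date. The record fields (`income`, `plan`, `signup`) and the helper names are hypothetical and only serve the example; in the project itself this kind of work would typically happen in a tool such as Data Wrangler.

```python
from datetime import date


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """Encode a category as a list of 0/1 flags, one per known category."""
    return [1 if value == c else 0 for c in categories]


def build_features(records):
    """Turn raw records into named numeric feature rows."""
    categories = sorted({r["plan"] for r in records})
    scaled_income = min_max_scale([r["income"] for r in records])
    rows = []
    for record, income in zip(records, scaled_income):
        # weekday() returns 5 for Saturday and 6 for Sunday
        is_weekend = 1 if record["signup"].weekday() >= 5 else 0
        rows.append([income, is_weekend] + one_hot(record["plan"], categories))
    columns = ["income_scaled", "signup_is_weekend"] + [f"plan_{c}" for c in categories]
    return columns, rows


# Hypothetical raw data
raw_records = [
    {"income": 20000, "plan": "free", "signup": date(2023, 1, 7)},   # a Saturday
    {"income": 50000, "plan": "pro", "signup": date(2023, 1, 9)},    # a Monday
    {"income": 80000, "plan": "free", "signup": date(2023, 1, 11)},  # a Wednesday
]

columns, rows = build_features(raw_records)
print(columns)
for row in rows:
    print(row)
```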

Model training: the process of providing training data to an ML algorithm so that it can learn from it. When the process is complete, you have a model artifact, commonly referred to as an ML model.
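
The idea of training producing an artifact can be sketched with a toy example: fit a straight line to a few points with least squares, then serialize the learned parameters. The JSON string stands in for the model artifact; this is an assumption made for illustration and not the format SageMaker uses. The function names and the data are hypothetical.

```python
import json


def train_linear_model(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept}


def predict(model, x):
    """Apply the learned parameters to a new input."""
    return model["slope"] * x + model["intercept"]


# Hypothetical training data that follows y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = train_linear_model(xs, ys)

# The "artifact": learned parameters saved in a form that can be loaded later
artifact = json.dumps(model)
restored = json.loads(artifact)

print(artifact)
print(predict(restored, 5.0))
```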

Model evaluation: Evaluate the model with the data you have available, ideally data it did not see during training, before you use it for real predictions. Check metrics such as accuracy and precision to determine whether the ML model is performing well.
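
For reference, accuracy and precision for a binary classifier can be computed as in the following sketch. The labels and predictions are made up for the example; in a real project you would more likely rely on a metrics library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of everything predicted as positive, the fraction that really is positive."""
    labels_of_predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not labels_of_predicted_positive:
        return 0.0  # convention used here when nothing was predicted positive
    true_positive = sum(1 for t in labels_of_predicted_positive if t == positive)
    return true_positive / len(labels_of_predicted_positive)


# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

print(accuracy(y_true, y_pred))   # 5 of 8 predictions are correct
print(precision(y_true, y_pred))  # 3 of 5 predicted positives are correct
```

The two metrics answer different questions, which is why it helps to look at more than one of them.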

Overfitting: You will try to create the best possible model with the data you have available, and your model may perform perfectly on that data yet be unsuitable for anything else, because it has memorized the training data instead of learning patterns that generalize.
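
A deliberately extreme sketch of overfitting: a "model" that memorizes every training example scores perfectly on the training data but fails on data it has not seen, while a simple rule that captures the underlying pattern does well on both. The data, the rule (label is 1 when x is at least 50), and the function names are hypothetical.

```python
from collections import Counter


def fit_memorizer(xs, ys):
    """Memorize every training pair; fall back to the most common label otherwise."""
    table = dict(zip(xs, ys))
    default = Counter(ys).most_common(1)[0][0]

    def model(x):
        return table.get(x, default)

    return model


def threshold_rule(x):
    """A simple rule that matches the pattern behind the data."""
    return 1 if x >= 50 else 0


def accuracy(model, xs, ys):
    """Fraction of inputs for which the model returns the true label."""
    return sum(1 for x, y in zip(xs, ys) if model(x) == y) / len(xs)


# Hypothetical data: the label is 1 when x >= 50
train_x = [3, 12, 25, 38, 47, 55, 61, 90]
train_y = [0, 0, 0, 0, 0, 1, 1, 1]
valid_x = [8, 30, 52, 66, 75, 83, 95, 99]
valid_y = [0, 0, 1, 1, 1, 1, 1, 1]

memorizer = fit_memorizer(train_x, train_y)

print("memorizer train:", accuracy(memorizer, train_x, train_y))
print("memorizer valid:", accuracy(memorizer, valid_x, valid_y))
print("rule train:", accuracy(threshold_rule, train_x, train_y))
print("rule valid:", accuracy(threshold_rule, valid_x, valid_y))
```

The gap between training and validation accuracy for the memorizer is the signal to watch for.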

Note: These concepts are a starting point for understanding the ideas developed here, and you will probably find more of them in the other articles of this series.

Let’s continue with the tools we are going to use:

SageMaker Model Registry: Helps you approve ML models, apply versioning, and create artifacts after approval.
SageMaker Projects: Helps us create end-to-end ML projects with CI/CD. This is a very important feature for putting MLOps into practice.
SageMaker Data Wrangler: Lets you import and prepare data and extract features in SageMaker Studio. In our experience it is a great tool for feature engineering, which is a really important step in every ML project.
SageMaker: A great tool to prepare, build, train, and deploy ML models by combining a set of utilities that are ready for production environments. It is important to mention that most of the time you have to use these tools separately and manage all of them yourself.
SageMaker Studio: A machine learning environment where all the tools to build, test, and deploy your ML models are available.
AWS CodePipeline: A continuous delivery service that helps you automate your release pipelines to update your application code and infrastructure.

...

That is all for now.

I would like to give special recognition to my co-worker Santiago Vasquez, who has been working with me on this project. Together we are trying to provide something useful for the community and the Colombian AI industry.

Thanks to

TBBC, our company, for providing the space for innovation to flourish.