MLOps Without Magic

Simple, effective and maintainable.

Serj Smorodinsky
9 min read · Aug 18, 2023

This is the second post in the Loris.ai MLOps series.

Improving CX conversations. http://loris.ai

In this installment we will go over our favorite MLOps implementation, which is simple to achieve with minimal external libraries.

If you'd rather jump straight to the code, here's the repository: https://github.com/SerjSmor/mlops_template.


Disclaimer: we don’t hold anything against magic. We just don’t like it in our pipelines.

TL;DR

This series explains how to implement intermediate MLOps with simple Python code, without introducing MLOps frameworks (MLflow, DVC, etc.).

This is deliberate: most common use cases can be covered by the patterns we propose in this post. These principles are simple and carry over easily to any DS project.

Experiment tracking, on the other hand, has a clear standard: using wandb or neptune will get you very far, very fast.

Content

  1. Intro
  2. Project structure
    — Artifact folders
    — Command line tools
  3. MLOps heart: tasks.py
    — Creating a pipeline
    — Fetching data
  4. Archiving experiments
    — Archiving locally
    — Uploading/downloading experiments to/from the cloud
  5. Summary

Intro

If you're not sure what MLOps stands for (not the acronym itself, but the need it addresses), you can read our previous Medium installment: https://medium.com/@serj-smor/philosophy-of-an-experimentation-system-mlops-intro-b864b0339323.

My interpretation of MLOps is similar to my interpretation of DevOps. As a software engineer, your role is to write code for a certain cause. DevOps covers all the rest: deployment, scheduling automatic tests on code changes, scaling machines under demanding load, cloud permissions, DB configuration and much more.

As a software developer, the more versed you are in DevOps, the better you can foresee issues, fix bugs and be a valued team member.

The same analogy applies to MLOps.

As an ML engineer you're in charge of some code/model. MLOps covers all the rest: how to track your experiments, how to share your work, how to version your models, etc. (full list in the previous post).

The same expertise rule applies to an ML engineer: the more versed you are in MLOps, the better you can foresee issues, fix data/model bugs and be a valued team member.

99% of best practices can be achieved by following the project structure we propose and implementing the patterns below.

Project Structure

I like the WYSIWYG attitude, so without further ado, here's the file structure.

All of the scripts under the general project folder are “tools” that can be run from command line
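In text form, the layout is roughly this (a sketch based on the folders and scripts described below; the repository is the source of truth):

```
mlops_template/
├── data/                  # inputs: raw files and every transformation
├── models/                # checkpoints and the final model
├── results/               # predictions, charts, reports
├── archived_experiments/  # locally archived experiments
├── preprocess.py          # command line tools, right at the project root
├── train.py
├── predict.py
├── results.py
└── tasks.py               # the MLOps entry point
```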

So what project structure do we favor?
What project structure can give us a clear separation of concerns?

Python projects come in different flavors, with some freedom in where scripts and components live. Hence it's important to choose one configuration and apply it consistently across the team.

This project structure is also applicable to the new LLM world we've all been introduced to. Not having a local model is no excuse to throw away organization, versioning and just good ol' clean code patterns.

P.S. If you already recognize this pattern, you can skip to the "Archiving Experiments" section.

Breakdown

Artifact folders: data/ models/ results/

Altogether these folders represent one experiment.

We focus on data/, models/ and results/.

Artifact folders aren't committed to git (unless you're using DVC-like patterns):

  1. Data might contain users' personal information
    — You wouldn't commit passwords to git, right? Right?
  2. Models are heavy; committing and syncing them wastes a lot of time

By having these folders, you simplify the code that uses them (and the life of anyone who reads it).

The rest of the code has default values for all inputs and outputs.

Some experiments are valuable to save and share — the ability to archive and share experiments is presented in the "Archiving Experiments" section.

Data

Data folder contents: raw.csv, transform.csv, clean.csv, train.csv, validation.csv, test.csv

The data/ folder will contain all of the input files you use. Input files vary: you start with the raw input, but any transformation you apply should be kept in the folder as well.

Why keep local transformations too? For debugging, of course!

Have you ever gotten 100 F1 on validation and sensed there must be a bug somewhere?

Going over the different .csv transformations would be the best first step.

Models

This directory will contain checkpoints, and the final model of this experiment.

Results

This directory will contain anything produced after the model has been created: graphs, charts and prediction .csvs.

Command line tools

These scripts are deliberately visible the moment you enter the project.

They are the interface to the project: preprocess.py, train.py, predict.py, results.py and tasks.py.

Any script in the general directory is callable from CMD

Each one of them implements argparse to enable running from the terminal.

Why is this a requirement?

  1. It enables you to build blocks of code that are easily chained through command line invocation
  2. Running a full pipeline is akin to $ python preprocess.py {arguments}, then $ python train.py {arguments}
  3. It enables you to run your code from any remote server by cloning your git repo and running the same commands
  4. This allows you to run on a GPU server (EC2 in the cloud) or on a local workstation
  5. This interface can be shared with your colleagues — this helps reproduce results and enables working as a team on a single project
  6. Rerunning your best experiment becomes explicit: $ python preprocess.py --remove-duplicates, then $ python train.py --model distilbert
  7. This interface allows you to tune hyper-parameters in a systematic way — I'll explain this in the Pipeline section
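For illustration, here's a sketch of the skeleton every tool shares (argument names beyond those shown above are assumptions):

```python
# The shared skeleton of every command line tool (a sketch)
import argparse

def main(args):
    ...  # the tool's actual work goes here

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="One stage of the pipeline")
    # every input/output has a default pointing at the artifact folders
    parser.add_argument("--input-csv", default="data/raw.csv")
    parser.add_argument("--output-dir", default="data")
    args = parser.parse_args()
    main(args)
```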

preprocess.py

Everything starts with preprocessing. We assume that we are working with some data that needs to be transformed, whether it's images, text, audio or tabular data.

A scaffold of preprocessing
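A minimal sketch of such a scaffold, assuming pandas and a tabular raw.csv (the real steps depend on your modality):

```python
# preprocess.py: a minimal sketch of the scaffold
import pandas as pd
from sklearn.model_selection import train_test_split

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # filter rows: drop duplicates and empty values
    return df.drop_duplicates().dropna()

def split(df: pd.DataFrame):
    # create the train / validation / test sets
    train, rest = train_test_split(df, test_size=0.3, random_state=42)
    validation, test = train_test_split(rest, test_size=0.5, random_state=42)
    return train, validation, test

if __name__ == "__main__":
    df = pd.read_csv("data/raw.csv")  # the data of interest lives in data/
    train, validation, test = split(clean(df))
    # the output for the next stage, train.py
    train.to_csv("data/train.csv", index=False)
    validation.to_csv("data/validation.csv", index=False)
    test.to_csv("data/test.csv", index=False)
```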

All of the preprocessing steps assume that the data of interest is in the data/ folder.

Our assumption is that the files we need to process are already there. More on how to achieve this in the Pipeline/Fetching data section.

Let's take an audio classification task as an example.

In the data folder there will be a “raw” folder with .wav files.

During preprocessing we will have to:

  1. Filter rows
    — Check and remove duplicates
  2. Remove abnormalities (too short, too long)
  3. Extract features
  4. Create train, validation and test sets

The same pattern will apply for text classification with differences in implementation of the steps.

Extracting features might not be relevant — especially when working with deep learning (no need to lemmatize, remove stop words, etc.). Instead, you may want to chunk long texts, split paragraphs and more of that sort.

At the end of the script you should create the output for the next stage: train.csv, validation.csv and test.csv under the data/ folder, ready for train.py.

train.py

Because the preprocessing step has already done its job, train.py can be slim and focused.

The preprocessing step created 3 files (train.csv, validation.csv, test.csv), which train.py reads with pd.read_csv().

Command line arguments help us pass hyper-parameters; which ones depend on the task and modality you're working on, but there are a couple that I bet will be relevant: batch_size, learning_rate, epochs.

At the end of the script you should save a model in models/.

If you want to be able to do error analysis on this experiment, you must save the predictions of the model on each of the sets (training, validation, test). Those become the first files in the results/ folder.
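A sketch of what that can look like, with scikit-learn standing in for the real model (the text/label column names are assumptions):

```python
# train.py: a minimal sketch (scikit-learn stands in for the real model)
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

splits = {name: pd.read_csv(f"data/{name}.csv") for name in ["train", "validation", "test"]}

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(splits["train"]["text"], splits["train"]["label"])

joblib.dump(model, "models/model.joblib")  # the final model goes to models/

# save predictions on each set: the first files in results/
for name, df in splits.items():
    df["prediction"] = model.predict(df["text"])
    df.to_csv(f"results/{name}_predictions.csv", index=False)
```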

predict.py

Training aside, it's a pretty basic need to run a model on arbitrary data. Maybe you have periodic testing, and you want to run a model on 1K samples? The output of some SQL query can become the input to this script.

Potential inputs for the script: 1. a .csv with data, 2. a single data point (text, audio, image).

The predict tool will handle, yes, you guessed it correctly, predicting class results, using the model you saved in models/ or an arbitrary model.

At the end of the script you should save the prediction results (if the input was a .csv) in the results/ folder.

It is very handy to be able to test inputs without any dependence on production and its drawbacks.
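A sketch covering both input modes (the model path and the 'text' column name are assumptions):

```python
# predict.py: a sketch covering both input modes (a .csv or a single data point)
import argparse
import joblib
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--model-path", default="models/model.joblib")  # the default, or an arbitrary model
parser.add_argument("--input-csv", help="a .csv with a 'text' column (assumed name)")
parser.add_argument("--text", help="a single data point")
args = parser.parse_args()

model = joblib.load(args.model_path)
if args.input_csv:
    df = pd.read_csv(args.input_csv)
    df["prediction"] = model.predict(df["text"])
    df.to_csv("results/predictions.csv", index=False)  # .csv input -> results/
else:
    print(model.predict([args.text])[0])
```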

results.py

This script goes hand in hand with predict.py and train.py.

  • Do you want to run classification reports and confusion matrices?
  • Top k accuracy?
  • Precision-recall curves?
  • Various distributions?

This is the place to write the more complex analysis code that supports error analysis.

The script should save various outputs in the results/ folder.
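For example, a sketch that produces a classification report and a confusion matrix from the saved predictions (file and column names are assumptions):

```python
# results.py: a sketch of the analysis step
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("results/validation_predictions.csv")
report = classification_report(df["label"], df["prediction"])
matrix = confusion_matrix(df["label"], df["prediction"])

# everything this script produces lands in results/
with open("results/classification_report.txt", "w") as f:
    f.write(report)
pd.DataFrame(matrix).to_csv("results/confusion_matrix.csv", index=False)
```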

tasks.py

Our MLOps code will reside in this script. This is the heart of the operation, setting the stage for downloading, fetching and uploading experiments, running pipelines and everything non-ML. Follow along to the next section.

MLOps heart — tasks.py

Finally we’re introducing an external library. And no, it’s not Airflow or WANDB 🙂

The heart of MLOps

It's py-invoke, and it's best at what it does: calling shell commands from Python, thus enabling you to chain multiple scripts in one specific file called tasks.py.

Here's the gist of tasks.py for our demo project (the full version lives in the repo). A minimal sketch, with an illustrative S3 path:
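```python
# tasks.py: a minimal sketch (the S3 path is illustrative)
from invoke import task

@task
def download_best_model(c):
    # pull the current best model from remote storage into models/
    c.run("aws s3 cp s3://my-mlops-bucket/models/best_model.joblib models/")
```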

  1. By decorating a method as @task, you create a CMD interface for it automatically — without introducing argparse. The invoke library does all the plumbing for you.
  2. Running invoke from cmd: $ inv download-best-model

We’re decoupling MLOps from actual ML code. ML code should handle training, preprocessing etc.

From now on, we will use tasks.py as our MLOps entry point. This creates a standard that simplifies ML scripts and sets expectations for the rest of the team on where things are.

Fetching data

Short methods to bring you joy and data

We’ve gotten used to fetching data from the cloud. This is the best practice for collaboration and privacy.

def prepare_data_for_pipeline()

enables just that. If you have a complex fetching strategy, this is the perfect place to extend it.
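A sketch of what it can look like in tasks.py (bucket and file names are assumptions):

```python
# in tasks.py: a sketch of prepare_data_for_pipeline
from invoke import task

@task
def prepare_data_for_pipeline(c):
    # fetch the raw input from the cloud into data/
    c.run("aws s3 cp s3://my-mlops-bucket/data/raw.csv data/raw.csv")
```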

Pipeline

Depending on the complexity of the project, the number of pipeline stages can vary from 2 to 20.

def pipeline()

Our demo has only 3 stages; some stages can be plain method calls, and some might run an entire script.
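A sketch of how those stages can be chained (the exact stage list is illustrative):

```python
# in tasks.py: a sketch of pipeline(), chaining the demo's 3 stages
from invoke import task

@task
def pipeline(c):
    prepare_data_for_pipeline(c)                       # stage 1: calling the task defined above
    c.run("python preprocess.py --remove-duplicates")  # stage 2: running an entire script
    c.run("python train.py --model distilbert")        # stage 3: hyper-parameters go here
```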

Archiving experiments

Experiments are the core of ML.

MLOps should take care of how we store, share, upload and download them.

We will move all of the directories related to an experiment into a dedicated local folder, archived_experiments. This folder can then be synced to the cloud, enabling persistent storage and sharing experiments with colleagues.

Archiving locally

Here’s an implementation of a simple procedure to store experiments in a folder called archived_experiments.

Saving experiments — never been easier
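A sketch of that procedure (the experiment name argument is an assumption):

```python
# in tasks.py: a sketch of archive()
from invoke import task

@task
def archive(c, name):
    # move the experiment's artifact folders under archived_experiments/<name>
    c.run(f"mkdir -p archived_experiments/{name}")
    c.run(f"mv data models results archived_experiments/{name}/")
```

With this sketch, the call would look like $ inv archive --name my-experiment.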

The method archive() is defined in tasks.py and is invoked through the command line, as shown above.

tasks.py tasks let you call command line processes easily. Each c.run is sequential/synchronous by default, so each command finishes before the next one starts.

Uploading/downloading experiments to/from the cloud

def sync_to_remote will upload any local data that doesn't yet exist in S3.

sync works on diffs, similarly to git.

sync_from_remote does the opposite, allowing other team members to easily download your experiments, thus enabling collaboration and reproduction!
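A sketch of both, assuming the AWS CLI and an illustrative bucket name:

```python
# in tasks.py: sketches of the two sync tasks
from invoke import task

@task
def sync_to_remote(c):
    # aws s3 sync only uploads what's missing or changed remotely
    c.run("aws s3 sync archived_experiments s3://my-mlops-bucket/archived_experiments")

@task
def sync_from_remote(c):
    # the mirror image: download experiments you don't have locally
    c.run("aws s3 sync s3://my-mlops-bucket/archived_experiments archived_experiments")
```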

Sharing is caring!

Summary

This MLOps system is not unique; we learned these patterns from colleagues, the interwebs and friends, and adjusted (fine-tuned :)) them to our needs. It's very useful and can be a huge boost to team productivity, especially for new teams.

It's simple enough to write on a paper napkin, but it can be as complex as you want, without introducing any magic!

In summary, only two patterns were introduced:

  1. Scripts (tools) for ML
  2. tasks.py for MLOps

All of the code is in the repo: https://github.com/SerjSmor/mlops_template. Feel free to fork or star the project.

In the next episode we will delve into advanced patterns: creating a project template with cookiecutter, reading remote .csvs without changing code, managing a Python library as the basis for collaboration and much more!



Serj Smorodinsky

NLP Team Leader at Loris.ai. NLP | Neuroscience | Special Education | Literature | Software Engineering. Let's talk about it!