Philosophy of an Experimentation System — MLOps Intro

Serj Smorodinsky
6 min read · Apr 28, 2023

What project structure suits data-science “experiments”?

This is the first part of a five-part series (1/5) on MLOps, brought to you by the ML team at Loris.ai.

Next part: https://serj-smor.medium.com/mlops-without-magic-100365b22d1a

Intro

The Loris ML team consists of engineers with different skill sets: some lean more toward ML, some toward DS.
First ML/DS roles in startups tend to require a little bit of both, which is why I don't try to differentiate between the roles.
For brevity I'll stick to just "ML" for the rest of the series.

You might think this is not relevant for you because you're an algorithm developer who doesn't care about "systems" or "architecture". Well, then I have a surprise for you.

Did you know that most ML projects never make it to production?
The reasons vary. One is that production is a mess and requirements change frequently; a little more know-how about what "production" means will go a long way toward getting your "baby" shipped. Another reason projects don't get deployed is that their value is not tangible to stakeholders; on that front, some interactivity can give you a big boost of buy-in.

But first, what are the signs that you’re in an ad-hoc state of ML development?

Antipatterns

The following list is composed from my own experience, meaning that I have, at one time or another, done every bullet on it. This is not a shaming fest, but rather a chance to view our behavior critically, which is necessary in order to improve at any practice.

  • Have you ever sent a notebook to an engineer?
  • You work mostly locally in a single Conda env across all projects (Exception: M1 users)
  • Folders on your local machine are called “client_a_2023_feb” containing client data
  • You’re training models in a notebook and saving them locally
  • Preprocessing and training in the same script/notebook
  • You’re doing error analysis by looking at static CSVs/ static W&B dashboards
  • Committed data or models to .git (unless you’re working with DVC-like paradigms)
  • During business-oriented meetings you’re showing off confusion matrices and classification reports
  • You work in a silo even if you have other team members beside you
  • You avoid collaboration because explaining how to continue from your results might take several days

One last disclaimer — even though there are certain times when ad-hoc solutions are appropriate, mostly they aren’t and can be avoided for the better.

Dealing with changes

Yes, it’s 2023 and we’re still talking about notebooks. 🐻 with me for a minute.

This is intended to provoke you! A little bit of trolling keeps attention high 📈

System vs Ad Hoc

There is some misconception about what the role of DS/ML is. First of all, we create business value: improving on a KPI-related issue for our customers. Right?

right?

The same goes for the rest of the engineering teams: full-stack (FS) engineers build apps, backend engineers create infrastructure, and so forth. So if an FS engineer creates a button using HTML with jQuery, it doesn’t matter that it’s technology from the 2000s, as long as that button works. Right?
The same goes for a backend engineer who writes a huge pile of code without breaking it into functions, and for a data scientist developing in one huge notebook.

Looking at how coding paradigms have progressed over the last 40 years (speaking from my perspective only), this is not how things are done.
Backend/FS/frontend/game/system developers have learned that development is not a one-time thing. Requirements will change with time and you will have to deal with them; and once your feature ships, client expectations mean the new requirements must be met in less time, catching you unprepared while you’re juggling a few other things as well.

No matter which software paradigm you work in, if you write in an unmaintainable (ad-hoc) way, it will backfire on you, causing you and your team anguish (or a growth opportunity).

DS “Experiment”

First — what is an experiment?

  1. Inputs — data, hyperparameters, configurations, features
  2. Outputs — raw results, calculated metrics, models, charts
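This definition can be sketched as a small record type. A minimal sketch in Python; the field names and paths here are illustrative assumptions, not a standard schema:

```python
# An "experiment" as a record of inputs (data reference, hyperparameters,
# features) and outputs (metrics, model artifact). Serializing it to JSON
# makes a single experiment shareable and diffable.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Experiment:
    # Inputs
    data_path: str
    hyperparameters: dict
    features: list
    # Outputs, filled in after the run
    metrics: dict = field(default_factory=dict)
    model_path: str = ""

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Experiment":
        return cls(**json.loads(payload))

exp = Experiment(
    data_path="s3://bucket/client_data.csv",   # hypothetical path
    hyperparameters={"lr": 0.01, "epochs": 5},
    features=["text_length", "sentiment"],
)
exp.metrics = {"f1": 0.82}

# Round-tripping through JSON preserves the full experiment record.
restored = Experiment.from_json(exp.to_json())
assert restored == exp
```

Writing the record to a shared location (instead of keeping it in notebook memory) is what makes the tracking and sharing discussed below possible.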

Most experiments will produce unsatisfying results, hence we end up tracking only a handful of successful experiments, even though a failed experiment is very interesting in its own right.

So far, a notebook can totally suffice. You read some .csv, transform it into something, train a model, calculate metrics, and that’s it.

But what if you want to track several experiments, or even save them and share them within the team?
Aren’t W&B or Neptune.ai enough as an experiment tracker?
Just send your dashboard over and we’ll talk about your metrics.
But what if another team member wants to run the exact same thing while working on a different aspect of the project? Or you might need to go on vacation and somebody has to rerun your notebooks while you’re away? What if the input has changed?

None of those requirements were met by the “experiment” definition, but they will be defined in the next paragraph.

DS “Experimentation System” philosophy

As I’ve mentioned before, going by the “I want..” meme, we’re trying to build a system. I hope the previous paragraphs have convinced you why systems, or proficient software paradigms, are better than the “I’m just sending notebooks in Slack to the eng team” ad-hoc attitude.

Here’s a list of our requirements for an experimentation system (some of the software inspirations come from clean code):

  1. Reproducible — given the same input, you get the same output
  2. Robust to variations in the input — if data changes, code doesn’t (open/closed principle from software engineering)
  3. Modular — preprocessing and training are separate modules
  4. There is an explicit main pipeline (preprocessing, training, saving)
  5. Code changes are tracked — Using .git for code changes
  6. Persistent output storage — you upload outputs to your cloud provider
  7. You can go on vacation — every ML engineer can run this without you
  8. Ability to work in a team — multiple people can contribute code to this project
  9. Tracking and sharing specific experiments — Another engineer should be able to poke at the experiments you chose to share
  10. Interactive outputs — you have a way for others to interact with the model (even before production)
  11. Consistent error analysis — your experiment system automatically flags some mispredictions for you to check

Even though it’s possible to work with a notebook-based experimentation system, it requires a lot of heavy lifting, even for one of the most basic requirements like tracking code changes. Modularity is also possible, but mostly forgotten in a monolithic notebook. Can you go on vacation when you have local notebooks and nobody knows where they are? Can others contribute to your preprocessing methods?
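To make requirements 1, 3 and 4 concrete, here is a minimal sketch in plain Python: preprocessing, training and evaluation as separate functions, wired together by one explicit main pipeline, with a seed so the same input yields the same output. The toy threshold “model” and the field names are my own illustration, not part of any real project’s stack:

```python
# Separate modules (requirement 3) wired by one explicit pipeline
# (requirement 4), seeded for reproducibility (requirement 1).
import random
from statistics import mean

def preprocess(rows):
    # Preprocessing is its own step: here, drop rows with missing values.
    return [r for r in rows if r["value"] is not None]

def train(rows, seed=42):
    # Training is a separate step; a fixed seed makes the run reproducible.
    rng = random.Random(seed)
    sample = rng.sample(rows, k=min(len(rows), 3))
    return {"threshold": mean(r["value"] for r in sample)}

def evaluate(model, rows):
    preds = [r["value"] >= model["threshold"] for r in rows]
    return {"positive_rate": sum(preds) / len(preds)}

def run_pipeline(rows, seed=42):
    # The explicit main pipeline: preprocess -> train -> evaluate.
    clean = preprocess(rows)
    model = train(clean, seed=seed)
    return evaluate(model, clean)

data = [{"value": v} for v in [1.0, None, 2.0, 3.0, 4.0]]
# Same input and same seed -> same output (requirement 1).
assert run_pipeline(data, seed=7) == run_pipeline(data, seed=7)
```

Because each step is a named function rather than a run of notebook cells, another engineer can swap out `preprocess` or rerun the whole pipeline without reading your history.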

Why do these requirements matter? How do we achieve them? How do they relate to the anti-patterns mentioned above?

I will share a story from my own experience.

Personal anti patterns

I was working at a certain company as an ML engineer, making huge progress and developing all of the POCs we needed. After a year, demand grew and we had to bring in more people to help out. As the first ML engineer, I had to welcome everyone and introduce them to the projects.
Up until that point I hadn’t needed to share code or work in a team, and to be honest I wasn’t properly prepared for it. Fortunately, since day one I had been committing code to .git and the code was structured into modules. Without this, it would have been much more painful.

Having said that, when another developer had to improve a model based on my work, he couldn’t find the “data” I had been working on. This wasn’t surprising, because I had been working locally. This need sparked the move to saving everything in S3 and to cleaning and improving our infrastructure.

Another need was consistency.
Each project had a lot of configurations, hyperparameters, different features, etc. It meant that nobody except the lead maintainer could produce a new model or even just experiment.
We tackled this challenge by converting implicit expectations into explicit pipelines, introducing “python-invoke” tasks that are run from the command line. Afterwards, each project had one main way to run/build it.
This change helped tremendously with collaboration. Engineers could go on vacation and their projects could still be used in the meantime.
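The team used python-invoke for this; since that is a third-party package, the same idea, one explicit command-line entry point per pipeline step, can be sketched with the standard library’s argparse. The task names and defaults here are hypothetical:

```python
# One explicit, discoverable entry point per project: each pipeline step is
# a named subcommand, so "how do I run this?" has a single answer.
import argparse

def preprocess(args):
    print(f"preprocessing {args.input}")

def train(args):
    print(f"training with config {args.config}")

def build_parser():
    parser = argparse.ArgumentParser(prog="project")
    sub = parser.add_subparsers(dest="command", required=True)

    p_pre = sub.add_parser("preprocess", help="run preprocessing")
    p_pre.add_argument("--input", default="data/raw.csv")
    p_pre.set_defaults(func=preprocess)

    p_train = sub.add_parser("train", help="train a model")
    p_train.add_argument("--config", default="config.yaml")
    p_train.set_defaults(func=train)
    return parser

# Each project then has one main way to run each step, e.g.:
#   python project.py train --config config.yaml
args = build_parser().parse_args(["train", "--config", "config.yaml"])
args.func(args)
```

With python-invoke the shape is similar: decorated task functions invoked as `invoke train` from the project root, which is what turned our implicit tribal knowledge into an explicit, runnable contract.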

Infrastructure became an enabler of cooperation.

Since then, this has become my default operating mode.
I keep refining it given the numerous new libraries and methodologies coming out. But that’s always the case: the philosophy remains even though the implementation changes.

In the next episode we will iron out the details:
1. What is our project structure?
2. How did we implement pipelines?
3. Where do we store inputs/outputs?

And much more!


Serj Smorodinsky

NLP Team Leader at Loris.ai. NLP | Neuroscience | Special Education | Literature | Software Engineering. Let’s talk about it!