#1 Python ML Dev Tricks

Serj Smorodinsky
5 min read · May 15, 2022

Pip install like NPM | Symlink for Notebooks | Spacy en_core_web_sm

From time to time I collect little nifty tricks that I like to share. Most of the time the actual solution already exists somewhere in the interwebs, but the first step to the solution is acknowledging there is an actual problem, right? This blog beats my Evernote notes by far, making my tips public!

Take a seat because you’re just in time for your weekly code therapy session.

Jerry has issues, and so do you

Are you sick of installing dependencies through pip and then manually appending the version number of your shiny new library to requirements.txt?

Do you have numerous Jupyter notebooks lying around, but you’re afraid to tuck them away in a subfolder because that breaks Python’s imports?

Have you ever installed Spacy and then had to install en_core_web_sm separately? What if you need this capability in production?

1. NPM had it all along

Every community has lessons worth learning from. A couple of years ago I dabbled in fullstack development, with Node.js and NPM to play with. NPM is JavaScript’s pip, a.k.a. a package manager.
The NPM convention is a package.json file that holds the versions of the libraries your project depends on (a requirements.txt equivalent). Devs mostly don’t edit this file manually. So how do you update it then?

npm install library --save

The save parameter adds the new library and its version to the package.json file (since npm 5 this happens by default, even without the flag). Pretty cool, right?
There are other options too, such as --save-dev, which saves a package as a development dependency that isn’t needed in production.

How to do that with pip?

Well, you can’t. No pip install parameter will edit your requirements.txt for you.
But before you leave with a long face, here’s a little snippet I found on SO that achieves exactly that:
https://stackoverflow.com/a/49079378

Here’s the same snippet

Breakdown of the function

1. pip install $1: $1 is a placeholder for the library name (it’s just the first command line parameter. Some bash scripting is always helpful :))
2. pip freeze: this outputs all of the installed libraries with their versions
3. grep $1: keeps the lines of the pip freeze output that contain the library name. Note that grep matches substrings, so installing numpy would also pick up a line such as numpy-financial if that happens to be installed
4. >> requirements.txt: appends the grep results to requirements.txt
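For those who would rather keep this in Python than in a shell profile, the four steps above can be sketched roughly as follows. This is my own illustration, not the Stack Overflow snippet: the helper names (`matching_pins`, `append_pins`, `pip_install_save`) are hypothetical, and unlike grep it matches the package name exactly rather than as a substring.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def matching_pins(freeze_output: str, package: str) -> list[str]:
    """Return the `name==version` lines whose name is exactly `package`."""
    wanted = package.lower().replace("_", "-")
    pins = []
    for line in freeze_output.splitlines():
        if "==" not in line:
            continue  # skip editable installs, URLs and blank lines
        name = line.split("==", 1)[0].strip().lower().replace("_", "-")
        if name == wanted:
            pins.append(line.strip())
    return pins


def append_pins(pins: list[str], requirements: Path) -> None:
    """Append the pins to the requirements file, creating it if needed."""
    with requirements.open("a", encoding="utf-8") as fh:
        for pin in pins:
            fh.write(pin + "\n")


def pip_install_save(package: str, requirements: Path = Path("requirements.txt")) -> None:
    """Install `package`, then record its pinned version (hypothetical helper)."""
    # Run pip through the current interpreter so the active environment is used.
    subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    append_pins(matching_pins(frozen, package), requirements)
```

Calling `pip_install_save("spacy")` from the project root would install the library and append its pinned line to requirements.txt, much like the shell function does.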

How to actually use it?

Add it to your .zshrc or .bashrc file, depending on your favorite shell. This will allow you to call the function from your command line. To use it right away in the current terminal, run

source ~/.zshrc

Wondering why not just run the following after each install?

pip freeze > requirements.txt

You can read why here: https://medium.com/@tomagee/pip-freeze-requirements-txt-considered-harmful-f0bce66cf895

Also, if you have any other suggestions or reservations, please share them with the rest of us in the comments section.

2. How to reuse internal modules in Notebooks?

This is not a debate about whether to use Jupyter notebooks or not. I think they are definitely worth using for exploratory data analysis (EDA), but I’m stopping at that use case. Notebooks can also lead to messy code and a messy project structure if you’re not careful, especially because it’s hard to reuse code from your internal modules inside a notebook. We will fix the code reuse issue by altering the project structure via symlinks.

What is my preferred project structure? I couldn’t write it better myself (read it!). Here’s a shorthand of it:

app/ -> your python code
notebooks/
tests/
models/ -> models you create
data/ -> current data

Well, you have your notebooks in your notebooks folder, so what’s the problem? Here’s the thing: if you want to import your code, which resides in app/, you will have issues, because Python’s import system won’t find it from there.

Let’s say you have an awesome method in app/utils.py that does some preprocessing on text.

You also have an EDA notebook that resides at notebooks/EDA.ipynb

Let’s start Jupyter Notebook and try to import it.

That’s a bummer

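Why is that happening? Python looks for modules in the directories listed in sys.path, and for a notebook that search starts from the notebook’s working directory, notebooks/. The app package lives in a sibling folder, so it is never found. Well then, we can either move the notebook out of the notebooks subfolder so it will “see” the “app” package, or we can just move to the right context like so: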

We can change the working directory and try again — it works
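As a minimal sketch of that working-directory change, assuming the notebook was started from notebooks/ and the project root is its parent: the helper name `enter_project_root` is hypothetical, and the explicit `sys.path` insertion is an extra safety net of mine for kernels that don’t resolve imports from the current directory.

```python
import os
import sys
from pathlib import Path


def enter_project_root(root: Path) -> Path:
    """Make `root` the working directory and make sure it is importable."""
    root = root.resolve()
    os.chdir(root)
    # Notebooks usually resolve imports relative to the working directory,
    # but adding the root explicitly keeps the import working either way.
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# In notebooks/EDA.ipynb the first cell could then be:
#   enter_project_root(Path.cwd().parent)
#   from app import utils
```

The downside is that every notebook needs this cell at the top, and relative paths to data files now resolve from the project root instead of notebooks/.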

Or, we can have a symlink to the “app” package inside the notebooks folder!

Symbolic Link to the Rescue

What’s a symbolic link (symlink for short)? It’s a special kind of file that serves as a pointer to another file or folder. This is a Linux perspective, and everything in Linux is modeled as a file - here’s a wiki to prove it. Symlinks are pretty much everywhere and are an important concept to get familiar with.

How would one achieve this feat?

$ cd notebooks
$ ln -s ../app # this is the actual symlink creation

Now you have the same import context as the main folder, but within a subfolder, without changing a single line of code! Also, feel free to add symlinks to other important folders such as models/, data/ and more.
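If you’d rather create the link from a setup script than by hand, here is a rough Python equivalent of the two commands above. It assumes the project layout shown earlier; `link_into_notebooks` is a hypothetical helper name, not something from a library.

```python
from pathlib import Path


def link_into_notebooks(project_root: Path, name: str = "app") -> Path:
    """Create notebooks/<name> as a relative symlink to ../<name>."""
    link = project_root / "notebooks" / name
    if link.is_symlink():
        return link  # already linked, nothing to do
    # A relative target keeps the link valid if the project folder is moved.
    link.symlink_to(Path("..") / name, target_is_directory=True)
    return link
```

Calling it once per folder, for example `link_into_notebooks(Path("."), "data")`, covers the additional folders mentioned above.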

There are a couple of caveats to this:

  1. IDEs sometimes show duplicate search results, both from app/ and notebooks/app
  2. Whether to commit these symlinks is a matter of preference, and depends on whether your whole dev team uses the same operating system (symlinks behave differently on Windows, for instance)

As before, please share if you have run into this issue and if you have an improvement on this approach.

3. Spacy en_core_web_sm

I just realised that I know how to type en_core_web_sm by heart thanks to this piece!
I know you don’t have much attention left so I’ll be brief.
You install Spacy, and then try to use its POS tagging methods.

I’ve copied Spacy’s documentation for this example.

Yes, I forgot this step

$ python -m spacy download en_core_web_sm

Well no one is perfect, right? WRONG!

Why is this even an issue? How would you do it in production? You can add this line to the Dockerfile. But there’s a way to do it by appending a line to requirements.txt, just like god intended.

# this will download en_core_web_sm!
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
# other packages

Just pip install -r requirements.txt and you’re good to go! Make sure the model version in the URL matches the Spacy version you have installed.
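To catch a forgotten model early in production, a small startup check can help. The sketch below relies on the fact that pip-installed Spacy models are ordinary Python packages; the function names `model_installed` and `pos_tags` are hypothetical, and the error message is just a suggestion.

```python
from __future__ import annotations

import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """True if the model package can be imported in this environment."""
    # Models installed through pip show up as regular top-level packages.
    return importlib.util.find_spec(name) is not None


def pos_tags(text: str, model: str = "en_core_web_sm") -> list[tuple[str, str]]:
    """Return (token, coarse POS tag) pairs; needs spaCy and the model."""
    if not model_installed(model):
        raise RuntimeError(f"{model} is missing; add it to requirements.txt")
    import spacy  # imported lazily so the check above works without spaCy

    nlp = spacy.load(model)
    return [(token.text, token.pos_) for token in nlp(text)]
```

Failing fast with a readable message at startup beats discovering the missing model on the first request.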

Summary

I hope you enjoyed it. Please take all of the above with a grain of salt: results may vary, and I’m not responsible in any way for any harm done. Having said that, these tricks have really helped me out, and now they are easier to find and share.

Serj Smorodinsky

NLP Team Leader at Loris.ai. NLP | Neuroscience | Special Education | Literature | Software Engineering. Let’s talk about it!