Error analysis using TF-IDF and common sense

Serj Smorodinsky
6 min read · Feb 19, 2022

Audience for this post: ML/DL practitioners who are familiar with NLP and classification tasks. Error analysis is definitely worthwhile for all modalities, such as vision and audio, but my main focus is NLP and text. You can find equivalent error analysis methods for those modalities as well.

A lot of posts have been written about modern NLP architectures such as BERT (no post on NLP is complete without some reference to this Transformer-based model), T5, GPT-2/3, BART, the deceased ELMo and more. Every week there is yet another awesome idea in a research paper that everyone is eager to cover. But to paraphrase an Israeli artist, Mooki: "everyone is talking about model interpretability, but nobody is talking about dataset interpretability and dataset cleaning" (the original quote is "everyone is talking about peace but no one is talking about justice").

What do I mean by that?

Let's ponder a scenario: you're given a task by the product manager to classify animal type from a text sentence (yeah, it's weird, but I'm making a point here). It's your time to shine: doing a literature review of cutting-edge text classification, calling in favours from your friends at National Geographic for an expert to discuss the challenges of detecting sub-types of Pug dogs (there are about 4, last time I googled). The part of assembling a dataset is done for you by the magic elves working with you at your startup, and you are handed a .csv with documents and their labels. And then, finally, you choose CountVectorizer features + a RandomForest classifier as a baseline, because you want to prove to everyone that good ol' bag of words is okay.
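A minimal sketch of such a baseline (the file name and the `text`/`label` column names are my assumptions about what the elves handed over):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical file and column names -- adjust to your dataset
df = pd.read_csv("animals.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Good ol' bag of words feeding a random forest
baseline = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))
```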

You're excited about augmentations; there are lots of options there. You do another literature review to make sure nothing in the landscape has changed over the last several weeks. You're not implementing anything yet, but still, you're waiting for that glorious moment when you can insert random words, replace words with synonyms, maybe add antonyms, or even translate to another language and back to your dataset's language (that's called back-translation), perhaps balancing an imbalanced class or just improving the generalisation of your model overall.

The scores are in; after 10 epochs, your classification report on the test set is as follows:

Totally made up classification report
Totally made up confusion matrix — don’t even try

And then you deploy to production.

Just kidding you go over the errors of course!

You consider the metrics suggested by the product team and the priorities you were handed, and you take the path towards understanding what your model is not understanding.
Let’s just hope it’s all tagging errors, right?

Yup, I’m doing memes for you

Error analysis

Errors .csv file

First error

Jackpot! Let's reveal the first error. It's a shining tagging error: someone erroneously tagged this document as a duck (if it walks like a duck and talks like a duck, then it's a wolf). So now it's up to you to:

  1. Find out what process mistakenly labelled this as a duck.
    1. Is it an automatic procedure involving web scraping, where both the text and the labels are scraped?
    2. Maybe the text is the output of an ASR model, and the ASR transcribed this part incorrectly? (I'll go into this use case in another post about handling spoken datasets)
  2. Change the label (or ask someone to retag this document if you're lazy)

Move on to the next error? NOPE!

Are there any more wrongly labelled samples? How can we answer this question without going over all of the samples?

Use text similarity to search for similar documents that might have mistaken labels, of course.

TF-IDF

We're going to use cosine similarity as our similarity metric. It measures the cosine of the angle between two vectors, disregarding their magnitude: if the vectors point in the same direction the score is 1, and if they point in opposite directions it's -1. For each error, we can search for the most similar documents in the dataset in order to surface other unknown issues with it.
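As a quick sanity check of those edge cases, a tiny sketch in plain NumPy:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two vectors; magnitude cancels out
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0, 3.0])
print(cosine_sim(v, 2 * v))  # 1.0: same direction, magnitude ignored
print(cosine_sim(v, -v))     # -1.0: opposite direction
```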

If you're using TF-IDF vectors as input features, you already have the dataset's vectors lying around somewhere. Still, I suggest redoing the process in a Jupyter Notebook.

Loading the data
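Something along these lines; the file names and column names here are my assumptions:

```python
import pandas as pd

# Hypothetical file names -- use whatever your pipeline produced
dataset = pd.read_csv("dataset.csv")  # columns: text, label
errors = pd.read_csv("errors.csv")    # columns: text, label, predicted
print(dataset.shape, errors.shape)
```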

Defining methods
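A sketch of the search method, reusing the `dataset` DataFrame from the previous snippet:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Vectorize the whole dataset once; reuse it for every error we inspect
vectorizer = TfidfVectorizer()
dataset_vectors = vectorizer.fit_transform(dataset["text"])

def most_similar(query: str, top_k: int = 10):
    """Return the top_k dataset rows most similar to the query document."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, dataset_vectors).ravel()
    top_indices = scores.argsort()[::-1][:top_k]
    # Note: if the query itself is in the dataset, it shows up with score ~1.0
    return dataset.iloc[top_indices].assign(similarity=scores[top_indices])
```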

Running the search
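And then, continuing the sketch, point the search at the suspicious error:

```python
# Take the mislabelled "duck" document and look for lookalikes in the dataset
suspicious_text = errors.iloc[0]["text"]
candidates = most_similar(suspicious_text, top_k=10)

# Eyeball the neighbours: documents this similar that carry the same
# "Duck" label are prime suspects for the same tagging mistake
print(candidates[["text", "label", "similarity"]])
```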

After running the code, you find more samples that were wrongly labelled as "Duck", and you change them too. Awesome, you're a step closer to a clean dataset.

Second error

After utilising the previous text-similarity method you don't find anything meaningful, but you have a hunch that something fishy is going on. Let's try searching for "fricking". Wow, you discover that all of your dog documents are filled with curse words.

Someone has a dirty mouth and expressed their feelings about dogs, and that signal led the classifier to choose the "cat" label for this sample. You think I'm making this up? I actually am, but something very similar happened to me.

I'm calling this error type "narrow sample size". Let's say we have only a handful of "Cat" documents, which is not enough to counter the weight of the specific profane words from the dog class.

Another way to diagnose this issue is to look at the attention weights generated by the model (if you're using a transformer architecture) and discover that the attention lands mostly on the word "fricking".
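A rough sketch of that check with the Hugging Face transformers library; the checkpoint name is a placeholder, and averaged attention is only a heuristic for token importance, not a rigorous explanation:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint -- substitute your own fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("that fricking dog again", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Average the last layer's attention over heads and query positions
# to get one rough importance score per token
attention = outputs.attentions[-1].mean(dim=1).mean(dim=1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weight in zip(tokens, attention.tolist()):
    print(f"{token}\t{weight:.3f}")
```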

How to solve this issue?

  1. Try undersampling dogs (a sketch follows this list).
    Maybe there is some needless repetition of profane dog samples that don't contribute much
  2. Find more cat samples
  3. Augment all of the samples with curse words, for fun and profit
    That's what I did, and I will write about it in the following post 🙂
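For option 1, a minimal undersampling sketch, reusing the `dataset` DataFrame and the toy class names from before (it assumes dogs really are the over-represented class):

```python
import pandas as pd

# Downsample the over-represented "dog" class to the size of "cat"
cat_count = (dataset["label"] == "cat").sum()
dogs = dataset[dataset["label"] == "dog"].sample(n=cat_count, random_state=42)
rest = dataset[dataset["label"] != "dog"]
balanced = pd.concat([dogs, rest]).sample(frac=1, random_state=42)  # shuffle
```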

Third error

That's a pretty short sample; maybe you should take it to the product team? Are we even supposed to predict such short samples? "It's a challenge to predict with that limited context," you say in a room full of eyebrow-raising colleagues at the weekly error analysis meeting (if you don't have one, you should; other stakeholders should know how the model is behaving).

"Let's just signal to the team that short samples aren't reliably predictable, but not discard them all, okay?" The product team agrees, and you follow through.

So what have we learned in this session?

Summary

A friend of mine, whom I had the pleasure of coding alongside, once said that a bug is a gift, because it signals that there's a part of the system you don't understand yet.
Well, model errors can be like that too. You can find amazing things: wrong labels due to problem misunderstanding (is that really a duck?), concept drift (urban ducks are just like wolves), narrowly sampled data (why do all of those ducks have glasses?), and all of that by using text similarity and searching for more observations just like your errors. You can even automate this procedure with a few lines of code.

Hope you enjoyed this.
In the next post I'll write about the time I developed a profanity augmentation technique (true story) to improve accuracy in a peculiar case.


Serj Smorodinsky

NLP Team Leader at Loris.ai. NLP | Neuroscience | Special Education | Literature | Software Engineering. Let's talk about it!