Hugging Face + Keras: a Modern Approach

Serj Smorodinsky
Sep 28, 2022

Audience — Keras (a Python deep learning framework) users who want to improve their Hugging Face integration: reducing computation time, improving accuracy, and simplifying code structure.

If you ever built a text classifier using HF + Keras, this is for you.


Intro

Why did I even think of this? Because I had built an extension of DistilBERT as a sequential model, which forced me to use a fixed input dimension, fully padded up to 512. That is a lot of wasted compute cycles.

It bugged me: how did HF implement Keras models without fixed input? I had to find out.

Also, I might be one of the last people on earth doing deep learning with Keras, since everyone has jumped on the PyTorch train. But being a faithful framework monogamist (Francois Chollet is to blame), I had to find ways to adapt to the times.

This blog is a recollection of that interesting journey.

What Will You Learn

  1. All the different ways to create a Keras HF model
    - Most efficient — by subclassing
  2. Solutions to problems that arise from subclassing
  3. A deeper understanding of the HF BERT model implementation

Vocabulary

Model of choice — DistilBERT uncased, for no particular reason other than its popularity: it had 7.5M downloads just this month.

Input dimension — DistilBERT accepts up to 510 word pieces; a start token and an end token are added to them, which gets us to the well-known number 512. As a simplification, we can think of word pieces as words, meaning DistilBERT can process sentences of up to 510 words.
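To see the 510 + 2 arithmetic in action, here is a minimal sketch using the HF tokenizer (assuming `transformers` is installed; the checkpoint name is just the one used throughout this post):

```python
from transformers import AutoTokenizer

# The tokenizer wraps the word pieces with the start ([CLS]) and end ([SEP]) tokens.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoded = tokenizer("hey there")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)  # ['[CLS]', 'hey', 'there', '[SEP]']
```

Two word pieces in, four tokens out — the extra two are exactly the start and end tokens, which is why 510 usable word pieces become an input dimension of 512.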

HF — Hugging Face, a framework for loading SOTA pretrained models. Base models can be fine-tuned on your data.

Computation time — all computation times noted in “ms” were measured on my CPU; take them as an order of magnitude rather than exact numbers.

Inspiration — this HF article, which suggests this pathway but lacks a full implementation and an overview of the caveats.

Keras background — the gods of Keras have given us three ways to implement Keras models: sequential, functional, and subclassing. Subclassing is the rarest for most use cases, and that’s what I will use. You can read this for a solid explanation of the differences.

Common Implementations Of TF/Keras + HF

  1. Standard — use HF end to end, with Keras as the trainer. HF Tutorial
  2. Intermediate — use the Keras API to build on top of the HF classifier
Example of the intermediate way — GlobalMaxPool, you are my favorite layer
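The embedded gist is not reproduced here; the following is my own minimal sketch of what such an intermediate model might look like (the function and variable names are mine): a functional Keras model on top of a HF base model, with a fixed input dimension and a GlobalMaxPool layer over the last hidden state.

```python
import tensorflow as tf
from transformers import TFAutoModel


def get_model(max_len=128, num_labels=2):
    """Functional Keras model on top of a HF base model, with fixed input."""
    input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

    base = TFAutoModel.from_pretrained("distilbert-base-uncased")
    # (batch, max_len, 768) — the full sequence of hidden vectors
    hidden = base(input_ids, attention_mask=attention_mask).last_hidden_state

    # Max-pool over the sequence dimension instead of using only the CLS vector
    pooled = tf.keras.layers.GlobalMaxPool1D()(hidden)  # (batch, 768)
    output = tf.keras.layers.Dense(num_labels, activation="softmax")(pooled)
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
```

Note the fixed `max_len` in the Input layers — this is exactly the constraint the rest of the post works to remove.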

Why not use the standard way?

  1. It achieves the goal of dynamic input but binds you to what I consider an over-parametrized classifier.
  2. Need anything other than the standard softmax classifier? Even adding a multi-label sigmoid head isn’t straightforward with HF.
  3. You want to improve accuracy — HF implements the classifier using the first vector of the last hidden state, the CLS vector (768 dimensions). Read this suggestion on how to improve on it (I prefer max pooling, for instance).

Why not use the intermediate way?

  1. Long computation time — the common approach is to define a fixed input dimension, which requires padding every sample regardless of its length. Say you set the input dimension to 512: each sample will then take roughly 400–500 ms on average. Even a one-word sentence like “hey” has to be padded all the way up, to avoid shape-mismatch errors in the network.
  2. Reduced accuracy due to truncation — to avoid the high computation time, you might set the input dimension as low as 128 and get ~100 ms inference. But what about long sentences? You will have to truncate them to 128, which can reduce accuracy. Imagine reading this article but only its first 128 words — that could hamper your understanding. And again, short sentences are still unnecessarily padded to 128.
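The padding trade-off is easy to see at the tokenizer level. A minimal sketch (assuming `transformers` is installed) contrasting fixed-length padding with dynamic per-batch padding:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Fixed input: every sample is padded to 512, even a one-word sentence.
fixed = tokenizer("hey", padding="max_length", max_length=512)
print(len(fixed["input_ids"]))  # 512

# Dynamic input: pad only up to the longest sample in the batch.
dynamic = tokenizer(["hey", "hey there my friend"], padding=True)
print(len(dynamic["input_ids"][0]))  # length of the longest sample, far below 512
```

With dynamic padding, a batch of short sentences costs only as much as its longest member — this is the behavior the subclassed model below is able to exploit.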

Subclassing Keras Model Class

Subclassing Keras Model is a way to create a neural network architecture without explicitly declaring one. It is more advanced and less straightforward, but it also gives you the most flexibility. This is the juice that allows HF to avoid a fixed input dimension!

Here’s an example from the HF codebase.

Here is my proof — TFPreTrainedModel inherits from tf.keras.Model.

I’ve added a Gist with my own example — explanations below.

Keras model that accepts many types of HF models
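The gist itself is not embedded here; the following is my own minimal reconstruction of such a subclassed model (the line numbers in the walkthrough refer to the original gist, not to this block):

```python
import tensorflow as tf
from transformers import TFAutoModel


class HFKerasModel(tf.keras.Model):
    """Keras model that wraps any TF-compatible HF base model."""

    def __init__(self, base_model_path="distilbert-base-uncased",
                 targets=1, weights_path=None):
        super().__init__()
        # Accepts various HF checkpoints: distilbert-base-cased, -uncased, etc.
        self.base_model = TFAutoModel.from_pretrained(base_model_path)
        self.pool = tf.keras.layers.GlobalMaxPool1D()
        self.classifier = tf.keras.layers.Dense(targets, activation="sigmoid")

        # The model must be called once so Keras builds the variables
        # before any weights can be loaded.
        dummy = {"input_ids": tf.constant([[101, 102]]),
                 "attention_mask": tf.constant([[1, 1]])}
        self(dummy)
        if weights_path is not None:
            self.load_weights(weights_path)

    def call(self, inputs, training=False):
        # Forward pass; `training` matters for layers like dropout.
        hidden = self.base_model(inputs, training=training).last_hidden_state
        return self.classifier(self.pool(hidden))
```

Note that no input shape appears anywhere — the forward pass is defined implicitly, so the model accepts batches of any sequence length.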

Line by line

This is the same neural architecture as the get_model example I added above (under common implementations), so I won’t go over the layers.

Line 5: class HFKerasModel(tf.keras.Model) — we inherit from tf.keras.Model, which gives our class predefined methods such as the default build, fit, and more.

Line 11: self.base_model = TFAutoModel.from_pretrained(base_model_path) — we can accept various HF models such as distilbert-base-cased, distilbert-base-uncased, and more.

Lines 17–21: you have to call the model in order for the architecture to be built.

Lines 23–25: loading weights, if they were supplied.

Line 27: we define the forward pass; training is a parameter because some layers behave differently depending on whether we are training.

So how come we don’t need fixed input? Because we define our forward pass implicitly, and the network’s variables are created lazily. This means a huge performance boost — no need to pad everything to the max.
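You can verify the dynamic-input behavior directly on the base model. A small sketch (assuming `transformers` is installed) feeding two sequences of different lengths through the same model, with no fixed input and no padding to 512:

```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
base = TFAutoModel.from_pretrained("distilbert-base-uncased")

# Two calls with different sequence lengths on the very same model.
short = tokenizer("hey", return_tensors="tf")
longer = tokenizer("a somewhat longer sentence than the first one", return_tensors="tf")

print(base(short).last_hidden_state.shape)   # (1, 3, 768)
print(base(longer).last_hidden_state.shape)  # (1, N, 768), N matching the input length
```

The sequence dimension of the output simply follows the input — each sample costs only what its own length requires.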

Caveats

  1. You need to add a call method to the class
  2. You need to call the model before predicting or loading weights
  3. You need to fit the model before calling summary
  4. Save the model with model = HFKerasModel(targets=1); model.save_weights(save_path)
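The call-before-load caveat is generic to Keras subclassing, not HF-specific. A toy sketch (class and variable names are mine, purely for illustration) of the save-then-restore round trip:

```python
import os
import tempfile

import tensorflow as tf


class ToyModel(tf.keras.Model):
    """Minimal subclassed model to illustrate the call-before-load caveat."""

    def __init__(self, targets=1):
        super().__init__()
        self.dense = tf.keras.layers.Dense(targets)

    def call(self, inputs, training=False):
        return self.dense(inputs)


model = ToyModel(targets=1)
x = tf.ones((2, 4))
model(x)                          # variables are created lazily on the first call
path = os.path.join(tempfile.mkdtemp(), "ckpt")
model.save_weights(path)          # caveat 4: save via save_weights

restored = ToyModel(targets=1)
restored(x)                       # caveat 2: must call before load_weights,
restored.load_weights(path)       # otherwise Keras has no variables to load into
```

Skipping the `restored(x)` call leaves the new instance without variables, which is exactly why the subclassed HF model calls itself once during construction.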

Summary

I started with an observation: HF works with dynamic input, which boosts performance. Understanding how HF achieves this feat led me down an interesting rabbit hole, which in turn opens up many more exciting opportunities, such as putting more custom logic inside my models. For instance, I could use 3–4 different HF models and concatenate their outputs. Why would I do that? Maybe I want several modalities, such as speech, text, and vision.

Other than that, even for the original use case of a text classifier, I think it’s important to follow best practices and leverage HF to the fullest through Keras subclassing.

Hopefully you found this useful!

Shameless plug — if you’re interested in improving customer service in your company, visit www.loris.ai for more information.


Serj Smorodinsky

NLP Team Leader at Loris.ai. NLP | Neuroscience | Special Education | Literature | Software Engineering. Let’s talk about it!