Creating BERT Embeddings with Hugging Face Transformers


Transformers were originally designed for machine translation, converting text from one language into another. BERT changed how we study and process human language by building on the encoder, the part of the original transformer that reads and understands text. BERT embeddings are especially good at capturing sentences with complex meanings, because the model examines the whole sentence and learns how its words relate to one another. The Hugging Face Transformers library makes it straightforward to load BERT and create these sentence embeddings.

Learning Objectives

  • Get a good grasp of BERT and pretrained models. Understand how important they are in working with human language.
  • Learn how to use the Hugging Face Transformers library effectively. Use it to create special representations of text.
  • Figure out various ways to correctly extract these representations from pretrained BERT models. This is important because different language tasks need different approaches.
  • Get hands-on experience by actually doing the steps needed to create these representations. Make sure you can do it on your own.
  • Learn how to use these representations you’ve created to improve other language tasks like sorting text or figuring out emotions in text.
  • Explore adjusting pretrained models to work even better for specific language tasks. This can lead to better results.
  • Find out where these representations are used to make language tasks work better. See how they improve the accuracy and performance of language models.

This article was published as a part of the Data Science Blogathon.

What Are Pipelines in the Context of Transformers?

Think of pipelines as a user-friendly tool that simplifies the complex code found in the transformers library. They make it easy for people to use models for tasks like understanding language, analyzing sentiments, extracting features, answering questions, and more. They provide a neat way to interact with these powerful models.

Pipelines include a few essential components: a tokenizer (which turns raw text into smaller units, or tokens, for the model to work with), the model itself (which makes predictions based on the input), and pre- and post-processing steps that put the data into the form the model expects and turn its raw outputs into readable results.
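To make the three stages concrete, here is a minimal pure-Python sketch of a pipeline. Everything in it is a toy assumption for illustration: `TOY_WEIGHTS`, `toy_tokenize`, `toy_model`, `toy_postprocess` and `toy_pipeline` are hypothetical names, not part of the Transformers library, and the "model" is just a word-weight lookup rather than a neural network.

```python
import math

# Hypothetical per-word sentiment weights standing in for a trained model.
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}

def toy_tokenize(text):
    """Stage 1 (tokenizer): turn raw text into lowercase word tokens."""
    return text.lower().replace(".", " ").replace("!", " ").split()

def toy_model(tokens):
    """Stage 2 (model): produce raw scores (logits) for [NEGATIVE, POSITIVE]."""
    total = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return [-total, total]

def toy_postprocess(logits, labels=("NEGATIVE", "POSITIVE")):
    """Stage 3 (post-processing): softmax the logits and pick the best label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for stability
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three stages, as a real pipeline does internally."""
    return toy_postprocess(toy_model(toy_tokenize(text)))

result = toy_pipeline("This library is great!")
print(result["label"])
```

A real pipeline follows the same shape, but the tokenizer and model are loaded from pretrained checkpoints instead of being written by hand.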

Why Use Hugging Face Transformers?

Transformer models are usually very large, and training them and deploying them in real applications can be quite complex. Hugging Face Transformers aims to make this whole process simpler. It provides a single way to load, train, and save any Transformer model, no matter how large. It also lets you use different frameworks at different stages of a model’s life: you can train a model with one framework and then load it in another for real-world tasks without much hassle.

Advanced Features

  • These modern models are easy to use and give great results in understanding and generating human language and in tasks related to computer vision and audio.
  • They also help save computer processing power and are better for the environment because researchers can share their already-trained models, so others don’t have to train them all over again.
  • With just a few lines of code, you can pick the best software tools for each step of the model’s life, whether it’s training, testing, or using it for real tasks.
  • Plus, plenty of examples for each type of model make it easy to use them for your specific needs, following what the original creators did.

Hugging Face Tutorial

This tutorial covers the basics of working with datasets. The main aim of the Hugging Face Datasets library is to make it easier to load datasets that come in different formats or types.

Exploring the Datasets

Usually, bigger datasets give better results. Hugging Face’s Datasets library has a feature that lets you quickly download and prepare many public datasets. You can directly get and store datasets using their names from the Hugging Face Hub. The result is like a dictionary containing all splits of the dataset (for example, train and test), which you can access by their names.

A great thing about Hugging Face’s Datasets library is how it manages storage on your computer and uses something called Apache Arrow. This helps it handle even large datasets without using up too much memory.

You can learn more about what’s inside a dataset by looking at its features. If there are parts you don’t need, you can easily get rid of them. You can also change the names of labels to ‘labels’ (which Hugging Face Transformers models expect) and set the output format to different platforms like torch, TensorFlow, or numpy.

Language Translation

Translation converts a sequence of words in one language into another language. Training a new translation model from scratch requires a large amount of text in two or more languages. Pretrained models such as Marian have already learned from a big collection of French and English text, so they give you a head start and can be fine-tuned further. The snippet below does not train anything: it simply loads the pipeline’s default English-to-French model and translates a sentence.

from transformers import pipeline

translator = pipeline("translation_en_to_fr")
translation = translator("What's your name?")
## [{'translation_text': "Quel est ton nom ?"}]

Zero-Shot Classification

Zero-shot classification sorts text using a model that has been trained on natural language inference. Most text classifiers are trained on a fixed list of categories, but a zero-shot classifier accepts the candidate labels at inference time, so you can change the categories without retraining. This makes it really adaptable, even though it might work a bit slower. Multilingual models of this kind can classify text in around 15 different languages, even for categories they never saw during training. You can easily use such a model by getting it from the hub.
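One common way to implement zero-shot classification is to pair the text with one hypothesis sentence per candidate label and score each pair with an inference model. The sketch below shows only that bookkeeping in pure Python. The names `build_nli_pairs`, `zero_shot_classify` and `toy_scorer` are hypothetical, and `toy_scorer` is a word-overlap stand-in, not a real inference model, so treat this as an outline of the idea rather than the library's implementation.

```python
import math

def build_nli_pairs(text, candidate_labels, template="This example is {}."):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in candidate_labels]

def zero_shot_classify(text, candidate_labels, entailment_score):
    """Rank labels by a softmax over per-label entailment scores.

    `entailment_score(premise, hypothesis)` is an assumed callable standing in
    for an inference model; it returns a float (higher = more entailed).
    """
    pairs = build_nli_pairs(text, candidate_labels)
    scores = [entailment_score(premise, hyp) for premise, hyp in pairs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return sorted(zip(candidate_labels, probs), key=lambda p: p[1], reverse=True)

def toy_scorer(premise, hypothesis):
    """Toy stand-in: count the words shared by premise and hypothesis."""
    words = lambda s: set(s.lower().replace(".", "").split())
    return float(len(words(premise) & words(hypothesis)))

ranked = zero_shot_classify(
    "The match ended with a late goal in football",
    ["football", "cooking", "politics"],
    toy_scorer,
)
print(ranked[0][0])
```

Because the labels are only used to build hypothesis sentences, you can pass a different label list on every call without touching the model.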

Sentiment Analysis

You create a pipeline using the “pipeline()” function in Hugging Face Transformers. It makes it easy to load a model from the hub that has already been fine-tuned for sentiment analysis and then use it to analyze the sentiment of your own text.

Step 1: Get the right model for the task you want to do. For example, we’re getting the distilled BERT base model for classifying sentiments in this case.

from transformers import pipeline

chosen_model = "distilbert-base-uncased-finetuned-sst-2-english"
distil_bert = pipeline(task="sentiment-analysis", model=chosen_model)

The model is now ready to use. Calling the pipeline on a text or a list of texts returns a sentiment label (POSITIVE or NEGATIVE) and a confidence score for each input.

Question Answering

The question-answering model is like a smart tool. You give it some text, and it can find answers in that text. It’s handy for getting information from different documents. The model extracts the answer directly from the text you supply, so it does not need any background information beyond that text.

You can easily use question-answering models and the Hugging Face Transformers library with the “question-answering pipeline.”

If you don’t tell it which model to use, the pipeline starts with a default one called “distilbert-base-cased-distilled-squad.” This pipeline takes a question and some context related to the question, and then figures out the answer from that context.

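from transformers import pipeline

qa_pipeline = pipeline("question-answering")
query = "What is my place of residence?"
# Assumed example context: the original context string is not shown in this article.
context_text = "I work as a data analyst and I live in India."
qa_result = qa_pipeline(question=query, context=context_text)
## Output reported for the article's own context (score and offsets will differ here):
## {'answer': 'India', 'end': 39, 'score': 0.953, 'start': 31}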

BERT Word Embeddings

Creating word embeddings with BERT begins with the BERT tokenizer, which breaks the input text into individual words or word pieces. This processed input then goes through the BERT model, which produces a sequence of hidden states, one vector per token. These hidden states, usually taken from the last layer or combined from the last few layers, serve as the embeddings of the tokens in the input text.

What’s special about BERT word embeddings is that they understand the context. This means the embedding of a word can change depending on how it’s used in a sentence. Other methods for word embeddings usually create the same embedding for a word, no matter where it appears in a sentence.

Why Use BERT Embeddings?

BERT, short for “Bidirectional Encoder Representations from Transformers,” is a method for pretraining language understanding models. The pretrained models are freely available and give a solid foundation to anyone working on language-related tasks. These models have two main uses: you can use them to extract useful features from your text data, or you can fine-tune them with your own data to do specific jobs like classifying text, recognizing named entities, or answering questions.

BERT becomes useful once you pass it some text, like a sentence or a document. It is great at pulling out important features from text, like the meanings of words and sentences. These features are helpful for tasks like keyword extraction, similarity search, and information retrieval. What’s special about BERT is that it understands words not just on their own but in the context they’re used in. This makes it better than models like Word2Vec, which assign each word a single vector regardless of the words around it. Plus, BERT encodes the position of each word in the sentence, which is important for capturing word order.
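Similarity search over embeddings usually comes down to comparing vectors, and cosine similarity is a common choice. The sketch below is pure Python and uses made-up 3-dimensional vectors in place of real BERT embeddings (which have hundreds of dimensions); `cosine_similarity`, `most_similar` and `toy_embeddings` are illustrative names, not library functions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def most_similar(query_vec, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query_vec, candidates[name]),
        reverse=True,
    )

# Made-up vectors standing in for real sentence embeddings.
toy_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.0, 0.0]
ranking = most_similar(query, toy_embeddings)
print(ranking)
```

With real embeddings, the query and the documents would be encoded by the same model before being compared this way.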

Loading Pre-Trained BERT

Hugging Face Transformers allows you to use BERT in PyTorch, which you can install easily. This library also has tools to work with other advanced language models like OpenAI’s GPT and GPT-2.

!pip install transformers

You must bring in PyTorch, the pre-trained BERT model, and a BERT Tokenizer to get started.

import torch
from transformers import BertTokenizer, BertModel

Transformers provides different classes for using BERT in many tasks, like token classification and text classification. But if you want to get word representations, BertModel is the best choice.

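# OPTIONAL: Enable the logger for tracking information
import logging
logging.basicConfig(level=logging.INFO)

# Plotting setup (the %matplotlib line is a notebook magic and only works in Jupyter/Colab)
import matplotlib.pyplot as plt
%matplotlib inline

# Load the tokenizer for the pre-trained model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')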

Input Formatting

When working with a pre-trained BERT model for understanding human language, it’s crucial to ensure your input data is in the right format. Let’s break it down:

  1. Special Tokens for Sentence Boundaries: BERT needs your input as a sequence of word or subword pieces, so each sentence is first broken into smaller parts. You must also add special tokens at the start ([CLS]) and end ([SEP]) of each sentence.
  2. Keeping Sentences the Same Length: To effectively work with a bunch of input data, you must ensure all your sentences are the same length. You can do this by adding extra “padding” tokens to shorter sentences or cutting down longer ones.
  3. Using an Attention Mask: When you add padding tokens to make sentences the same length, you also use an “attention mask.” This is like a map that helps BERT know which parts are actual words (marked as 1) and which are padding (marked as 0). This mask is included with your input data when you give it to the BERT model.
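The three formatting steps above can be sketched in a few lines of pure Python. This is only an illustration of what the tokenizer does for you: `format_for_bert` is a hypothetical helper, the word-piece ids 11 to 15 are made up, and the special-token ids (101, 102, 0) are assumed to be those of the bert-base-uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids of [CLS], [SEP], [PAD]

def format_for_bert(piece_ids, max_length):
    """Add [CLS]/[SEP], then pad or truncate to exactly max_length ids."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = piece_ids[: max_length - 2]      # truncate, keeping room for specials
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 marks a real token
    n_pad = max_length - len(input_ids)
    # Pad both lists; 0 in the mask marks padding that BERT should ignore.
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# A short sentence is padded, a long one is truncated.
short_ids, short_mask = format_for_bert([11, 12], max_length=6)
long_ids, long_mask = format_for_bert([11, 12, 13, 14, 15], max_length=4)
print(short_ids, short_mask)
```

In practice you would not write this by hand, since the tokenizer performs all three steps in one call.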

Special Tokens

Here’s what these tokens do in simpler terms:

  1. [SEP] Separates Sentences: Adding [SEP] at the end of a sentence is crucial. When BERT sees two sentences and needs to understand their connection, [SEP] helps it know where one sentence ends and the next begins.
  2. [CLS] Shows the Main Idea: For tasks where you classify or sort text, every input starts with [CLS]. The final hidden state of this token is used as a summary of the whole input, and it is what a classifier reads to decide the category of the text.

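BERT base has 12 transformer layers. Each layer outputs one vector per input token, so every layer’s output has the same length as the tokenized input. The vectors themselves differ from layer to layer, though, because each layer refines the representation produced by the one before it.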
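To see the per-layer outputs, you can ask BertModel to return all hidden states. The sketch below assumes PyTorch and transformers are installed. To avoid downloading weights it builds a tiny, randomly initialised model from a BertConfig with made-up sizes, so the vectors are meaningless and only the shapes are of interest; with BertModel.from_pretrained('bert-base-uncased') you would get 12 layers with 768-dimensional vectors instead.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random BERT (assumed sizes, chosen only to keep the demo fast).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)
model.eval()  # disable dropout so the forward pass is deterministic

input_ids = torch.tensor([[1, 5, 6, 7, 2]])  # one sequence of 5 made-up token ids
attention_mask = torch.ones_like(input_ids)  # no padding in this example

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )

# Shape is (batch, tokens, hidden): one vector per input token.
last_layer_shape = tuple(outputs.last_hidden_state.shape)
# hidden_states holds the embedding output plus one entry per layer.
num_hidden_states = len(outputs.hidden_states)
per_layer_shapes = {tuple(h.shape) for h in outputs.hidden_states}
print(last_layer_shape, num_hidden_states)
```

Which layer, or combination of layers, works best as an embedding depends on the task, so it is worth comparing a few options on your own data.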

The ‘encode’ function in the Hugging Face Transformers library tokenizes your text and converts it to token IDs. Before using this function on your text, you should decide on the maximum sentence length, which is used to pad shorter sentences and truncate longer ones.

How to Tokenize Text?

The tokenizer.encode_plus function streamlines several processes:

  1. Segmenting the sentence into tokens
  2. Introducing special [SEP] and [CLS] tokens
  3. Mapping tokens to their corresponding IDs
  4. Ensuring uniform sentence length through padding or truncation
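  5. Crafting attention masks that distinguish actual tokens from [PAD] tokens.

input_ids = []
attention_masks = []

# `sentences` is assumed to be a list of strings defined earlier.
for sentence in sentences:
    encoded_dict = tokenizer.encode_plus(
                        sentence,                   # Sentence to encode
                        add_special_tokens=True,    # Add '[CLS]' and '[SEP]'
                        max_length=64,              # Target sentence length
                        padding="max_length",       # Pad shorter sentences
                        truncation=True,            # Truncate longer sentences
                        return_attention_mask=True, # Generate attention masks
                        return_tensors="pt",        # Return PyTorch tensors
                   )
    # Keep the token IDs and the attention mask (1 = real token, 0 = padding).
    input_ids.append(encoded_dict["input_ids"])
    attention_masks.append(encoded_dict["attention_mask"])

# Stack the per-sentence tensors into tensors of shape (num_sentences, 64).
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)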

Segment ID

BERT can take a pair of sentences as a single input. For each token in the tokenized text, a segment ID marks whether it belongs to the first sentence (marked with 0s) or the second sentence (marked with 1s).

For a sentence pair, you give a value of 0 to every token in the first sentence, including [CLS] and the first ‘[SEP]’ token, and you give a value of 1 to every token in the second sentence, including its closing ‘[SEP]’.
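As a small illustration of this rule, the pure-Python sketch below builds the input IDs and segment IDs for a sentence pair. `build_pair` is a hypothetical helper, the word-piece ids are made up, and 101 and 102 are assumed to be the [CLS] and [SEP] ids of the bert-base-uncased vocabulary. In Hugging Face tokenizers the segment IDs are returned under the name token_type_ids.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids of [CLS] and [SEP]

def build_pair(ids_a, ids_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, segment_ids

# Made-up word-piece ids: three pieces in sentence A, two in sentence B.
pair_ids, pair_segments = build_pair([11, 12, 13], [21, 22])
print(pair_ids)
print(pair_segments)
```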

Now, let’s talk about how you can use BERT with your text:

The BERT model learns rich representations of the English language, which can help you extract different features of a text for various tasks.

If you have a set of sentences with labels, you can train a regular classifier that takes the features produced by the BERT model as its input.

To obtain the features of a particular text using this model in TensorFlow:

from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained("bert-base-cased")

custom_text = "You are welcome to utilize any text of your choice."
encoded_input = tokenizer(custom_text, return_tensors="tf")
output_embeddings = model(encoded_input)
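The model output holds one vector per token (in `last_hidden_state`), so a further step is needed when you want a single vector for the whole sentence. One common option is to average the token vectors while skipping padding. The pure-Python sketch below shows that idea with made-up 2-dimensional vectors; `mean_pool` is an illustrative helper, not a library function.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Made-up token vectors; the last row stands for a [PAD] position.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

Other choices, such as using the [CLS] vector or the model's `pooler_output`, may suit classification tasks better, so it is worth trying more than one.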


BERT is a powerful language model developed by Google. It learns general language patterns from large amounts of text, and you can fine-tune it for specific tasks, like figuring out what a sentence means. Hugging Face Transformers, in turn, is a popular open-source library for working with language models. It gives you pre-trained BERT models, making it much easier to use them for specific language jobs.

Key Takeaways

  • In simple terms, using word representations from pretrained BERT models is incredibly useful for a wide range of natural language tasks like sorting text, figuring out feelings in text, and recognizing the names of things.
  • These models have already learned a lot from big data sets, and they tend to work well for various tasks.
  • You can make them even better for specific jobs by adjusting the knowledge they’ve already gained.
  • What’s more, getting these word representations from the models helps you use what they’ve learned in other language tasks, and it can make other models work better. All in all, using pretrained BERT models for word representations is a promising approach to language processing.

Frequently Asked Questions

Q1. What is a Hugging Face transformer?

A. Hugging Face Transformers is an open-source library that gives people access to advanced, ready-to-use pretrained models. You can find these models on the Hugging Face website.

Q2. What defines a pre-trained transformer?

A. A pretrained transformer is a model that has already been trained and validated by individuals or organizations. These models can be used as a starting point for similar tasks.

Q3. Is Hugging Face available for free?

A. Hugging Face has two versions: one for regular folks and another for organizations. The regular one has a free option with some limits and a pro version that costs $9 monthly. Organizations get access to Lab and business solutions, which aren’t free.

Q4. Which frameworks are supported by Hugging Face?

A. Hugging Face provides tools for about 31 different libraries. Most of them are used for deep learning, like PyTorch, TensorFlow, JAX, ONNX, fastai, Stable-Baselines3, and more.

Q5. Which programming languages are employed by Hugging Face?

A. Some of these pretrained models have been trained to understand multiple languages, and they can work with programming languages like JavaScript, Python, Rust, and Bash/Shell. If you’re interested in this, you might want to take a Python Natural Language Processing course to learn how to clean up text data effectively.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.