Question

Different embeddings for same sentences with torch transformer

Hey all, and apologies in advance for what is probably a fairly basic question. I have a theory about what's causing the issue here, but it would be great to confirm it with people who know more about this than I do.

I've been trying to run this Python code snippet in Google Colab. The snippet is meant to compute similarity between sentences. The code runs fine, but what I'm finding is that the embeddings and distances change every time I run it, which isn't ideal for my intended use case.

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")
model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

I think the issue must be model-specific, since I receive the warning below about newly initialized pooler weights, and pooler_output is ultimately what the code reads to compute similarity:

Some weights of RobertaModel were not initialized from the model checkpoint at qiyuw/pcl-roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Switching to an alternative model that does not give this warning (for example, sentence-transformers/all-mpnet-base-v2) makes the outputs reproducible, so I think the cause is the weight initialization mentioned in the warning. So here are my questions:

  1. Can I make the output reproducible by initialising/seeding the model differently?
  2. If I can't make the outputs reproducible, is there a way in which I can improve the accuracy to reduce the amount of variation between runs?
  3. Is there a way to search Hugging Face models for ones that do initialise the pooler weights, so I can find a model that suits my purposes?

Thanks in advance


Solution


You are correct: the model layer weights bert.pooler.dense.bias and bert.pooler.dense.weight are initialized randomly. You can initialize these layers the same way every time to get reproducible output, but I doubt the inference code you copied from their README is correct. As you already mentioned, the pooling layers are not initialized from the checkpoint, and the repo's model class also makes sure the pooling layer is not added:

...
self.bert = BertModel(config, add_pooling_layer=False)
...
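As a side note, if you want to avoid the randomly initialized pooler entirely, you can load the backbone yourself without the pooling layer. This is only a minimal sketch (not part of the repo's code), relying on from_pretrained forwarding the add_pooling_layer flag shown above to the BertModel constructor:

from transformers import BertModel

# Loading without the pooling layer skips the random pooler initialization
# (and the corresponding warning); pooler_output is then simply not available.
model = BertModel.from_pretrained("qiyuw/pcl-bert-base-uncased", add_pooling_layer=False)
print(model.pooler)  # None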

According to the README, the repo's evaluation script should be called with the following command:

python evaluation.py --model_name_or_path qiyuw/pcl-bert-base-uncased --mode test --pooler cls_before_pooler

When you look into that script, the --pooler cls_before_pooler option means taking the [CLS] token's last hidden state instead of pooler_output, so your inference code for qiyuw/pcl-bert-base-uncased should look like this:

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")
model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.inference_mode():
    embeddings = model(**inputs)
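    # Take the [CLS] token from last_hidden_state ("cls_before_pooler") instead of the randomly initialized pooler_output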
    embeddings = embeddings.last_hidden_state[:, 0]

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

Output:

Cosine similarity between "There's a kid on a skateboard." and "A kid is skateboarding." is: 0.941
Cosine similarity between "There's a kid on a skateboard." and "A kid is inside the house." is: 0.779

Can I make the output reproducible by initialising/seeding the model differently?

Yes, you can. Use torch.manual_seed:

import torch
from transformers import AutoModel, AutoTokenizer

model_random = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")
torch.manual_seed(42)
model_reproducible1 = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

torch.manual_seed(42)
model_reproducible2 = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

print(torch.allclose(model_random.pooler.dense.weight, model_reproducible1.pooler.dense.weight))
print(torch.allclose(model_random.pooler.dense.weight, model_reproducible2.pooler.dense.weight))
print(torch.allclose(model_reproducible1.pooler.dense.weight, model_reproducible2.pooler.dense.weight))

Output:

False
False
True
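Applied to your script, that means calling torch.manual_seed immediately before from_pretrained so the randomly initialized pooler weights are the same on every run. A minimal sketch, reusing the model and one of the sentences from your question:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")

# Seeding right before loading makes the randomly initialized pooler weights
# identical across runs, so pooler_output becomes reproducible as well.
torch.manual_seed(42)
model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

inputs = tokenizer(["A kid is skateboarding."], return_tensors="pt")
with torch.inference_mode():
    pooled = model(**inputs).pooler_output

print(pooled[0, :5])  # identical values on every run of the script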
2024-07-04
cronoik

Solution


I doubt this is model-specific. Pre-trained embedding models are known to produce slightly different embeddings for the same input (in your case, texts).

There are a few open conversations about this. I have experimented with this on both a blank and a pre-trained RoBERTa model and found that setting model seeds for embedding tasks controls this behavior. However, adjusting the seed for the model parameters (like the bias) may not be enough; some embedding models sample from a distribution of tokens at the end of their inference process.

The warning message:

Some weights of RobertaModel were not initialized from the model checkpoint at qiyuw/pcl-roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

implies that those weights were not downloaded from the pretrained checkpoint and used in initialization. Depending on the initialization process of that particular BERT model, it's very likely you're generating new embeddings on every initialization.

Furthermore, do not initialize the model every time you run the script. I suggest you use a Jupyter notebook: create a cell, initialize the model once, and check whether you get the same embeddings for the same input before moving on to something more elaborate like controlling the seeds of whatever layers use them.

As an example, observe the following code for the Embedding layer from the PyTorch docs:

import torch
from torch import nn

embedding = nn.Embedding(10, 3)  # weights are drawn from N(0, 1) at construction
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])
embedding(input)

If the above code is run every time (regardless of whether you have already initialized embedding), you will get new embeddings, because the layer's weights are drawn from a normal distribution N(0, 1) at construction. If you do not reinitialize embedding, you will see that the values generated for input do not change no matter how many times you run it.

Indeed, the BERT model you're using above makes similar (slightly more complex) use of embedding layers.
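For completeness, here is a small sketch showing how seeding makes that random initialization repeatable, which is the same mechanism that makes the randomly initialized layers of the transformer model reproducible:

import torch
from torch import nn

torch.manual_seed(0)
embedding_a = nn.Embedding(10, 3)

torch.manual_seed(0)
embedding_b = nn.Embedding(10, 3)

input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])
# Same seed, same randomly drawn weights, therefore identical embeddings
print(torch.allclose(embedding_a(input), embedding_b(input)))  # True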

2024-07-01
GKE