Question
Different embeddings for same sentences with torch transformer
Hey all, and apologies in advance for what is probably a fairly basic question. I have a theory about what's causing the issue here, but it would be great to confirm with people who know more about this than I do.
I've been trying to run this Python code snippet in Google Colab. The snippet is meant to compute similarity between sentences. The code runs fine, but the embeddings and distances change every time I run it, which isn't ideal for my intended use case.
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")
model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
I think the issue must be model-specific, since I receive a warning about newly initialized pooler weights, and pooler_output is exactly what the code uses to compute similarity. (The message below is from a run with the related qiyuw/pcl-roberta-large checkpoint; the bert-base model in my snippet gives the same kind of warning.)
Some weights of RobertaModel were not initialized from the model checkpoint at qiyuw/pcl-roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Switching to a model which does not give this warning (for example, sentence-transformers/all-mpnet-base-v2) makes the outputs reproducible, which supports the theory that the randomly initialized pooler weights are the cause. So here are my questions:
- Can I make the output reproducible by initialising/seeding the model differently? (See the seeding sketch after this list.)
- If I can't make the outputs fully reproducible, is there a way to reduce the amount of variation between runs, e.g. by pooling differently? (See the mean-pooling sketch below.)
- Is there a way to search Hugging Face models for ones whose checkpoints actually include trained pooler weights, so I can find a model which does suit my purposes? (See the loading-info sketch below.)
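To make the first question concrete: as far as I understand, the pooler is the only randomly initialized part, so seeding the RNGs before loading should at least make the untrained pooler weights identical across runs. A minimal sketch of what I mean (the seed value 42 is arbitrary):

from transformers import AutoModel, set_seed

# Seed the Python/NumPy/PyTorch RNGs before from_pretrained() so the
# randomly initialized pooler weights come out the same on every run.
set_seed(42)
model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

I assume this would make the numbers reproducible, but the pooler would still be untrained, so I'm not sure the resulting similarities would be meaningful.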
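For the second question, the workaround I'm considering is to skip pooler_output entirely and pool the last hidden state myself, since those weights do come from the checkpoint. A sketch with mean pooling over non-padding tokens (mean_pool is just my own helper name):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qiyuw/pcl-bert-base-uncased")
model = AutoModel.from_pretrained("qiyuw/pcl-bert-base-uncased")

def mean_pool(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

texts = ["There's a kid on a skateboard.", "A kid is skateboarding."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
    embeddings = mean_pool(out.last_hidden_state, inputs["attention_mask"])

Would embeddings built this way be deterministic across runs, given that nothing randomly initialized is involved?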
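And for the third question, the closest thing I've found to a programmatic check is the output_loading_info flag on from_pretrained, which I believe reports which weights were missing from the checkpoint and therefore randomly initialized:

from transformers import AutoModel

# info["missing_keys"] lists weights that were NOT in the checkpoint
# and were therefore newly (randomly) initialized.
model, info = AutoModel.from_pretrained(
    "qiyuw/pcl-bert-base-uncased", output_loading_info=True
)
print("pooler weights missing from checkpoint:",
      [k for k in info["missing_keys"] if "pooler" in k])

Is there a better way to filter for this directly on the Hub, rather than downloading each candidate model?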
Thanks in advance