Question
SQL DOT_PRODUCT function not recognized (possibly type casting error?)
I have been trying to edit a SQL query in a friend's code in order for it to do a Dot product between two embedding vectors: a query vector and a vector it retrieves from the Postgres database. This database stores paragraphs from different documents, and each row has not only the embedding for the specific paragraph, but the page number, document name, document link and the original paragraph itself.
The original query, which use the Null safe operator to do a comparison between the embeddings as strings, looks like this:
SELECT content, document, page, line, link
FROM documents
ORDER BY embedding <=> %(embedding)s
LIMIT %(n)s
For reference, the full section of code looks like this:
string = query
# instantiate sentence transformer for embedding
model = SentenceTransformer('all-MiniLM-L6-v2')
# embed the input string
embedding = model.encode(string)
# establish connection to database
conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
# create embedding array for pgvector
embedding_pgarray = "[" + ",".join(map(str, embedding)) + "]"
# Execute a SQL query to find the top 5 vectors closest to the embedding vector
query = """
SELECT content, document, page, line, link
FROM documents
ORDER BY embedding <=> %(embedding)s
LIMIT %(n)s
"""
neighbors = conn.execute(query, {'embedding': embedding_pgarray, 'n':n}).fetchall()
# organize query results into dataframe
df = pd.DataFrame()
for neighbor in neighbors:
content = neighbor[0].strip()
document = neighbor[1]
page = neighbor[2]
#print("Page: ", page)
line = neighbor[3]
link = neighbor[4]
df1 = pd.DataFrame({'Content':content, 'Document':document,'Page':page, 'Line':line, 'Link':link}, index=[0])
df = pd.concat((df, df1))
df = df.reset_index(drop=True)
return df
I edited the SQL query to look like this:
query = """
SELECT content, document, page, line, link
FROM documents
ORDER BY DOT_PRODUCT(embedding, embedding)
LIMIT %(n)s
"""
When trying to use DOT_PRODUCT()
, I seem to get an error saying that this function is not recognized:
UndefinedFunction: function dot_product(vector, vector) does not exist
LINE 4: ORDER BY DOT_PRODUCT(embedding, embedding)
^
HINT: no function matches the given name and argument types. You might need to add explicit type casts.
Note: I should probably also include the code where the object is inserted into the database to show what the columns are:
conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)
# replace old data
conn.execute('DROP TABLE IF EXISTS documents')
conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, line integer, word text, document text, content text, embedding vector(384), link text, page text)')
model = SentenceTransformer('all-MiniLM-L6-v2')
for i in range(len(context_list)):
doc_name = []
line = []
words = []
link = []
page = []
for j in range(len(context_list[i])):
doc_name.append(file_names[i])
line.append(j)
page.append('N/A')
words.append(key_words[i])
link.append(links[i])
embeddings = model.encode(context_list[i])
for content, embedding, word, doc_name, line, link, page in zip(context_list[i], embeddings, words, doc_name, line, link, page):
conn.execute('INSERT INTO documents (line, word, document, content, embedding, link, page) VALUES (%s, %s, %s, %s, %s, %s, %s)', (line, word, doc_name, content, embedding, link, page))
I have tried different variations of the DOT_PRODUCT
including VECTOR_COSINE
. I've also tried different type casts for the embedding objects. All I can think is that the type of the vector for one of the objects is off.
I have checked the type for both, one is a vector of Float32 objects, the other a vector of int. I am not sure how to change the object type for a vector within a SQL query itself, and even then, I am not sure if that is the problem with DOT_PRODUCT
that I am experiencing. I have also made sure I have the right extensions for DOT_PRODUCT
, as I believe the extension required for that is vector.
Any ideas or other things I can try?