Question

SQL DOT_PRODUCT function not recognized (possibly type casting error?)

I have been trying to edit a SQL query in a friend's code in order for it to do a Dot product between two embedding vectors: a query vector and a vector it retrieves from the Postgres database. This database stores paragraphs from different documents, and each row has not only the embedding for the specific paragraph, but the page number, document name, document link and the original paragraph itself.

The original query, which use the Null safe operator to do a comparison between the embeddings as strings, looks like this:

 SELECT content, document, page, line, link 
 FROM documents 
 ORDER BY embedding <=> %(embedding)s 
 LIMIT %(n)s

For reference, the full section of code looks like this:

string = query

# instantiate sentence transformer for embedding
model = SentenceTransformer('all-MiniLM-L6-v2')

# embed the input string
embedding = model.encode(string)

# establish connection to database
conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')

# create embedding array for pgvector
embedding_pgarray = "[" + ",".join(map(str, embedding)) + "]"

# Execute a SQL query to find the top 5 vectors closest to the embedding vector
query =  """ 
    SELECT content, document, page, line, link
    FROM documents
    ORDER BY  embedding <=> %(embedding)s
    LIMIT %(n)s
"""

neighbors = conn.execute(query, {'embedding': embedding_pgarray, 'n':n}).fetchall()

# organize query results into dataframe
df = pd.DataFrame()
for neighbor in neighbors:
    content = neighbor[0].strip()
    document = neighbor[1]
    page = neighbor[2]
    #print("Page: ", page)
    line = neighbor[3]
    link = neighbor[4]

    df1 = pd.DataFrame({'Content':content, 'Document':document,'Page':page, 'Line':line, 'Link':link}, index=[0])
    df = pd.concat((df, df1))

df = df.reset_index(drop=True)
return df   

I edited the SQL query to look like this:

query =  """ 
    SELECT content, document, page, line, link
    FROM documents
    ORDER BY DOT_PRODUCT(embedding, embedding) 
    LIMIT %(n)s
"""

When trying to use DOT_PRODUCT(), I seem to get an error saying that this function is not recognized:

UndefinedFunction: function dot_product(vector, vector) does not exist
LINE 4: ORDER BY DOT_PRODUCT(embedding, embedding)
^
HINT: no function matches the given name and argument types. You might need to add explicit type casts.

Note: I should probably also include the code where the object is inserted into the database to show what the columns are:

conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)

# replace old data
conn.execute('DROP TABLE IF EXISTS documents')
conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, line integer, word text, document text, content text, embedding vector(384), link text, page text)')

model = SentenceTransformer('all-MiniLM-L6-v2')

for i in range(len(context_list)):
    doc_name = []
    line = []
    words = []
    link = []
    page = []
    for j in range(len(context_list[i])):
        doc_name.append(file_names[i])
        line.append(j)
        page.append('N/A')
        words.append(key_words[i])
        link.append(links[i])
    embeddings = model.encode(context_list[i])

    for content, embedding, word, doc_name, line, link, page in zip(context_list[i], embeddings, words, doc_name, line, link, page):
        conn.execute('INSERT INTO documents (line, word, document, content, embedding, link, page) VALUES (%s, %s, %s, %s, %s, %s, %s)', (line, word, doc_name, content, embedding, link, page))

I have tried different variations of the DOT_PRODUCT including VECTOR_COSINE. I've also tried different type casts for the embedding objects. All I can think is that the type of the vector for one of the objects is off.

I have checked the type for both, one is a vector of Float32 objects, the other a vector of int. I am not sure how to change the object type for a vector within a SQL query itself, and even then, I am not sure if that is the problem with DOT_PRODUCT that I am experiencing. I have also made sure I have the right extensions for DOT_PRODUCT, as I believe the extension required for that is vector.

Any ideas or other things I can try?

 4  51  4
1 Jan 1970

Solution

 1

In short: there's no function named dot_product() in pgvector. Use <#>, the inner product distance operator.

You can query system catalogs pg_proc, pg_depend and pg_extension to check if you actually have a specific function, what extension it belongs to and which schema/namespace it's in:
demo at db<>fiddle

select e.extnamespace::regnamespace
     , e.extname
     , p.pronamespace::regnamespace
     , p.oid::regprocedure as "function with argument types"
from pg_catalog.pg_proc      as p
join pg_catalog.pg_depend    as d on (d.objid    = p.oid)
join pg_catalog.pg_extension as e on (d.refobjid = e.oid)
where p.proname ilike 'dot_product'
order by 2,4;

In a standard PostgreSQL installation, as of version 16, this returns no results - there's no built-in dot_product function. This statement you ran:

CREATE EXTENSION IF NOT EXISTS vector

Most likely attempts to add pgvector extension (most likely, because nothing stops you from building a different extension and exposing it under the name vector). Because of if not exists it won't do anything if the extension had already been added, but if it's in a schema that's not in your search_path, it'll act as if it wasn't there unless you either schema-qualify all references to anything from that extension

select some_other_schema.st_astext('point(0 0)'::some_other_schema.geometry);

Or add that schema to your search path:

set search_path='some_other_schema';
select st_astext(st_geomfromtext('point(0 0)'));
2024-07-09
Zegarek