I am experimenting with this extension to implement clustering based on the embeddings generated by an LLM for short text strings.
Consider the following table (at the moment it is filled with around 100K rows, but I'll need to scale up to 100M rows).
CREATE TABLE "items" (
    id BLOB,
    my_text VARCHAR,
    embedding FLOAT32[3072]
);
CREATE INDEX items_embedding_index ON 'items' USING HNSW (embedding) WITH (metric = 'cosine')
My goal is to merge different items into clusters and save them in a table in which items sharing the same cluster_id belong to the same cluster.
In my use case, clusters correspond to the connected components of the graph that has the items as vertices and an edge joining two vertices whenever the cosine distance between their embeddings is below a given threshold (e.g. 0.15).
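To make the edge definition concrete, here is a minimal brute-force sketch in plain Python. The names `cosine_distance` and `threshold_edges` and the toy 2-dimensional vectors are hypothetical and only for illustration. It compares every pair of items, so it is quadratic and only shows what the graph should contain, not how to build it at scale.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def threshold_edges(embeddings, threshold=0.15):
    """Return all pairs (a, b) with a < b whose cosine distance is below threshold.

    `embeddings` is assumed to be a dict mapping item id -> vector.
    """
    ids = sorted(embeddings)
    return [
        (a, b)
        for a, b in combinations(ids, 2)  # each unordered pair exactly once
        if cosine_distance(embeddings[a], embeddings[b]) < threshold
    ]


# Toy data (hypothetical): items 1 and 2 point in almost the same direction.
toy = {1: [1.0, 0.0], 2: [0.99, 0.1], 3: [0.0, 1.0]}
edges = threshold_edges(toy)
```

With this toy data only items 1 and 2 are close enough to be joined by an edge.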
Once the graph is built, computing the connected components requires almost linear time using the Union-Find algorithm.
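For reference, this is roughly the Union-Find step I have in mind, as a sketch in plain Python. The function name `connected_components` and the choice of the root id as cluster_id are my own assumptions for illustration. It uses union by size and path compression, which is what gives the near-linear running time.

```python
def connected_components(ids, edges):
    """Map each item id to a cluster id (the root of its Union-Find tree)."""
    parent = {i: i for i in ids}
    size = {i: 1 for i in ids}

    def find(x):
        # First pass: walk up to the root.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Second pass: path compression, point every visited node at the root.
        while parent[x] != root:
            nxt = parent[x]
            parent[x] = root
            x = nxt
        return root

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Union by size: attach the smaller tree under the larger one.
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size[rb]

    return {i: find(i) for i in ids}


# Toy example (hypothetical ids): two clusters, {1, 2, 3} and {4, 5}.
clusters = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

The resulting mapping can then be written back as (id, cluster_id) rows.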
I am having trouble building a query that computes the edges of the graph fast enough.
The kind of query I need to run is as follows:
SELECT i.*, j.* FROM items AS i
INNER JOIN items AS j ON array_cosine_distance(i.embedding, j.embedding) < 0.15 AND i.id < j.id
However, the query does not use the HNSW index, resulting in two sequential scans that lead to huge execution times.
As far as I understand, this case is not yet handled by the vss extension.
Here are my questions:
Thanks a lot!