Using HNSW index in complex queries #43

lmores · 2024-12-27T11:35:06Z

I am experimenting with this extension to implement clustering based on the embeddings generated by a LLM for short text strings.

Consider the following table (at the moment it is filled with around 100K rows, but I'll need to scale up to 100M rows).

CREATE TABLE "items" (
    id BLOB,
    my_text VARCHAR,
    embedding FLOAT32[3072]
);
CREATE INDEX items_embedding_index ON 'items' USING HNSW (embedding) WITH (metric = 'cosine')

My goal is to merge different items into clusters and save them in a table like the following where items sharing the same cluster_id belongs to the same cluster.

CREATE TABLE "clusters" (
    cluster_id BLOB,
    item_id BLOB
);

In my use case clusters correspond to the connected components of the graph having the items as vertices and edges that joins two vertices whenever the cosine distance of their embeddings is less than a given threshold (e.g. 0.15).
Once the graph is built, computing the connected components requires almost linear time using the Union-Find algorithm.

I am having troubles to build a query able to compute the edges of the graph fast enough.
The kind of query I need to run is as follows:

SELECT i.*, j.*
FROM items AS i
INNER JOIN items AS j ON array_cosine_distance(i.embedding, j.embedding) < 0.15 AND a.id < b.id

However, the query does no use the HSNW index, resulting in two sequential scans that lead to huge execution times:

┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│           my_text         │
│         embedding         │
│             id            │
│           my_text         │
│         embedding         │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│   (array_cosine_distance  │
│ (embedding, embedding) < 0│
│            .15)           │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    PIECEWISE_MERGE_JOIN   │
│    ────────────────────   │
│      Join Type: INNER     │
│    Conditions: id < id    ├──────────────┐
│                           │              │
│        ~247067 Rows       │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│           items           ││           items           │
│                           ││                           │
│        Projections:       ││        Projections:       │
│         embedding         ││         embedding         │
│             id            ││             id            │
│          my_text          ││          my_text          │
│                           ││                           │
│        ~98827 Rows        ││        ~98827 Rows        │
└───────────────────────────┘└───────────────────────────┘

As far as I understand, this case is not yet handled by the vss extension.
Here are my questions:

In theory, could this kind of query benefit from the existance of a HNSW index?
If so, do you plan to extend the implementation to cover this use case?

Thanks a lot!

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using HNSW index in complex queries #43

Using HNSW index in complex queries #43

lmores commented Dec 27, 2024

Using HNSW index in complex queries #43

Using HNSW index in complex queries #43

Comments

lmores commented Dec 27, 2024