Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using HNSW index in complex queries #43

Open
lmores opened this issue Dec 27, 2024 · 0 comments
Open

Using HNSW index in complex queries #43

lmores opened this issue Dec 27, 2024 · 0 comments

Comments

@lmores
Copy link

lmores commented Dec 27, 2024

I am experimenting with this extension to implement clustering based on the embeddings generated by a LLM for short text strings.

Consider the following table (at the moment it is filled with around 100K rows, but I'll need to scale up to 100M rows).

CREATE TABLE "items" (
    id BLOB,
    my_text VARCHAR,
    embedding FLOAT32[3072]
);
CREATE INDEX items_embedding_index ON 'items' USING HNSW (embedding) WITH (metric = 'cosine')

My goal is to merge different items into clusters and save them in a table like the following where items sharing the same cluster_id belongs to the same cluster.

CREATE TABLE "clusters" (
    cluster_id BLOB,
    item_id BLOB
);

In my use case clusters correspond to the connected components of the graph having the items as vertices and edges that joins two vertices whenever the cosine distance of their embeddings is less than a given threshold (e.g. 0.15).
Once the graph is built, computing the connected components requires almost linear time using the Union-Find algorithm.

I am having troubles to build a query able to compute the edges of the graph fast enough.
The kind of query I need to run is as follows:

SELECT i.*, j.*
FROM items AS i
INNER JOIN items AS j ON array_cosine_distance(i.embedding, j.embedding) < 0.15 AND a.id < b.id

However, the query does no use the HSNW index, resulting in two sequential scans that lead to huge execution times:

┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│           my_text         │
│         embedding         │
│             id            │
│           my_text         │
│         embedding         │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│   (array_cosine_distance  │
│ (embedding, embedding) < 0│
│            .15)           │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    PIECEWISE_MERGE_JOIN   │
│    ────────────────────   │
│      Join Type: INNER     │
│    Conditions: id < id    ├──────────────┐
│                           │              │
│        ~247067 Rows       │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│           items           ││           items           │
│                           ││                           │
│        Projections:       ││        Projections:       │
│         embedding         ││         embedding         │
│             id            ││             id            │
│          my_text          ││          my_text          │
│                           ││                           │
│        ~98827 Rows        ││        ~98827 Rows        │
└───────────────────────────┘└───────────────────────────┘

As far as I understand, this case is not yet handled by the vss extension.
Here are my questions:

  1. In theory, could this kind of query benefit from the existance of a HNSW index?
  2. If so, do you plan to extend the implementation to cover this use case?

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant