Add parent join support for faiss hnsw (opensearch-project#1398)
* Add patch to support multi vector in faiss (opensearch-project#1358)

Signed-off-by: Heemin Kim <[email protected]>

* Initialize id_map as null (opensearch-project#1363)

Signed-off-by: Heemin Kim <[email protected]>

* Add support of multi vector in jni (opensearch-project#1364)

Signed-off-by: Heemin Kim <[email protected]>

* Multi vector support for Faiss HNSW (opensearch-project#1371)

Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.

Signed-off-by: Heemin Kim <[email protected]>

* Add data generation script for nested field (opensearch-project#1388)

Signed-off-by: Heemin Kim <[email protected]>

* Add perf test for nested field (opensearch-project#1394)

Signed-off-by: Heemin Kim <[email protected]>

---------

Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 709b448)
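
The parentId deduplication described for opensearch-project#1371 can be illustrated with a small sketch. This is not the Faiss patch itself (which applies the grouping inside the HNSW search); it is a simplified post-processing version in Python, and the names `top_k_by_parent`, `hits`, and `parent_of` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def top_k_by_parent(
    hits: Iterable[Tuple[int, float]],
    parent_of: Dict[int, int],
    k: int,
) -> List[Tuple[int, int, float]]:
    """Keep the closest child per parent and return the k closest parents.

    hits      -- (child_doc_id, distance) pairs; smaller distance means closer
    parent_of -- hypothetical mapping from child doc id to parent doc id
    """
    best: Dict[int, Tuple[int, float]] = {}
    for doc_id, dist in hits:
        parent = parent_of[doc_id]
        # Remember only the best-scoring child seen so far for each parent.
        if parent not in best or dist < best[parent][1]:
            best[parent] = (doc_id, dist)
    # Rank parents by the distance of their best child.
    ranked = sorted(best.items(), key=lambda item: item[1][1])
    return [(parent, doc_id, dist) for parent, (doc_id, dist) in ranked[:k]]
```

Deduplicating after the search, as this sketch does, can return fewer than k parents when several hits share a parent; per the commit message, applying the filter inside the search method is what lets it return k results.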
heemin32 committed Jan 19, 2024
1 parent 7900dbb commit 8e5005d
Showing 46 changed files with 3,333 additions and 99 deletions.
9 changes: 9 additions & 0 deletions .github/workflows/CI.yml
@@ -38,6 +38,15 @@ jobs:
  with:
    submodules: true

# Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
- name: Apply Git Patch
  # Deleting file at the end to skip `git apply` inside CMAKE file
  run: |
    cd jni/external/faiss
    git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
    rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
  working-directory: ${{ github.workspace }}

- name: Setup Java ${{ matrix.java }}
  uses: actions/setup-java@v1
  with:
9 changes: 9 additions & 0 deletions .github/workflows/test_security.yml
@@ -38,6 +38,15 @@ jobs:
  with:
    submodules: true

# Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
- name: Apply Git Patch
  # Deleting file at the end to skip `git apply` inside CMAKE file
  run: |
    cd jni/external/faiss
    git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
    rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
  working-directory: ${{ github.workspace }}

- name: Setup Java ${{ matrix.java }}
  uses: actions/setup-java@v1
  with:
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -15,6 +15,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.12...2.x)
### Features
* Add parent join support for lucene knn [#1182](https://github.com/opensearch-project/k-NN/pull/1182)
* Add parent join support for faiss hnsw [#1398](https://github.com/opensearch-project/k-NN/pull/1398)
### Enhancements
* Increase Lucene max dimension limit to 16,000 [#1346](https://github.com/opensearch-project/k-NN/pull/1346)
* Tuned default values for ef_search and ef_construction for better indexing and search performance for vector search [#1353](https://github.com/opensearch-project/k-NN/pull/1353)
7 changes: 7 additions & 0 deletions DEVELOPER_GUIDE.md
@@ -229,6 +229,13 @@ For users that want to get the most out of the libraries, they should follow [th
and build the libraries from source in their production environment, so that if their environment has optimized
instruction sets, they take advantage of them.

### Custom patch on JNI Library
To apply a custom patch to the JNI library:
1. Make your change on top of the current version of the JNI library and commit it locally.
2. Create a patch file for the change using `git format-patch -o patches HEAD^`.
3. Place the patch file under `jni/patches`.
4. Update `jni/CMakeLists.txt` and `.github/workflows/CI.yml` to apply the patch during the build.

## Run OpenSearch k-NN

### Run Single-node Cluster Locally
56 changes: 56 additions & 0 deletions benchmarks/perf-tool/README.md
@@ -270,6 +270,26 @@ Ingests a dataset of multiple context types into the cluster.
| ----------- | ----------- | ----------- |
| took | Total time to ingest the dataset into the index.| ms |

#### ingest_nested_field

Ingests a dataset with a nested field into the cluster.

##### Parameters

| Parameter Name | Description | Default |
| ----------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- |
| index_name | Name of index to ingest into | No default |
| field_name | Name of field to ingest into | No default |
| dataset_path | Path to data-set | No default |
| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default |
| attribute_spec | Definition of attributes, in the format `[{ name: [name_val], type: [type_val]}]`. Order is important and must match the order of the attribute columns in the dataset file. It must contain `{ name: 'parent_id', type: 'int'}`. | No default |

##### Metrics

| Metric Name | Description | Unit |
| ----------- | ----------- | ----------- |
| took | Total time to ingest the dataset into the index.| ms |
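
As a rough illustration of what a dataset with a `parent_id` attribute implies for ingestion, the sketch below groups vectors into one document per parent, with the vectors held in a nested field. It is an assumption about the document shape, not the perf-tool's actual code; `group_into_nested_docs` and the field names `nested_field` and `vector` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_into_nested_docs(
    vectors: Sequence[Sequence[float]],
    parent_ids: Sequence[int],
    nested_path: str = "nested_field",  # hypothetical nested field name
    vector_field: str = "vector",       # hypothetical vector field name
) -> Dict[int, dict]:
    """Build one document per parent id, holding its vectors as nested objects."""
    grouped: Dict[int, List[dict]] = defaultdict(list)
    for vec, parent in zip(vectors, parent_ids):
        # Each child vector becomes one nested object under its parent document.
        grouped[parent].append({vector_field: list(vec)})
    return {parent: {nested_path: children} for parent, children in grouped.items()}
```

Each value in the returned dictionary could then serve as the source of one indexed document, keyed by its parent ID.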

#### query

Runs a set of queries against an index.
@@ -330,6 +350,36 @@ Runs a set of queries with filter against an index.
| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |


#### query_nested_field

Runs a set of queries with a nested field against an index.

##### Parameters

| Parameter Name | Description | Default |
| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| k | Number of neighbors to return on search | 100 |
| r | r value in Recall@R | 1 |
| index_name | Name of index to search | No default |
| field_name | Name of field to search | No default |
| calculate_recall | Whether to calculate recall values | False |
| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
| dataset_path | Path to dataset | No default |
| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' |
| neighbors_path | Path to neighbors dataset | No default |
| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default |
| query_count | Number of queries to create from data-set | Size of the data-set |

##### Metrics

| Metric Name | Description | Unit |
| ----------- | ----------- | ----------- |
| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 |
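
The two recall metrics can be sketched for a single query as follows. This is one reading of the definitions in the table above, not the perf-tool's implementation; the function names and the choice of denominator (capped by the number of ground-truth neighbors available) are assumptions.

```python
from typing import Sequence


def recall_at_k(returned: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the top-k ground-truth neighbors found in the k returned results."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    found = sum(1 for doc_id in returned[:k] if doc_id in truth)
    return found / len(truth)


def recall_at_r(returned: Sequence[int], ground_truth: Sequence[int], r: int, k: int) -> float:
    """Fraction of the top-r ground-truth neighbors found in the k returned results."""
    top_r = ground_truth[:r]
    if not top_r:
        return 0.0
    returned_set = set(returned[:k])
    found = sum(1 for doc_id in top_r if doc_id in returned_set)
    return found / len(top_r)
```

For a nested field, the IDs being compared would be parent document IDs, since results are deduplicated by parent.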

#### get_stats

Gets the index stats.
@@ -369,6 +419,12 @@ python add-filters-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dat

After that new dataset(s) can be referred from testcase definition in `ingest_extended` and `query_with_filter` steps.

To generate a dataset with parent doc IDs from a vectors-only dataset, use the following command pattern:
```commandline
python add-parent-doc-id-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dataset_with_parent_id>
```
This also generates a neighbours dataset. The new dataset(s) can be referenced from the test case definition in the `ingest_nested_field` and `query_nested_field` steps.
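
To give a sense of what such a generation step involves, here is a minimal sketch that assigns parent IDs to vectors and computes brute-force ground-truth neighbours at the parent level. It is not the logic of `add-parent-doc-id-to-dataset.py`; the fixed group size, the L2 distance, and the function names are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def assign_parent_ids(num_vectors: int, children_per_parent: int) -> List[int]:
    """Assign consecutive vectors to the same parent (assumed fixed group size)."""
    return [i // children_per_parent for i in range(num_vectors)]


def nearest_parents(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    parent_ids: Sequence[int],
    k: int,
) -> List[int]:
    """Brute-force ground truth: the k parents whose closest child is nearest to the query."""
    best: Dict[int, float] = {}
    for vec, parent in zip(vectors, parent_ids):
        dist = math.dist(query, vec)  # Euclidean (L2) distance, Python 3.8+
        if parent not in best or dist < best[parent]:
            best[parent] = dist
    return sorted(best, key=best.get)[:k]
```

A neighbours dataset built this way contains parent IDs rather than individual vector IDs, which matches results that are deduplicated by parent.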

## Contributing

### Linting