Add parent join support for faiss hnsw (opensearch-project#1398)

* Add patch to support multi vector in faiss (opensearch-project#1358)

Signed-off-by: Heemin Kim <[email protected]>

* Initialize id_map as null (opensearch-project#1363)

Signed-off-by: Heemin Kim <[email protected]>

* Add support of multi vector in jni (opensearch-project#1364)

Signed-off-by: Heemin Kim <[email protected]>

* Multi vector support for Faiss HNSW (opensearch-project#1371)

Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.

Signed-off-by: Heemin Kim <[email protected]>

* Add data generation script for nested field (opensearch-project#1388)

Signed-off-by: Heemin Kim <[email protected]>

* Add perf test for nested field (opensearch-project#1394)

Signed-off-by: Heemin Kim <[email protected]>

---------

Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 709b448)
heemin32 · Jan 19, 2024 · 8e5005d
1 parent 7900dbb
commit 8e5005d
Showing 46 changed files with 3,333 additions and 99 deletions.
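
The parent join behaviour described in the commit message (deduplicate hits by parentId while still returning k results) can be illustrated with a minimal Python sketch. This is not the Faiss patch itself, which is C++ and works inside the HNSW search. `top_k_parents` and the `(doc_id, parent_id, distance)` tuple layout are hypothetical names chosen for illustration, and the brute-force pass over candidates stands in for the graph traversal.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit layout for illustration: (doc_id, parent_id, distance).
Hit = Tuple[int, int, float]


def top_k_parents(hits: Iterable[Hit], k: int) -> List[Hit]:
    """Return at most k hits, one per parent, ordered by ascending distance.

    For each parent only the closest child vector is kept, so several
    nested vectors of the same document cannot occupy more than one slot.
    """
    best_per_parent: Dict[int, Hit] = {}
    for doc_id, parent_id, distance in hits:
        current = best_per_parent.get(parent_id)
        if current is None or distance < current[2]:
            best_per_parent[parent_id] = (doc_id, parent_id, distance)
    # Pick the k closest parents among the deduplicated candidates.
    return heapq.nsmallest(k, best_per_parent.values(), key=lambda h: h[2])


if __name__ == "__main__":
    candidates = [
        (0, 100, 0.10),
        (1, 100, 0.05),
        (2, 200, 0.30),
        (3, 300, 0.20),
        (4, 300, 0.25),
    ]
    for hit in top_k_parents(candidates, k=2):
        print(hit)
```

In this example a plain top-2 by distance would return two children of parent 100. With deduplication, the two slots go to the closest child of parent 100 and the closest child of parent 300.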
diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
@@ -38,6 +38,15 @@ jobs:
         with:
           submodules: true
 
+      # Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
+      - name: Apply Git Patch
+        # Deleting file at the end to skip `git apply` inside CMAKE file
+        run: |
+          cd jni/external/faiss
+          git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+          rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+        working-directory: ${{ github.workspace }}
+
       - name: Setup Java ${{ matrix.java }}
         uses: actions/setup-java@v1
         with:

diff --git a/.github/workflows/test_security.yml b/.github/workflows/test_security.yml
@@ -38,6 +38,15 @@ jobs:
         with:
           submodules: true
 
+        # Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
+      - name: Apply Git Patch
+        # Deleting file at the end to skip `git apply` inside CMAKE file
+        run: |
+          cd jni/external/faiss
+          git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+          rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
+        working-directory: ${{ github.workspace }}
+
       - name: Setup Java ${{ matrix.java }}
         uses: actions/setup-java@v1
         with:

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -15,6 +15,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 ## [Unreleased 2.x](https://github.com/opensearch-project/k-NN/compare/2.12...2.x)
 ### Features
 * Add parent join support for lucene knn [#1182](https://github.com/opensearch-project/k-NN/pull/1182)
+* Add parent join support for faiss hnsw [#1398](https://github.com/opensearch-project/k-NN/pull/1398)
 ### Enhancements
 * Increase Lucene max dimension limit to 16,000 [#1346](https://github.com/opensearch-project/k-NN/pull/1346)
 * Tuned default values for ef_search and ef_construction for better indexing and search performance for vector search [#1353](https://github.com/opensearch-project/k-NN/pull/1353)

diff --git a/DEVELOPER_GUIDE.md b/DEVELOPER_GUIDE.md
@@ -229,6 +229,13 @@ For users that want to get the most out of the libraries, they should follow [th
 and build the libraries from source in their production environment, so that if their environment has optimized 
 instruction sets, they take advantage of them.
 
+### Custom patch on JNI Library
+To apply a custom patch to a JNI library:
+1. Make a change on top of the current version of the JNI library and commit it locally.
+2. Create a patch file for the change using `git format-patch -o patches HEAD^`
+3. Place the patch file under `jni/patches`
+4. Update `jni/CMakeLists.txt` and `.github/workflows/CI.yml` to apply the patch during the build
+
 ## Run OpenSearch k-NN
 
 ### Run Single-node Cluster Locally

diff --git a/benchmarks/perf-tool/README.md b/benchmarks/perf-tool/README.md
@@ -270,6 +270,26 @@ Ingests a dataset of multiple context types into the cluster.
 | ----------- | ----------- | ----------- |
 | took | Total time to ingest the dataset into the index.| ms |
 
+#### ingest_nested_field
+
+Ingests a dataset with a nested field into the cluster.
+
+##### Parameters
+
+| Parameter Name | Description                                                                                                                                                                                                      | Default |  
+| ----------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- |
+| index_name | Name of index to ingest into                                                                                                                                                                                     | No default |
+| field_name | Name of field to ingest into                                                                                                                                                                                     | No default |
+| dataset_path | Path to data-set                                                                                                                                                                                                 | No default |
+| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset                                                                                                                                               | No default |
+| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val]}]. Order is important and must match order of attributes column in dataset file. It should contain { name: 'parent_id', type: 'int'}  | No default |
+
+##### Metrics
+
+| Metric Name | Description | Unit |  
+| ----------- | ----------- | ----------- |
+| took | Total time to ingest the dataset into the index.| ms |
+
 #### query
 
 Runs a set of queries against an index.
@@ -330,6 +350,36 @@ Runs a set of queries with filter against an index.
 | recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
 | recall@K | ratio of results returned that were ground truth nearest neighbors  | float 0.0-1.0 |
 
+
+#### query_nested_field
+
+Runs a set of queries with a nested field against an index.
+
+##### Parameters
+
+| Parameter Name | Description                                                                                                                                                                                                                               | Default              |  
+| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
+| k | Number of neighbors to return on search                                                                                                                                                                                                   | 100                  |
+| r | r value in Recall@R                                                                                                                                                                                                                       | 1                    |
+| index_name | Name of index to search                                                                                                                                                                                                                   | No default           |
+| field_name | Name of field to search                                                                                                                                                                                                                   | No default           |
+| calculate_recall | Whether to calculate recall values                                                                                                                                                                                                        | False                |
+| dataset_format | Format the dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs.                                                                              | 'hdf5'               |
+| dataset_path | Path to dataset                                                                                                                                                                                                                           | No default           |
+| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann are supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs.                                                                    | 'hdf5'               |
+| neighbors_path | Path to neighbors dataset                                                                                                                                                                                                                 | No default           |
+| neighbors_dataset | Name of filter dataset inside the neighbors dataset                                                                                                                                                                                       | No default           |
+| query_count | Number of queries to create from data-set                                                                                                                                                                                                 | Size of the data-set |
+
+##### Metrics
+
+| Metric Name | Description | Unit |  
+| ----------- | ----------- | ----------- |
+| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms |
+| memory_kb | Native memory k-NN is using at the end of the query workload | KB |
+| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 |
+| recall@K | ratio of results returned that were ground truth nearest neighbors  | float 0.0-1.0 |
+
 #### get_stats
 
 Gets the index stats.
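
The recall metrics listed for `query_nested_field` can be sketched in Python as follows. This is an illustrative reading of the metric descriptions, not the perf-tool's own code. `recall_at_r` is a hypothetical helper, and it assumes that each query's results and ground truth are lists of unique ids ordered from nearest to farthest. Under this reading, recall@K is the same computation with `r = k`.

```python
from typing import Sequence


def recall_at_r(results: Sequence[Sequence[int]],
                ground_truth: Sequence[Sequence[int]],
                r: int, k: int) -> float:
    """Fraction of the top-r ground-truth ids found in the top-k results,
    aggregated over all queries.

    Assumes ids within one result list are unique; duplicates would be
    counted more than once.
    """
    correct = 0
    total = 0
    for returned, truth in zip(results, ground_truth):
        top_truth = set(truth[:r])
        correct += sum(1 for doc_id in returned[:k] if doc_id in top_truth)
        total += len(top_truth)
    return correct / total if total else 0.0


if __name__ == "__main__":
    results = [[1, 2, 3], [4, 5, 6]]        # ids returned per query
    ground_truth = [[1, 3, 9], [7, 4, 5]]   # true neighbors per query
    print("recall@1:", recall_at_r(results, ground_truth, r=1, k=3))
    print("recall@K:", recall_at_r(results, ground_truth, r=3, k=3))
```

For the sample data, recall@1 is 0.5 (only the first query returns its nearest true neighbor) and recall@K is 4/6, since each query returns two of its three true neighbors.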
@@ -369,6 +419,12 @@ python add-filters-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dat
 
 After that new dataset(s) can be referred from testcase definition in `ingest_extended` and `query_with_filter` steps.
 
+To generate a dataset with parent doc ids from a vectors-only dataset, use the following command pattern:
+```commandline
+python add-parent-doc-id-to-dataset.py <path_to_dataset_with_vectors> <path_of_new_dataset_with_parent_id>
+```
+This also generates a neighbors dataset. The new dataset(s) can be referenced from the test case definition in the `ingest_nested_field` and `query_nested_field` steps.
+
 ## Contributing 
 
 ### Linting