Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: Add AbstractRetriverTool, VectorDBTool, NeuralSparseTools #40

Merged
merged 15 commits into from
Dec 22, 2023

Conversation

zhichao-aws
Copy link
Member

@zhichao-aws zhichao-aws commented Dec 20, 2023

Description

Move the AbstractRetriverTool, VectorDBTool, NeuralSparseTools from ml-commons agent_framework_dev branch to skills.

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Copy link

codecov bot commented Dec 21, 2023

Welcome to Codecov 🎉

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered ☂️

Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws
Copy link
Member Author

UT are finished, the integration test are still blocked by an exception during cluster bootstrap:

org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: Plugin [skills] cannot extend non-extensible plugin [opensearch-ml]

This issue depends on the ml-commons plugin. I think we can review and merge this PR first to unblock other developers. After ml-commons fix the issue, we can add integration test in follow up PRs.

public static final String INDEX_FIELD = "index";
public static final String SOURCE_FIELD = "source_field";
public static final String DOC_SIZE_FIELD = "doc_size";
public static final int DEFAULT_DOC_SIZE = 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to control the matching docs size? I remember it's 10 by default in search API. Are we giving 2 randomly or there's a reason behind this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number is inherited from origin VectorDBTool. I guess the reason maybe we want to limit the context size for LLM. I'm open for changing this number

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's confirm and add a comment for this field. I'm also curious to know more about this field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The origin VectorDBTool was introduced from the agent framework POC commit. opensearch-project/ml-commons@1b85cff#diff-b353ce1eda5b942809ea500dadcfda5769504edd8fa17fc174d498937097c69eR67 Hi @ylwu-amzn, do you know why the default doc size is 2? Do we set this for context length?

By the way I don't think this is a blocker issue for this PR. This parameter is configurable, and we can alter its value after our e2e test.

for (int i = 0; i < hits.length; i++) {
SearchHit hit = hits[i];
Map<String, Object> docContent = new HashMap<>();
docContent.put("_index", hit.getIndex());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any specific reason to extract these fields instead of deserialize the whole searchResponse as result?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/opensearch-project/OpenSearch/blob/2c8ee1947b55a1cc5bc1a114b82e3a3b8a99851e/server/src/main/java/org/opensearch/search/SearchHit.java#L616
If we just deserialize the object, the unnecessary fields may increase the size of context. E.g. nestedIdentity, version, seqNo, primaryTerm etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result looks like in jsonl format, is this as expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this result will be treated as a norm string and be sent to LLM.

zhichao-aws and others added 4 commits December 21, 2023 17:27
Co-authored-by: zane-neo <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Copy link
Contributor

@mingshl mingshl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-aws overall it looks good to me even though the codecov is failing.

I wrote some tests to increase code coverage for AbstactRetrieverTool, we can merge this PR first and I will help contribute the codecov in a new PR.

mingshl/ml-commons@93c729a

build.gradle Outdated
@@ -165,6 +170,8 @@ compileJava {
options.compilerArgs.addAll(["-processor", 'lombok.launch.AnnotationProcessorHider$AnnotationProcessor'])
}

forbiddenApisTest.ignoreFailures = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment why do we need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's inherited from ml-commons build script. I try to build without this line and get errors like

Forbidden annotation use: org.junit.Test [defaultMessage Just name your test method testFooBar]
  in org.opensearch.agent.tools.NeuralSparseSearchToolTests (NeuralSparseSearchToolTests.java, annotation on method declaration of 'testCreateTool()')

I tried to do some research but can not get any useful information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if I removes the @test annotation, it just can not find the test case

@@ -82,7 +82,7 @@ configurations {
zipArchive
all {
resolutionStrategy {
force "org.mockito:mockito-core:5.8.0"
force "org.mockito:mockito-core:${versions.mockito}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will be the current version in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build.gradle Outdated
testLogging {
exceptionFormat "full"
events "skipped", "passed", "failed" // "started"
showStandardStreams true
}
include '**/*Tests.class'
systemProperty 'tests.security.manager', 'false'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment what is this for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns off the security manager to make it easy for writing test case. It is used in many build script of Opensearch plugins. However we don't have such test cases currently. So I just removed this line and we can add it back when we need it.

public static final String INDEX_FIELD = "index";
public static final String SOURCE_FIELD = "source_field";
public static final String DOC_SIZE_FIELD = "doc_size";
public static final int DEFAULT_DOC_SIZE = 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's confirm and add a comment for this field. I'm also curious to know more about this field.

}
listener.onResponse((T) contextBuilder.toString());
} else {
listener.onResponse((T) "Can not get any match from search result.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a customer facing response? If yes, how about: "No matches found. Please refine your search terms."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my mental model the retrieval tools are used for RAG, so their response will be parsed by LLM instead of human. We don't have a tool-retry logic now, the "refine your search terms" may confuse LLM

build.gradle Outdated Show resolved Hide resolved
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
@zane-neo zane-neo merged commit c088f77 into opensearch-project:main Dec 22, 2023
3 of 11 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/skills/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/skills/backport-2.x
# Create a new branch
git switch --create backport-40-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 c088f7793c3e93ba0488b0be8bacc49c7195682a
# Push it to GitHub
git push --set-upstream origin backport-40-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/skills/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport-40-to-2.x.

zhichao-aws pushed a commit that referenced this pull request Dec 25, 2023
feature: Add AbstractRetriverTool, VectorDBTool, NeuralSparseTools
(cherry picked from commit c088f77)
zhichao-aws added a commit that referenced this pull request Dec 25, 2023
…lSparseTools (#58)

* Merge pull request #40 from zhichao-aws/SearchTools

feature: Add AbstractRetriverTool, VectorDBTool, NeuralSparseTools
(cherry picked from commit c088f77)

* fix commons-lang3 version (#45) (#59)

Signed-off-by: zhichao-aws <[email protected]>

---------

Signed-off-by: zhichao-aws <[email protected]>
Co-authored-by: zane-neo <[email protected]>
yuye-aws pushed a commit to yuye-aws/skills that referenced this pull request Apr 26, 2024
…lSparseTools (opensearch-project#58)

* Merge pull request opensearch-project#40 from zhichao-aws/SearchTools

feature: Add AbstractRetriverTool, VectorDBTool, NeuralSparseTools
(cherry picked from commit c088f77)

* fix commons-lang3 version (opensearch-project#45) (opensearch-project#59)

Signed-off-by: zhichao-aws <[email protected]>

---------

Signed-off-by: zhichao-aws <[email protected]>
Co-authored-by: zane-neo <[email protected]>
Signed-off-by: yuye-aws <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants