Optimizing integ tests for less model upload calls #683

martin-gaievski · 2024-04-10T21:02:31Z

Description

Refactored integration tests to minimize number of local model uploads/deployments. This is a first step in attempt to solve test failures in distribution build pipeline. With multiple model redeployments at some point we're reaching the state when memory circuit breaker on ml-commons side opens and next model upload call is failing.

There is a separate issue opened for ml-commons for possible long term fix on their side opensearch-project/ml-commons#2308. As of now there are no plans to work on it, as per their suggestions I'm trying to optimize tests on neural-search side.

In this PR I'm doing following:

replacing knn vector search queries with other types of queries, mainly bm25 text searchers
merging simple tests into a single method with multiple search queries. This should help as such single test method uses one model, while for multiple smaller methods there will be many model uploads. Now number of integ test methods lowered 64 -> 57.
increasing threshold for amount of memory that can be allocated for JVM heap in ml-commons. Default is 85%, I'm setting it to 95%. I tested it with 100%, it gives worse results, I assume that's because memory cannot be allocated to some code paths with native code.

Tested in copy of infra environment, tests are passing:

2024-04-10 22:12:51 INFO     ===============================================
2024-04-10 22:12:51 INFO     Running integration tests for neural-search without-security
2024-04-10 22:12:51 INFO     ===============================================
2024-04-10 22:12:51 INFO     Executing "bash /local/home/dev/opensearch/opensearch-build/scripts/default/integtest.sh -b localhost -p 9200 -s false -v 3.0.0" in /tmp/tmpvo11uccc/neural-search
2024-04-10 22:16:48 INFO     Recording component test results for neural-search at /local/home/dev/opensearch/opensearch-build/test-results/32ef973a0cb743f5bf2f8c29df105e83/integ-test/neural-search/without-security
2024-04-10 22:16:48 INFO     Stderr reported for component: neural-search
2024-04-10 22:16:48 INFO     Note: /tmp/tmpvo11uccc/neural-search/src/main/java/org/opensearch/neuralsearch/processor/combination/ScoreCombinationUtil.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Apr 10, 2024 10:13:26 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

2024-04-10 22:16:48 INFO     Sending SIGKILL to PID 2755
2024-04-10 22:16:48 INFO     Process killed with exit code None
2024-04-10 22:16:48 INFO     Recording local cluster logs for neural-search with test configuration as without-security at /local/home/dev/opensearch/opensearch-build/test-results/32ef973a0cb743f5bf2f8c29df105e83/integ-test/neural-search/without-security/local-cluster-logs/id-0
2024-04-10 22:16:48 INFO     Cleanup /tmp/tmpvo11uccc/1/local-test-cluster/opensearch-3.0.0 content after the test
2024-04-10 22:16:48 INFO     Walking tree from /local/home/dev/opensearch/opensearch-build/test-results/32ef973a0cb743f5bf2f8c29df105e83/integ-test/neural-search/without-security
2024-04-10 22:16:48 INFO     Removing /tmp/tmpvo11uccc
2024-04-10 22:16:48 INFO     | neural-search        | without-security     | PASS  |

Issues Resolved

#667

Check List

All tests pass
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

martin-gaievski · 2024-04-10T23:43:50Z

Rolling tests BWC are failing for Linux due to some issue in Lucene/core that is unrelated to this change. Currently same issue exists for main branch, created new issue to track those failures: #688

 [2024-04-10T23:36:50,728][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [neuralSearchBwcCluster-rolling-0] uncaught exception in thread [main]
  org.opensearch.bootstrap.StartupException: OpenSearchException[failed to bind service]; nested: IndexFormatTooNewException[Format version is not supported (resource BufferedChecksumIndexInput(NIOFSIndexInput(path="/local/home/gaievski/dev/opensearch/        neural-search/qa/rolling-upgrade/build/testclusters/neuralSearchBwcCluster-rolling-0/data/nodes/0/_state/_2o.cfs") [slice=_2o.fnm])): 1 (needs to be between 0 and 0)];
          at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  Caused by: org.opensearch.OpenSearchException: failed to bind service
          at org.opensearch.node.Node.<init>(Node.java:1331) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.node.Node.<init>(Node.java:413) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          ... 6 more
  Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(NIOFSIndexInput(path="/local/home/gaievski/dev/opensearch/neural-search/qa/rolling-upgrade/build/testclusters/                neuralSearchBwcCluster-rolling-0/data/nodes/0/_state/_2o.cfs") [slice=_2o.fnm])): 1 (needs to be between 0 and 0)
          at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:214) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:254) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.codecs.lucene94.Lucene94FieldInfosFormat.read(Lucene94FieldInfosFormat.java:134) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:112) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:96) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:94) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:77) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:820) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:67) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:60) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.opensearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:309) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.env.NodeEnvironment.loadNodeMetadata(NodeEnvironment.java:479) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:394) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:301) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]

Signed-off-by: Martin Gaievski <[email protected]>

vibrantvarun · 2024-04-11T17:14:55Z

First of all, Great work @martin-gaievski. Awesome job on optimization.

vibrantvarun · 2024-04-11T17:20:29Z

Overall looks good to me.

I think we should add changelog of this as in future until we get permanent fix we have to be careful of uploading models in integ tests.
Can you tests with security plugin as well?

cc: @martin-gaievski

martin-gaievski · 2024-04-11T17:32:52Z

Overall looks good to me.

I think we should add changelog of this as in future until we get permanent fix we have to be careful of uploading models in integ tests.

Can you tests with security plugin as well?

cc: @martin-gaievski

For 1 - it makes sense, just I think readme file will work better than changelog. Will push commit for this.
Reg 2 - that setup I'm using - it doesn't support with security tests, cluster is not accessible. I see the risk as very low as this change is not related to security of the system or plugin

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski · 2024-04-11T19:19:44Z

Created new issue for a long term fix (switching from model per test case to a shared models strategy) #689

src/test/java/org/opensearch/neuralsearch/processor/NormalizationProcessorIT.java

Signed-off-by: Martin Gaievski <[email protected]>

ryanbogan

LGTM! This eliminates a major release blocker for release owners. For my understanding, the long term fix is so that we don't accidentally trigger the circuit breaker by adding more tests in the future right?

martin-gaievski · 2024-04-12T17:33:07Z

LGTM! This eliminates a major release blocker for release owners. For my understanding, the long term fix is so that we don't accidentally trigger the circuit breaker by adding more tests in the future right?

For long term we're still deciding what should be the fix, idea is to switch from "one model per test case" to global models concept, so once deployed one type of model can be used for all tests that needs that model type. It should be possible as we only have 3 model types and models are immutable, so no need in sync of the state between tests

opensearch-trigger-bot · 2024-04-12T18:01:39Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-683-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 7106d205a05404ab134beba595bb9da90592ba7b
# Push it to GitHub
git push --set-upstream origin backport/backport-683-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-683-to-2.x.

…t#683) * Optimizing tests for less model upload calls Signed-off-by: Martin Gaievski <[email protected]> (cherry picked from commit 7106d20)

…t#683) * Optimizing tests for less model upload calls Signed-off-by: Martin Gaievski <[email protected]> (cherry picked from commit 7106d20) Signed-off-by: Martin Gaievski <[email protected]>

* Optimizing tests for less model upload calls (cherry picked from commit 7106d20) Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski added Maintenance Add support for new versions of OpenSearch/Dashboards from upstream skip-changelog v2.14.0 labels Apr 10, 2024

martin-gaievski force-pushed the fix_model_upload_in_integ_tests branch 2 times, most recently from 5e25a0b to 072c50c Compare April 10, 2024 21:26

Optimizing tests for less model upload calls

bbfbb27

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the fix_model_upload_in_integ_tests branch from 072c50c to bbfbb27 Compare April 11, 2024 00:05

martin-gaievski added 2 commits April 11, 2024 08:55

Trying to fix bwc on linux by lowering threshold for jvm heap

821dee8

Signed-off-by: Martin Gaievski <[email protected]>

Reverting back setting for jvm heap threshold

33afe65

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski marked this pull request as ready for review April 11, 2024 17:05

martin-gaievski requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners April 11, 2024 17:05

vibrantvarun approved these changes Apr 11, 2024

View reviewed changes

Adding to readme note regarding limited model deployments

5740164

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski mentioned this pull request Apr 11, 2024

Optimize the way of deploying local models for integration tests #689

Closed

martin-gaievski mentioned this pull request Apr 11, 2024

[BUG] Failing integ test due to model is not deployed due to open memory circuit breaker #596

Closed

ryanbogan reviewed Apr 12, 2024

View reviewed changes

src/test/java/org/opensearch/neuralsearch/processor/NormalizationProcessorIT.java Outdated Show resolved Hide resolved

Fix typo in comments

dfb490f

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski requested a review from ryanbogan April 12, 2024 16:56

ryanbogan approved these changes Apr 12, 2024

View reviewed changes

martin-gaievski added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label Apr 12, 2024

junqiu-lei approved these changes Apr 12, 2024

View reviewed changes

martin-gaievski merged commit 7106d20 into opensearch-project:main Apr 12, 2024
62 of 68 checks passed

martin-gaievski mentioned this pull request Apr 12, 2024

[Backport 2.x] Optimizing integ tests for less model upload calls (#683) #691

Merged

martin-gaievski added a commit that referenced this pull request Apr 12, 2024

Optimizing integ tests for less model upload calls (#683) (#691)

5d79f90

* Optimizing tests for less model upload calls (cherry picked from commit 7106d20) Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski mentioned this pull request Apr 12, 2024

[BUG] BWC are failing for main -> 2.14.0-SNAPSHOT #688

Closed

martin-gaievski mentioned this pull request Apr 30, 2024

[ACTION NEEDED] Fix flaky integration tests at distribution level #667

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizing integ tests for less model upload calls #683

Optimizing integ tests for less model upload calls #683

martin-gaievski commented Apr 10, 2024 •

edited

Loading

martin-gaievski commented Apr 10, 2024 •

edited

Loading

vibrantvarun commented Apr 11, 2024

vibrantvarun commented Apr 11, 2024 •

edited

Loading

martin-gaievski commented Apr 11, 2024

martin-gaievski commented Apr 11, 2024

ryanbogan left a comment

martin-gaievski commented Apr 12, 2024

opensearch-trigger-bot bot commented Apr 12, 2024

Optimizing integ tests for less model upload calls #683

Optimizing integ tests for less model upload calls #683

Conversation

martin-gaievski commented Apr 10, 2024 • edited Loading

Description

Issues Resolved

Check List

martin-gaievski commented Apr 10, 2024 • edited Loading

vibrantvarun commented Apr 11, 2024

vibrantvarun commented Apr 11, 2024 • edited Loading

martin-gaievski commented Apr 11, 2024

martin-gaievski commented Apr 11, 2024

ryanbogan left a comment

Choose a reason for hiding this comment

martin-gaievski commented Apr 12, 2024

opensearch-trigger-bot bot commented Apr 12, 2024

martin-gaievski commented Apr 10, 2024 •

edited

Loading

martin-gaievski commented Apr 10, 2024 •

edited

Loading

vibrantvarun commented Apr 11, 2024 •

edited

Loading