[BUG] Regression in snapshot deletion logic #5371

Open
reta opened this issue Nov 24, 2022 · 0 comments
Labels
bug (Something isn't working), distributed framework

Collaborator

reta commented Nov 24, 2022

Describe the bug
One of the random test failures of #5219 produced an interesting stack trace (see below):

"[groupSize must be greater than 0 but was -1]; nested: IllegalArgumentException[groupSize must be greater than 0 but was -1];
	at org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:679)
	at org.opensearch.OpenSearchException.generateFailureXContent(OpenSearchException.java:607)
	at org.opensearch.rest.BytesRestResponse.build(BytesRestResponse.java:164)
	at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:125)
	at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:105)
	at org.opensearch.rest.action.RestActionListener.onFailure(RestActionListener.java:71)
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:112)
	at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$2(TransportMasterNodeAction.java:193)
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109)
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:188)
	at org.opensearch.action.ActionListener.onFailure(ActionListener.java:247)
	at org.opensearch.snapshots.SnapshotsService.failListenersIgnoringException(SnapshotsService.java:3006)
	at org.opensearch.snapshots.SnapshotsService.access$3300(SnapshotsService.java:141)
	at org.opensearch.snapshots.SnapshotsService$15.handleListeners(SnapshotsService.java:2764)
	at org.opensearch.snapshots.SnapshotsService$RemoveSnapshotDeletionAndContinueTask.clusterStateProcessed(SnapshotsService.java:2857)
	at org.opensearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:616)
	at org.opensearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:488)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
	at org.opensearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:488)
	at org.opensearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:316)
	at org.opensearch.cluster.service.MasterService.publish(MasterService.java:308)
	at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285)
	at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86)
	at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173)
	at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:174)
	at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:212)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: groupSize must be greater than 0 but was -1
	at org.opensearch.action.support.GroupedActionListener.<init>(GroupedActionListener.java:64)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupStaleIndices(BlobStoreRepository.java:1248)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupStaleBlobs(BlobStoreRepository.java:1102)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupUnlinkedRootAndIndicesBlobs(BlobStoreRepository.java:889)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$doDeleteShardSnapshots$7(BlobStoreRepository.java:854)

The issue seems to be a regression introduced by 7ccb714, triggered by an edge case in which the remote snapshot repository becomes unavailable (for example, because it was deleted). I was able to craft a test case that reproduces the issue consistently.
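
The arithmetic behind the exception can be illustrated in isolation. The sketch below is an assumption about the failure mode, not the actual `BlobStoreRepository` code: it supposes that the group size is derived as the number of index containers found in the repository minus the number of indices still referenced by snapshots. `GroupSizeSketch`, `checkGroupSize` and `naiveGroupSize` are hypothetical names; `checkGroupSize` only stands in for the validation that `GroupedActionListener` performs in its constructor.

```java
import java.util.Set;

// Hypothetical, simplified model of the failure; none of these names come from OpenSearch.
final class GroupSizeSketch {

    // Stand-in for the constructor validation that produced the message in the stack trace.
    static int checkGroupSize(int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be greater than 0 but was " + groupSize);
        }
        return groupSize;
    }

    // Assumed derivation: containers found in the repository minus indices still referenced.
    static int naiveGroupSize(Set<String> foundIndices, Set<String> survivingIndexIds) {
        return foundIndices.size() - survivingIndexIds.size();
    }

    public static void main(String[] args) {
        // Two indices are still referenced, but one container was deleted out of band.
        Set<String> found = Set.of("index-a");
        Set<String> surviving = Set.of("index-a", "index-b");
        try {
            checkGroupSize(naiveGroupSize(found, surviving));
        } catch (IllegalArgumentException e) {
            // prints "groupSize must be greater than 0 but was -1"
            System.out.println(e.getMessage());
        }
    }
}
```

Under this assumption, any out-of-band removal of an index container makes the found set smaller than the surviving set, so the difference goes negative.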

To Reproduce

@OpenSearchIntegTestCase.ClusterScope(minNumDataNodes = 2)
public class RepositoriesIT extends AbstractSnapshotIntegTestCase {
    /* .... */
    public void testSnapshotShardBlobStoreDelete() throws Exception {
        final Client client = client();
        final Path repositoryPath = randomRepoPath();
        final String repositoryName = "test-repo";

        final String fullSnapshot = "full-snapshot";
        final String firstSnapshot = "first-snapshot";
        final String secondSnapshot = "second-snapshot";
        final String firstIndexName = "test-idx-1";
        final String secondIndexName = "test-idx-2";

        int maxShardBlobDeleteBatchSize = randomIntBetween(1, 1000);
        createRepository(
            repositoryName,
            "mock",
            Settings.builder()
                .put("location", repositoryPath)
                .put(BlobStoreRepository.MAX_SNAPSHOT_SHARD_BLOB_DELETE_BATCH_SIZE.getKey(), maxShardBlobDeleteBatchSize)
        );

        // Create two indices
        createIndex(firstIndexName);
        createIndex(secondIndexName);
        ensureGreen();

        // Index a small number of documents into both indices
        final int numberOfDocs = randomIntBetween(10, 20);
        for (int j = 0; j < numberOfDocs; j++) {
            index(firstIndexName, "_doc", Integer.toString(j), "foo", "bar" + j);
            index(secondIndexName, "_doc", Integer.toString(j), "foo", "bar" + j);
        }

        refresh();
        // Create a full snapshot (all indices)
        createFullSnapshot(repositoryName, fullSnapshot);
        // Create a snapshot for first index only
        createSnapshot(repositoryName, firstSnapshot, List.of(firstIndexName));
        // Create a snapshot for second index only
        createSnapshot(repositoryName, secondSnapshot, List.of(secondIndexName));
        
        final ThreadPool threadPool = internalCluster()
                .getCurrentClusterManagerNodeInstance(ClusterService.class)
                .getClusterApplierService()
                .threadPool();
        
        final MockRepository repository = (MockRepository) internalCluster()
                .getCurrentClusterManagerNodeInstance(RepositoriesService.class)
                .repository(repositoryName);

        // List the index blob containers under "indices" and delete only the first one
        threadPool.executor(ThreadPool.Names.SNAPSHOT)
            .submit(() -> {
                final Map<String, BlobContainer> children = repository.blobStore().blobContainer(new BlobPath().add("indices")).children();
                final String blob = children.keySet().iterator().next();
                repository.blobStore().blobContainer(new BlobPath().add("indices").add(blob)).delete();
                return null;
            }).get();
        

        // Delete all snapshots
        client.admin().cluster().prepareDeleteSnapshot(repositoryName, fullSnapshot).get();
        client.admin().cluster().prepareDeleteSnapshot(repositoryName, firstSnapshot).get();
        client.admin().cluster().prepareDeleteSnapshot(repositoryName, secondSnapshot).get();
        
        assertFileCount(repositoryPath, 2);
    }
}

Fails with:

java.lang.IllegalArgumentException: groupSize must be greater than 0 but was -1
	at org.opensearch.action.support.GroupedActionListener.<init>(GroupedActionListener.java:66)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupStaleIndices(BlobStoreRepository.java:1267)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupStaleBlobs(BlobStoreRepository.java:1127)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupUnlinkedRootAndIndicesBlobs(BlobStoreRepository.java:872)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$15(BlobStoreRepository.java:855)
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80)
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341)
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120)
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$0(ListenableFuture.java:112)
	at java.util.ArrayList.forEach(ArrayList.java:1541)
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112)
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160)
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141)
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77)
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55)
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80)
	at org.opensearch.action.ActionRunnable.lambda$0(ActionRunnable.java:73)
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(Thread.java:829)

java.lang.AssertionError: java.lang.AssertionError
	at org.opensearch.repositories.blobstore.BlobStoreTestUtil.assertConsistency(BlobStoreTestUtil.java:151)
	at org.opensearch.repositories.blobstore.BlobStoreTestUtil.assertRepoConsistency(BlobStoreTestUtil.java:110)
	at org.opensearch.snapshots.AbstractSnapshotIntegTestCase.lambda$1(AbstractSnapshotIntegTestCase.java:156)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
	at org.opensearch.snapshots.AbstractSnapshotIntegTestCase.assertRepoConsistency(AbstractSnapshotIntegTestCase.java:150)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.opensearch.repositories.blobstore.BlobStoreTestUtil.assertIndexUUIDs(BlobStoreTestUtil.java:224)
	at org.opensearch.repositories.blobstore.BlobStoreTestUtil.lambda$0(BlobStoreTestUtil.java:141)
	at org.opensearch.action.ActionRunnable.lambda$0(ActionRunnable.java:73)
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more

Expected behavior
The case where the blob store does not exist should be handled (the same test case passes without the changes introduced by #5219).
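
One possible way to handle a missing container is sketched here. It uses hypothetical names and assumes that the group size is derived from the set of index containers found in the repository and the set of indices still referenced by snapshots. The idea is to count only the containers that are actually stale and to complete immediately when there are none, rather than constructing a grouped listener with a non-positive size. This is an illustration of the guard, not a proposed patch.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a defensive cleanup; names are illustrative only.
final class StaleIndexGuardSketch {

    // Containers present in the repository that no remaining snapshot references.
    static Set<String> staleIndices(Set<String> foundIndices, Set<String> survivingIndexIds) {
        Set<String> stale = new TreeSet<>(foundIndices);
        stale.removeAll(survivingIndexIds);
        return stale;
    }

    // Returns the number of stale containers handled; the result is never negative.
    static int cleanup(Set<String> foundIndices, Set<String> survivingIndexIds, Runnable onDone) {
        Set<String> stale = staleIndices(foundIndices, survivingIndexIds);
        if (stale.isEmpty()) {
            // Nothing to delete: notify the caller directly instead of grouping zero results.
            onDone.run();
            return 0;
        }
        // Real code would create a grouped listener of size stale.size() and delete each
        // container asynchronously; this sketch only counts them.
        int handled = 0;
        for (String ignored : stale) {
            handled++;
        }
        onDone.run();
        return handled;
    }
}
```

With `foundIndices = {"index-a"}` and `survivingIndexIds = {"index-a", "index-b"}`, the stale set is empty, so `cleanup` returns 0 and still invokes `onDone`.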

Plugins
N/A

Screenshots
N/A

Host/Environment (please complete the following information):

  • OS: any
  • Version: 1.3+

Additional context
N/A
