Flaky-test: PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor #20010

Closed
1 of 2 tasks
lhotari opened this issue Apr 4, 2023 · 3 comments · Fixed by #20037 · May be fixed by BewareMyPower/pulsar#24

Comments

@lhotari
Member

lhotari commented Apr 4, 2023

Search before asking

  • I searched in the issues and found nothing similar.

Example failure

https://github.com/apache/pulsar/actions/runs/4604351215/jobs/8140558256?pr=20005#step:11:1177

Exception stacktrace

  Error:  Tests run: 82, Failures: 1, Errors: 0, Skipped: 77, Time elapsed: 48.113 s <<< FAILURE! - in org.apache.pulsar.broker.service.persistent.PersistentTopicTest
  Error:  testCreateTopicWithZombieReplicatorCursor(org.apache.pulsar.broker.service.persistent.PersistentTopicTest)  Time elapsed: 0.367 s  <<< FAILURE!
  java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote
  	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
  	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
  	at org.apache.pulsar.broker.service.persistent.PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor(PersistentTopicTest.java:587)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:833)
  Caused by: java.lang.RuntimeException: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote
  	at org.apache.pulsar.broker.service.BrokerService.lambda$getReplicationClient$47(BrokerService.java:1347)
  	at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.put(ConcurrentOpenHashMap.java:409)
  	at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.computeIfAbsent(ConcurrentOpenHashMap.java:243)
  	at org.apache.pulsar.broker.service.BrokerService.getReplicationClient(BrokerService.java:1271)
  	at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$addReplicationCluster$65(PersistentTopic.java:1719)
  	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
  	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
  	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
  	at org.apache.pulsar.metadata.impl.ZKMetadataStore.handleGetResult(ZKMetadataStore.java:269)
  	at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$5(ZKMetadataStore.java:219)
  	at org.apache.zookeeper.MockZooKeeper.multi(MockZooKeeper.java:1006)
  	at org.apache.zookeeper.MockZooKeeperSession.multi(MockZooKeeperSession.java:191)
  	at org.apache.pulsar.metadata.impl.ZKMetadataStore.batchOperation(ZKMetadataStore.java:190)
  	at org.apache.pulsar.metadata.impl.batching.AbstractBatchedMetadataStore.internalBatchOperation(AbstractBatchedMetadataStore.java:184)
  	at org.apache.pulsar.metadata.impl.batching.AbstractBatchedMetadataStore.flush(AbstractBatchedMetadataStore.java:103)
  	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
  	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
  	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  	... 1 more
  Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote
  	at org.apache.pulsar.broker.service.BrokerService.lambda$getReplicationClient$46(BrokerService.java:1274)
  	at java.base/java.util.Optional.orElseThrow(Optional.java:403)
  	at org.apache.pulsar.broker.service.BrokerService.lambda$getReplicationClient$47(BrokerService.java:1274)
  	... 21 more
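
The nesting in the trace above (an `ExecutionException` wrapping a `RuntimeException` wrapping the `NotFoundException`) can be reproduced with the JDK alone. The following is a minimal sketch, not the Pulsar code: the `NotFoundException` class, the `clients` map, and the `getReplicationClient` signature are hypothetical stand-ins, and a plain `ConcurrentHashMap` replaces `ConcurrentOpenHashMap`. It only illustrates how an empty `Optional` inside a `computeIfAbsent` callback, invoked from a `thenApply` stage, surfaces at `CompletableFuture.get()`.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;

class ExceptionChainSketch {
    /** Hypothetical stand-in for MetadataStoreException.NotFoundException. */
    static class NotFoundException extends Exception {
        NotFoundException(String message) {
            super(message);
        }
    }

    /** Hypothetical client cache keyed by cluster name. */
    static final Map<String, String> clients = new ConcurrentHashMap<>();

    /** Creates a client only when cluster data is present; otherwise throws unchecked. */
    static String getReplicationClient(String cluster, Optional<String> clusterData) {
        return clients.computeIfAbsent(cluster, key -> {
            try {
                String url = clusterData.orElseThrow(() -> new NotFoundException(key));
                return "client-for-" + url;
            } catch (NotFoundException e) {
                // The callback cannot throw a checked exception, so it is wrapped.
                throw new RuntimeException(e);
            }
        });
    }

    /** Completes a metadata lookup with no data and reports the resulting cause chain. */
    static String describeFailure() {
        CompletableFuture<Optional<String>> lookup = new CompletableFuture<>();
        CompletableFuture<String> result =
                lookup.thenApply(data -> getReplicationClient("remote", data));
        lookup.complete(Optional.empty()); // the cluster data is missing
        try {
            result.get();
            return "no failure";
        } catch (ExecutionException e) {
            Throwable wrapper = e.getCause();   // the RuntimeException
            Throwable root = wrapper.getCause(); // the NotFoundException
            return wrapper.getClass().getSimpleName() + " <- "
                    + root.getClass().getSimpleName() + ": " + root.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure());
    }
}
```

In this sketch the test thread only sees the failure when it calls `get()`, even though the exception was raised on whichever thread completed the lookup, which matches the shape of the trace above.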

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@lhotari
Member Author

lhotari commented Apr 4, 2023

again: https://github.com/apache/pulsar/actions/runs/4609863808/jobs/8152451675?pr=20011#step:11:1183

  Error:  Tests run: 82, Failures: 1, Errors: 0, Skipped: 77, Time elapsed: 49.329 s <<< FAILURE! - in org.apache.pulsar.broker.service.persistent.PersistentTopicTest
  Error:  testCreateTopicWithZombieReplicatorCursor(org.apache.pulsar.broker.service.persistent.PersistentTopicTest)  Time elapsed: 0.307 s  <<< FAILURE!
  java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote
  	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
  	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
  	at org.apache.pulsar.broker.service.persistent.PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor(PersistentTopicTest.java:587)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:833)
  Caused by: java.lang.RuntimeException: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote
  	at org.apache.pulsar.broker.service.BrokerService.lambda$getReplicationClient$47(BrokerService.java:1347)
  	at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.put(ConcurrentOpenHashMap.java:409)
  	at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.computeIfAbsent(ConcurrentOpenHashMap.java:243)
  	at org.apache.pulsar.broker.service.BrokerService.getReplicationClient(BrokerService.java:1271)
  	at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$addReplicationCluster$65(PersistentTopic.java:1719)
  	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
  	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
  	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
  	at org.apache.pulsar.metadata.impl.ZKMetadataStore.handleGetResult(ZKMetadataStore.java:269)
  	at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$5(ZKMetadataStore.java:219)
  	at org.apache.zookeeper.MockZooKeeper.multi(MockZooKeeper.java:1006)
  	at org.apache.zookeeper.MockZooKeeperSession.multi(MockZooKeeperSession.java:191)
  	at org.apache.pulsar.metadata.impl.ZKMetadataStore.batchOperation(ZKMetadataStore.java:190)
  	at org.apache.pulsar.metadata.impl.batching.AbstractBatchedMetadataStore.internalBatchOperation(AbstractBatchedMetadataStore.java:184)
  	at org.apache.pulsar.metadata.impl.batching.AbstractBatchedMetadataStore.flush(AbstractBatchedMetadataStore.java:103)
  	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
  	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
  	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  	... 1 more
  Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException: remote
  	at org.apache.pulsar.broker.service.BrokerService.lambda$getReplicationClient$46(BrokerService.java:1274)
  	at java.base/java.util.Optional.orElseThrow(Optional.java:403)
  	at org.apache.pulsar.broker.service.BrokerService.lambda$getReplicationClient$47(BrokerService.java:1274)
  	... 21 more

@lhotari
Member Author

lhotari commented Apr 5, 2023

@BewareMyPower Could you take a look at this flaky test? It was introduced by your PR #19972. Thanks.

@poorbarcode
Contributor

@lhotari @BewareMyPower

I pushed PR #20025 to fix this flaky test. Please take a look. Thanks.

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 6, 2023
Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because the cursor could still be created again in `startReplicator`,
which could be called by:

```
onPoliciesUpdate
  checkReplicationAndRetryOnFailure
    checkReplication
```

### Modifications

- Call `checkReplicationCluster` before calling `startReplicator`.
- Support retrying `initialize` to see if retry works.
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 6, 2023
Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because the cursor could still be created again in `startReplicator`,
which could be called by:

```
onPoliciesUpdate
  checkReplicationAndRetryOnFailure
    checkReplication
```

### Modifications

- Call `checkReplicationCluster` before calling `startReplicator`.
- Support retrying `initialize` to see if retry works.
- Check replication cluster before creating the replication client
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 6, 2023
Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because the cursor could still be created again in `startReplicator`,
which could be called by:

```
onPoliciesUpdate
  checkReplicationAndRetryOnFailure
    checkReplication
```

### Modifications

Call `checkReplicationCluster` before calling `startReplicator`.

Since there is still a rare chance that the cluster data is empty when
the cluster still exists, return null instead of throwing a runtime
exception, then skip creating the replication client.

Use `Awaitility` to check if the cursor has been deleted eventually.
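
Checking that the cursor is deleted "eventually" means polling a condition until it holds or a timeout expires, rather than asserting once right after an asynchronous update. The sketch below is a JDK-only stand-in for that idea and is not Awaitility's API: the `eventually` helper, the in-memory cursor set, and the cursor name are all hypothetical and chosen for illustration.

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

class EventuallySketch {
    /** Polls the condition until it is true or the timeout elapses. */
    static boolean eventually(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition holding
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Simulates a cursor that is removed asynchronously, then waits for the removal. */
    static boolean demo() {
        String cursorName = "zombie-replicator-cursor"; // illustrative name only
        Set<String> cursors = ConcurrentHashMap.newKeySet();
        cursors.add(cursorName);
        CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(50); // stands in for the asynchronous policies update
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            cursors.remove(cursorName);
        });
        return eventually(() -> !cursors.contains(cursorName),
                Duration.ofSeconds(5), Duration.ofMillis(10));
    }

    public static void main(String[] args) {
        System.out.println("cursor deleted eventually: " + demo());
    }
}
```

Compared with a fixed sleep, polling returns as soon as the condition holds and only costs the full timeout when the condition never becomes true.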
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 7, 2023
Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because the cursor could still be created again in `startReplicator`,
which could be called by:

```
onPoliciesUpdate
  checkReplicationAndRetryOnFailure
    checkReplication
```

### Modifications

Call `checkReplicationCluster` before calling `startReplicator`.
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 7, 2023
Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because the cursor could still be created again in `startReplicator`,
which could be called by:

```
onPoliciesUpdate
  checkReplicationAndRetryOnFailure
    checkReplication
```

### Modifications

- Call `checkReplicationCluster` before calling `startReplicator`.
- Sleep for a while in the test to reduce the flakiness caused by the
  asynchronous update of the policies
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 7, 2023
Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because the cursor could still be created again in `startReplicator`,
which could be called by:

```
onPoliciesUpdate
  checkReplicationAndRetryOnFailure
    checkReplication
```

Sometimes the policies update might fail because the topic might be
deleted in `PersistentTopic#checkReplication`:

> Deleting topic [xxx] because local cluster is not part of global namespace repl list [remote]

### Modifications

- Call `checkReplicationCluster` before calling `startReplicator`.
- Add the local cluster to the replication cluster list
- Sleep for a while in the test to reduce the flakiness caused by the
  asynchronous update of the policies
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 11, 2023
Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because `onPoliciesUpdate` is asynchronous, while
`testCreateTopicWithZombieReplicatorCursor` updates the namespace policy
at nearly the same time, so there is a race in the order in which
`AbstractTopic#topicPolicies` is updated.

Sometimes the policies update might fail because the topic might be
deleted in `PersistentTopic#checkReplication`:

> Deleting topic [xxx] because local cluster is not part of global namespace repl list [remote]

### Modifications

- Sleep 100 ms between the two calls that update the replication clusters
- Add the local cluster to the replication cluster list
- Add the retry logic for `initialize`
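
The commit message does not show how the retry for `initialize` is written. One common way to retry an asynchronous operation is to chain a new attempt whenever the returned future fails. The sketch below assumes that shape: the `retry` helper, the attempt count, and the fake `initialize` supplier are hypothetical and not taken from the Pulsar code base.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class RetrySketch {
    /** Runs the operation and, on failure, tries again until no attempts remain. */
    static <T> CompletableFuture<T> retry(Supplier<CompletableFuture<T>> operation, int attempts) {
        CompletableFuture<CompletableFuture<T>> next = operation.get().handle((value, error) -> {
            if (error == null) {
                return CompletableFuture.completedFuture(value);
            }
            if (attempts <= 1) {
                return CompletableFuture.failedFuture(error); // out of attempts
            }
            return retry(operation, attempts - 1);
        });
        // Flatten the nested future produced by handle().
        return next.thenCompose(future -> future);
    }

    /** Fake initialize that fails twice before succeeding; returns the number of calls made. */
    static int demoAttempts() {
        AtomicInteger calls = new AtomicInteger();
        Supplier<CompletableFuture<String>> initialize = () -> {
            if (calls.incrementAndGet() < 3) {
                return CompletableFuture.failedFuture(new IllegalStateException("not ready"));
            }
            return CompletableFuture.completedFuture("initialized");
        };
        retry(initialize, 5).join();
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("attempts needed: " + demoAttempts());
    }
}
```

This sketch retries immediately. A real test helper would probably add a delay between attempts so that the asynchronous policies update has time to settle.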