Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Ignore and remove the replicator cursor when the remote cluster is absent #19972

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Mar 30, 2023

Motivation

Sometimes when a remote cluster is deleted, the replication cursor might still exist for some topics. In this case, creating producers or consumers on these topics will fail.

Here is a log observed in a production environment:

WARN org.apache.pulsar.broker.service.BrokerService - Replication or
dedup check failed. Removing topic from topics list
persistent://public/__kafka/__consumer_offsets-partition-40,
java.util.concurrent.CompletionException: java.lang.RuntimeException:
org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException:
kop

If it happened, unloading the topic or restarting the broker could not help. We have to remove the cursor manually.

Modifications

In addReplicationCluster, before getting the replication client, check the namespace policy and topic policy first. If the remote cluster does not exist, skip adding the replication client and remove the cursor.

Verifications

PersistentTopicTest#testCreateTopicWithZombieReplicatorCursor is added to verify PersistentTopic#initialize will succeed and the zombie replicator cursor will be removed.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: BewareMyPower#23

@github-actions
Copy link

@BewareMyPower Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Mar 30, 2023
Copy link
Contributor

@poorbarcode poorbarcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShadowReplicator hits this issue, too. Could you please also fix it?

@BewareMyPower
Copy link
Contributor Author

ShadowReplicator hits this issue, too. Could you please also fix it?

I think it's another issue. You can open another PR for it.

@BewareMyPower BewareMyPower marked this pull request as draft March 31, 2023 07:50
… cluster is absent

### Motivation

Sometimes when a remote cluster is deleted, the replication cursor might
still exist for some topics. In this case, creating producers or
consumers on these topics will fail.

Here is a log observed in a production environment:

> WARN  org.apache.pulsar.broker.service.BrokerService - Replication or
> dedup check failed. Removing topic from topics list
> persistent://public/__kafka/__consumer_offsets-partition-40,
> java.util.concurrent.CompletionException: java.lang.RuntimeException:
> org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException:
> kop

If it happened, unloading the topic or restarting the broker could not
help. We have to remove the cursor manually.

### Modificatons

In `addReplicationCluster`, before getting the replication client, check
the namespace policy and topic policy first. If the remote cluster does
not exist, skip adding the replication client and remove the cursor.

### Verifications

`PersistentTopicTest#testCreateTopicWithZombieReplicatorCursor` is added
to verify `PersistentTopic#initialize` will succeed and the zombie
replicator cursor will be removed.
@BewareMyPower BewareMyPower force-pushed the bewaremypower/replicator-zombie-cursors branch from 1469b85 to 8aeb37d Compare March 31, 2023 09:30
@BewareMyPower BewareMyPower marked this pull request as ready for review March 31, 2023 09:31
@BewareMyPower
Copy link
Contributor Author

Hi, @poorbarcode @315157973 Now I checked the namespace and topic policies to determine whether to delete the cursor.

@codecov-commenter
Copy link

Codecov Report

Merging #19972 (335b4d6) into master (68c10ee) will decrease coverage by 0.05%.
The diff coverage is 94.44%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #19972      +/-   ##
============================================
- Coverage     72.89%   72.85%   -0.05%     
+ Complexity    31619    31613       -6     
============================================
  Files          1861     1861              
  Lines        137356   137375      +19     
  Branches      15117    15120       +3     
============================================
- Hits         100131   100089      -42     
- Misses        29271    29329      +58     
- Partials       7954     7957       +3     
Flag Coverage Δ
inttests 24.28% <50.00%> (-0.15%) ⬇️
systests 25.00% <0.00%> (-0.04%) ⬇️
unittests 72.15% <94.44%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...sar/broker/service/persistent/PersistentTopic.java 80.15% <94.44%> (+0.24%) ⬆️

... and 77 files with indirect coverage changes

Comment on lines +1710 to +1714
.thenCompose(__ -> checkReplicationCluster(remoteCluster))
.thenCompose(clusterExists -> {
if (!clusterExists) {
log.warn("Remove the replicator because the cluster '{}' does not exist", remoteCluster);
return removeReplicator(remoteCluster).thenApply(__ -> null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we clean up when the Policy is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this comment when I merged. And yes, we should do that as well. WDYT? @poorbarcode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@315157973

Why don't we clean up when the Policy is removed?

When the policy is removed, the cursor will not be clean If this topic is not online. see

@BewareMyPower BewareMyPower merged commit d1fc732 into apache:master Apr 4, 2023
BewareMyPower added a commit that referenced this pull request Apr 4, 2023
@lhotari
Copy link
Member

lhotari commented Apr 4, 2023

This PR introduced a new flaky test, reported as #20010 . @BewareMyPower do you have a chance to take a look? thanks

@BewareMyPower
Copy link
Contributor Author

Okay, I assigned that issue to me first.

replicators.get(remoteCluster).disconnect().thenRun(() -> {

Optional.ofNullable(replicators.get(remoteCluster)).map(Replicator::disconnect)
.orElse(CompletableFuture.completedFuture(null)).thenRun(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orElseGet is better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not. See

    public static <U> CompletableFuture<U> completedFuture(U value) {
        return new CompletableFuture<U>((value == null) ? NIL : value);
    }

so completedFuture(null) won't create a new instance.

Technoboy- pushed a commit that referenced this pull request May 9, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request May 11, 2023
… cluster is absent (apache#19972)

(cherry picked from commit d1fc732)
(cherry picked from commit b93176c)
@michaeljmarshall
Copy link
Member

As discussed on the mailing list https://lists.apache.org/thread/w4jzk27qhtosgsz7l9bmhf1t7o9mxjhp, there is no plan to release 2.9.6, so I am going to remove the release/2.9.6 label

tisonkun pushed a commit to tisonkun/pulsar that referenced this pull request Jul 12, 2023
…ting logic for better handle the checkpoint. (apache#19972)

* Change the initial start cursor and stop cursor to better handle the consuming behaviors.
* Create the initial subscription instead seek every time. This should fix the wrong position setting.
* Fix the wrong stop cursor, make sure it stops at the correct space
* Drop Consumer.seek() for apache#16171
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants