Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][client] PIP-229: Add a common interface to get fields of MessageIdData #19414

Merged

Conversation

BewareMyPower
Copy link
Contributor

Master issue: #18950

Motivation

We need a common interface to get fields of the MessageIdData. After that, we won't need to assert a MessageId implementation is an instance of a specific class. And we can pass our customized MessageId implementation to APIs like acknowledge and seek.

Modifications

  • Add MessageIdAdv to get fields of MessageIdData, make all MessageId implementations inherit it (except MultiMessageIdImpl).
  • Add MessageIdAdvUtils for the most common used methods.
  • Replace BatchMessageAcker with the BitSet for ACK.
  • Remove TopicMessageIdImpl#getInnerMessageId since a TopicMessageIdImpl can be treated as its underlying MessageId implementation now.
  • Remove instanceof BatchMessageIdImpl checks in pulsar-client module by casting to MessageIdAdv.

After this refactoring, the 3rd party library will no longer need to cast a MessageId to a specific implementation. It only needs to cast MessageId to MessageIdAdv. Users can also implement their own util class so the methods of MessageIdAdvUtils are all not public.

Verifications

Add CustomMessageIdTest to verify a simple MessageIdAdv implementation that only has the (ledger id, entry id, batch idx, batch size) fields also works for seek and acknowledgment.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: BewareMyPower#19

@BewareMyPower BewareMyPower added type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages area/client type/refactor Code or documentation refactors. e.g. refactor code structure or methods to improve code readability labels Feb 3, 2023
@BewareMyPower BewareMyPower self-assigned this Feb 3, 2023
@github-actions github-actions bot added the doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. label Feb 3, 2023
@BewareMyPower BewareMyPower force-pushed the bewaremypower/pip-229-msg-id-adv branch 2 times, most recently from f02339d to 6d1f61b Compare February 14, 2023 02:25
@BewareMyPower BewareMyPower force-pushed the bewaremypower/pip-229-msg-id-adv branch 2 times, most recently from 2270912 to 6863357 Compare February 21, 2023 09:25
@BewareMyPower BewareMyPower force-pushed the bewaremypower/pip-229-msg-id-adv branch from 6863357 to 476e7e8 Compare March 6, 2023 12:11
@BewareMyPower BewareMyPower added this to the 3.0.0 milestone Mar 10, 2023
…ssageIdData

Master issue: apache#18950

### Motivation

We need a common interface to get fields of the MessageIdData. After
that, we won't need to assert a MessageId implementation is an instance
of a specific class. And we can pass our customized MessageId
implementation to APIs like `acknowledge` and `seek`.

### Modifications

- Add `MessageIdAdv` to get fields of `MessageIdData`, make all
  MessageId implementations inherit it (except `MultiMessageIdImpl`).
- Add `MessageIdAdvUtils` for the most common used methods.
- Replace `BatchMessageAcker` with the `BitSet` for ACK.
- Remove `TopicMessageIdImpl#getInnerMessageId` since a
  `TopicMessageIdImpl` can be treated as its underlying `MessageId`
  implementation now.
- Remove `instanceof BatchMessageIdImpl` checks in `pulsar-client`
  module by casting to `MessageIdAdv`.

After this refactoring, the 3rd party library will no longer need to
cast a `MessageId` to a specific implementation. It only needs to cast
`MessageId` to `MessageIdAdv`. Users can also implement their own util
class so the methods of `MessageIdAdvUtils` are all not public.

### Verifications

Add `CustomMessageIdTest` to verify a simple MessageIdAdv implementation
that only has the (ledger id, entry id, batch idx, batch size) fields
also works for seek and acknowledgment.
@BewareMyPower BewareMyPower force-pushed the bewaremypower/pip-229-msg-id-adv branch from 476e7e8 to 64185ac Compare March 15, 2023 08:02
@BewareMyPower
Copy link
Contributor Author

ping @congbobo184 @eolivelli @gaoran10 @codelipenghui @tisonkun @Demogorgon314 @Technoboy-

@Demogorgon314 Demogorgon314 reopened this Apr 3, 2023
@BewareMyPower BewareMyPower merged commit 8c50a6c into apache:master Apr 4, 2023
@rdhabalia
Copy link
Contributor

@BewareMyPower this is a mess. We had discussed multiple times in Pulsar community to avoid any util class for MessageID because it hurts and makes complicated when we want to integrate any other components (eg: storm/spark adapter, Functions). This discussion happened many times and we rejected it.
I am not in favor of MessageIdAdvUtils and I don't think this is a good idea to implement in this way.

*
* @return the ledger ID
*/
long getLedgerId();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh.. this is wrong. we should not expose ledgerID/entryID in any public interface. we are opening a can of worms and we had prevented many times in Pulsar community. we can't check all the times for all the PRs but this PR must be reverted back to avoid exposing any internal details of Pulsar. It will be difficult in future to support this API if storage layer will not be bookie or bookie API will be changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal has been discussed and modified into the current version of this PR in https://lists.apache.org/thread/25rzflmkfmvxhf3my0ombnbpv7bvgy32 and received 3 binding +1s and 2 non-binding +1s in https://lists.apache.org/thread/kmjq6lf1f11mf6qb8onhnlr17n27fcv4.

The PIP is here: #18950

As I've explained in the PIP:

These details might be not much useful to application users, but they are important to developers of Pulsar and its ecosystems.

It will be difficult in future to support this API if storage layer will not be bookie or bookie API will be changed.

Currently MessageId is still exposed to users, which does not expose these internal details. The new MessageIdAdv interface is only used for covenience. Users should write the following code after this PR:

MessageIdAdv msgIdAdv = (MessageIdAdv) msgId;
if (msgIdAdv != null) {
    // Get the ledger id, entry id directly
} else {
    // Handle the case when the storage layer might change in future
}

Before this PR, there is much code like:

MessageIdImpl impl = (MessageIdImpl) msgId;
if (impl != null) {
    // Check if msgId is another implementation...
} else {
    // ...
}

You can also see these code references in the changes of this PR. If you're going to revert this PIP, please at least start the discussion in the mail list to hear more voices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not let users use ledgerId/entryId. what if we replace bookkeeper with other some other storage? in that case, such codebase will be broken and exposing internal data and encouraging users to use it not a good design decision. I didn't review this PIP before but such discussion happened multiple times in past and Pulsar community had rejected this proposal in past. As we have not done any release yet, so, it's better to revert this PR soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not let users use ledgerId/entryId.

It's an ideal assumption. In real world, there are a lot of code references that use ledgerId/entryId from the MessageId. They have to hack into the pulsar-client module and cast MessageId to a specific implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we replace bookkeeper with other some other storage?

When users need to access the detailed fields, they should have assumed the storage is BookKeeper. Otherwise, there is no need to get the storage details from the MessageId.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact is that Pulsar provides a hard-to-use interface MessageId. Then users, including ecosystem developers, have to dig into the implementation details. I don't like saying that these developers use Pulsar in a wrong way. I'd like to say they are limited by the poor interface that Pulsar provided. Keeping the limitation is harmful to the Pulsar community and could make ecosystem applications flaky and easy to break.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it takes a long time to correct it and I don't want to see it again.

The point is, now it's open sourced and many more external applications need such ability to access these internal fields. When Pulsar was not open sourced, the use cases were limited. But things are different now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much time to keep debating on this. And it already spent me too much time to clarify the motivation and find the code examples.

Let's use [email protected] to make decisions and hear more voices.

Copy link
Contributor

@rdhabalia rdhabalia Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Pulsar has poor interface or has any limitation. MessageId is just a reference from Pulsar and it should have correct serialization and deserialization methods. I don't know any system which encourages or provide a contract to extract messageId and depend on internal.
So, I don't think it's fair to say that Pulsar has limitations and it has poor APIs or interfaces. Abstraction is an important part of API contract and that's what Pulsar is doing. why keeping MessageID abstract from user can break application? and that's the exact reason why user application should not depend on such abstraction and should not try to hack that abstraction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Pulsar was not open sourced, the use cases were limited. But things are different now.

It's not about usecases but it's about learning to do the right things.

@BewareMyPower
Copy link
Contributor Author

BewareMyPower commented Apr 12, 2023

Pulsar Functions converts the MessageId to MessageIdImpl to get the ledger ID and entry ID:

MessageIdImpl msgId = MessageIdImpl.convertToMessageIdImpl(messageId);
long ledgerId = msgId.getLedgerId();
long entryId = msgId.getEntryId();

Kafka Connect Sink even uses reflection to get the batch index from MessageId:

Pulsar Spark Connector casts MessageId to the implementations for these internal fields:

https://github.com/streamnative/pulsar-spark/blob/fa131552ab5e857f801b91a58f97b8a09bf3eecc/src/main/scala/org/apache/spark/sql/pulsar/PulsarSources.scala#L73

The Spark case is extremely terrible, because it depends on many more internal methods like TopicMessageIdImpl#getInnerMessageId.

  def mid2Impl(mid: MessageId): MessageIdImpl = {
    mid match {
      case bmid: BatchMessageIdImpl =>
        new MessageIdImpl(bmid.getLedgerId, bmid.getEntryId, bmid.getPartitionIndex)
      case midi: MessageIdImpl => midi
      case t: TopicMessageIdImpl => mid2Impl(t.getInnerMessageId)
      case up: UserProvidedMessageId => mid2Impl(up.mid)
    }
  }

Before considering what if we replace bookkeeper with other some other storage?, I'd like to ask:

  • What if the getInnerMessageId method was removed?
  • What if the TopicMessageIdImpl class was removed?

I believe can find more cases the external applications want to access these internal fields. Even if we didn't encourage them to do that. The MessageIdAdv interface is not added to encourage users to use internal data. Instead, many users already assume the storage is BK and want to use internal data but they have to access unstable methods in the implementations module (pulsar-client). It brings the problem that each minor modification of the implementation could bring breaking changes.

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Apr 19, 2023
### Motivation

apache#19414 does not follow the design
of apache#18950

> Since the aimed developers are Pulsar core developers, it's added in
> the pulsar-common module (PulsarApi.proto is also in this module), not
> the pulsar-client-api module.

The reason is that `TopicMessageId#create` now cannot be a
`MessageIdAdv` if `MessageIdAdv` is not in the `pulsar-client-api`
module.

### Modifications

- Move the `MessageIdAdv` class to the `pulsar-common` module.
- To handle the instance created by `TopicMessageId` well, add
  `MessageIdAdvUtils#convert` to add an extra deserialization and
  serialization of `MessageId`.
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Apr 19, 2023
### Motivation

apache#19414 does not follow the design
of apache#18950

> Since the aimed developers are Pulsar core developers, it's added in
> the pulsar-common module (PulsarApi.proto is also in this module), not
> the pulsar-client-api module.

The reason is that `TopicMessageId#create` now cannot be a
`MessageIdAdv` if `MessageIdAdv` is not in the `pulsar-client-api`
module.

### Modifications

- Move the `MessageIdAdv` class to the `pulsar-common` module.
- Create a `TopicMessageIdImpl` instance for `TopicMessageId#create` via
  the `DefaultImplementation` class with the overhead of reflection.
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Apr 19, 2023
### Motivation

apache#19414 does not follow the design
of apache#18950

> Since the aimed developers are Pulsar core developers, it's added in
> the pulsar-common module (PulsarApi.proto is also in this module), not
> the pulsar-client-api module.

The reason is that `TopicMessageId#create` now cannot be a
`MessageIdAdv` if `MessageIdAdv` is not in the `pulsar-client-api`
module.

### Modifications

- Move the `MessageIdAdv` class to the `pulsar-common` module.
- Implement the `MessageIdAdv` interface in `TopicMessageIdImpl` instead
  of `TopicMessageId.Impl`.
- Create a `TopicMessageIdImpl` instance for `TopicMessageId#create` via
  the `DefaultImplementation` class with the overhead of reflection.
@Shawyeok
Copy link
Contributor

Add MessageIdAdvUtils for the most common used methods.
Replace BatchMessageAcker with the BitSet for ACK.

There is a race condition on acknowledge with batch message, details #22352

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/client doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. ready-to-test type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages type/refactor Code or documentation refactors. e.g. refactor code structure or methods to improve code readability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants