[improve][client] PIP-229: Add a common interface to get fields of MessageIdData #19414
…ssageIdData

Master issue: apache#18950

### Motivation

We need a common interface to get fields of the MessageIdData. After that, we won't need to assert a MessageId implementation is an instance of a specific class. And we can pass our customized MessageId implementation to APIs like `acknowledge` and `seek`.

### Modifications

- Add `MessageIdAdv` to get fields of `MessageIdData`, make all MessageId implementations inherit it (except `MultiMessageIdImpl`).
- Add `MessageIdAdvUtils` for the most common used methods.
- Replace `BatchMessageAcker` with the `BitSet` for ACK.
- Remove `TopicMessageIdImpl#getInnerMessageId` since a `TopicMessageIdImpl` can be treated as its underlying `MessageId` implementation now.
- Remove `instanceof BatchMessageIdImpl` checks in `pulsar-client` module by casting to `MessageIdAdv`.

After this refactoring, the 3rd party library will no longer need to cast a `MessageId` to a specific implementation. It only needs to cast `MessageId` to `MessageIdAdv`. Users can also implement their own util class so the methods of `MessageIdAdvUtils` are all not public.

### Verifications

Add `CustomMessageIdTest` to verify a simple MessageIdAdv implementation that only has the (ledger id, entry id, batch idx, batch size) fields also works for seek and acknowledgment.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/TopicMessageImpl.java
@BewareMyPower this is a mess. We have discussed multiple times in the Pulsar community that we should avoid any util class for MessageId, because it hurts and complicates integration with other components (e.g. the Storm/Spark adapters, Functions). This discussion has happened many times, and we rejected the idea.
     *
     * @return the ledger ID
     */
    long getLedgerId();
Oh, this is wrong. We should not expose ledgerId/entryId in any public interface. We are opening a can of worms that the Pulsar community has prevented many times. We can't check every PR all the time, but this PR must be reverted to avoid exposing internal details of Pulsar. This API will be difficult to support in the future if the storage layer is no longer bookie or the bookie API changes.
This proposal was discussed and modified into the current version of this PR in https://lists.apache.org/thread/25rzflmkfmvxhf3my0ombnbpv7bvgy32 and received 3 binding +1s and 2 non-binding +1s in https://lists.apache.org/thread/kmjq6lf1f11mf6qb8onhnlr17n27fcv4.
The PIP is here: #18950
As I've explained in the PIP:
> These details might be not much useful to application users, but they are important to developers of Pulsar and its ecosystems.

> It will be difficult in future to support this API if storage layer will not be bookie or bookie API will be changed.

Currently `MessageId` is still exposed to users, and it does not expose these internal details. The new `MessageIdAdv` interface is only used for convenience. Users should write the following code after this PR:
    if (msgId instanceof MessageIdAdv) {
        MessageIdAdv msgIdAdv = (MessageIdAdv) msgId;
        // Get the ledger id, entry id directly
    } else {
        // Handle the case when the storage layer might change in future
    }
Before this PR, there was a lot of code like:

    if (msgId instanceof MessageIdImpl) {
        MessageIdImpl impl = (MessageIdImpl) msgId;
        // ...
    } else {
        // Check if msgId is another implementation...
    }
You can also see these code references in the changes of this PR. If you're going to revert this PIP, please at least start a discussion on the mailing list to hear more voices.
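The pattern described in this comment can be sketched in isolation. The snippet below is a minimal, self-contained illustration, not Pulsar's actual code: `MessageId`, `MessageIdAdv`, `SimpleId` and `OpaqueId` are hypothetical stand-ins defined locally, and the accessor set is an assumption based on the (ledger id, entry id, batch idx, batch size) fields named in the PR description.

```java
public class MessageIdCastSketch {
    /** Hypothetical stand-in for the opaque public message id type. */
    interface MessageId {}

    /** Hypothetical stand-in for a common accessor interface in the spirit of MessageIdAdv. */
    interface MessageIdAdv extends MessageId {
        long getLedgerId();
        long getEntryId();
        default int getBatchIndex() { return -1; } // assumed default: not part of a batch
        default int getBatchSize() { return 0; }
    }

    /** A custom id that carries only the four fields. */
    static final class SimpleId implements MessageIdAdv {
        private final long ledgerId;
        private final long entryId;
        private final int batchIndex;
        private final int batchSize;

        SimpleId(long ledgerId, long entryId, int batchIndex, int batchSize) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.batchIndex = batchIndex;
            this.batchSize = batchSize;
        }

        @Override public long getLedgerId() { return ledgerId; }
        @Override public long getEntryId() { return entryId; }
        @Override public int getBatchIndex() { return batchIndex; }
        @Override public int getBatchSize() { return batchSize; }
    }

    /** An id from a storage layer that has no ledger/entry notion. */
    static final class OpaqueId implements MessageId {}

    /** One instanceof check against the common interface, no implementation classes. */
    static String describe(MessageId msgId) {
        if (msgId instanceof MessageIdAdv) {
            MessageIdAdv adv = (MessageIdAdv) msgId;
            return "ledger=" + adv.getLedgerId() + ", entry=" + adv.getEntryId()
                    + ", batch=" + adv.getBatchIndex() + "/" + adv.getBatchSize();
        }
        // The storage details are not available for this id.
        return "opaque";
    }

    public static void main(String[] args) {
        System.out.println(describe(new SimpleId(3, 7, 1, 10))); // ledger=3, entry=7, batch=1/10
        System.out.println(describe(new OpaqueId()));            // opaque
    }
}
```

The caller depends on one interface only, so a custom implementation such as `SimpleId` is handled the same way as any built-in one.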
We should not let users use ledgerId/entryId. What if we replace BookKeeper with some other storage? In that case, such code will break. Exposing internal data and encouraging users to rely on it is not a good design decision. I didn't review this PIP before, but this discussion has happened multiple times, and the Pulsar community rejected the proposal in the past. As we have not done any release yet, it's better to revert this PR soon.
> we should not let users use ledgerId/entryId.

That's an idealized assumption. In the real world, a lot of code uses the ledgerId/entryId from the `MessageId`. Such code has to hack into the `pulsar-client` module and cast `MessageId` to a specific implementation.
> what if we replace bookkeeper with other some other storage?

When users need to access the detailed fields, they have already assumed that the storage is BookKeeper. Otherwise, there is no need to get the storage details from the `MessageId`.
The fact is that Pulsar provides a hard-to-use interface, `MessageId`. As a result, users, including ecosystem developers, have to dig into the implementation details. I don't like saying that these developers use Pulsar in the wrong way. I'd rather say they are limited by the poor interface that Pulsar provides. Keeping the limitation is harmful to the Pulsar community and could make ecosystem applications flaky and easy to break.
> it takes a long time to correct it and I don't want to see it again.

The point is that Pulsar is now open source, and many more external applications need the ability to access these internal fields. When Pulsar was not open source, the use cases were limited. But things are different now.
I don't have much time to keep debating this, and I have already spent too much time clarifying the motivation and finding the code examples.
Let's use the mailing list to make decisions and hear more voices.
I don't think Pulsar has a poor interface or any limitation. MessageId is just a reference from Pulsar, and it should have correct serialization and deserialization methods. I don't know of any system that encourages users to extract the message ID and depend on its internals, or that provides a contract for doing so.
So I don't think it's fair to say that Pulsar has limitations or poor APIs and interfaces. Abstraction is an important part of an API contract, and that's what Pulsar is doing. Why would keeping MessageId abstract from the user break an application? That's exactly why user applications should not depend on what is behind the abstraction and should not try to hack around it.
> When Pulsar was not open sourced, the use cases were limited. But things are different now.

It's not about use cases; it's about learning to do the right thing.
Pulsar Functions converts the
pulsar/pulsar-functions/utils/src/main/java/org/apache/pulsar/functions/utils/FunctionCommon.java Lines 328 to 330 in d1fc732
Kafka Connect Sink even uses reflection to get the batch index from
Line 404 in d1fc732
Pulsar Spark Connector casts
The Spark case is extremely terrible, because it depends on many more internal methods like

    def mid2Impl(mid: MessageId): MessageIdImpl = {
      mid match {
        case bmid: BatchMessageIdImpl =>
          new MessageIdImpl(bmid.getLedgerId, bmid.getEntryId, bmid.getPartitionIndex)
        case midi: MessageIdImpl => midi
        case t: TopicMessageIdImpl => mid2Impl(t.getInnerMessageId)
        case up: UserProvidedMessageId => mid2Impl(up.mid)
      }
    }

Before considering "what if we replace bookkeeper with other some other storage?", I'd like to ask:
I believe we can find more cases where external applications want to access these internal fields, even if we didn't encourage them to do that.
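For contrast with the class-by-class match in `mid2Impl`, here is a rough sketch of how the same flattening might look when every ID exposes a common accessor interface. Everything in it is hypothetical and defined locally for illustration (`MessageIdAdv`, `BaseId`, `BatchId`, `TopicId`, `toBase`); it is not the Spark connector's code or Pulsar's API.

```java
public class BaseIdSketch {
    /** Hypothetical common accessor interface; the defaults are assumptions. */
    interface MessageIdAdv {
        long getLedgerId();
        long getEntryId();
        default int getPartitionIndex() { return -1; }
        default int getBatchIndex() { return -1; }
    }

    /** Plain (ledger, entry, partition) id, the target of the conversion. */
    static final class BaseId implements MessageIdAdv {
        final long ledgerId;
        final long entryId;
        final int partitionIndex;

        BaseId(long ledgerId, long entryId, int partitionIndex) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.partitionIndex = partitionIndex;
        }

        @Override public long getLedgerId() { return ledgerId; }
        @Override public long getEntryId() { return entryId; }
        @Override public int getPartitionIndex() { return partitionIndex; }
    }

    /** Id of one message inside a batch. */
    static final class BatchId implements MessageIdAdv {
        final BaseId base;
        final int batchIndex;

        BatchId(long ledgerId, long entryId, int partitionIndex, int batchIndex) {
            this.base = new BaseId(ledgerId, entryId, partitionIndex);
            this.batchIndex = batchIndex;
        }

        @Override public long getLedgerId() { return base.getLedgerId(); }
        @Override public long getEntryId() { return base.getEntryId(); }
        @Override public int getPartitionIndex() { return base.getPartitionIndex(); }
        @Override public int getBatchIndex() { return batchIndex; }
    }

    /** Id tagged with a topic name; it delegates every accessor to the wrapped id. */
    static final class TopicId implements MessageIdAdv {
        final String topic;
        final MessageIdAdv inner;

        TopicId(String topic, MessageIdAdv inner) {
            this.topic = topic;
            this.inner = inner;
        }

        @Override public long getLedgerId() { return inner.getLedgerId(); }
        @Override public long getEntryId() { return inner.getEntryId(); }
        @Override public int getPartitionIndex() { return inner.getPartitionIndex(); }
        @Override public int getBatchIndex() { return inner.getBatchIndex(); }
    }

    /** One conversion for every implementation: no match on concrete classes, no unwrapping. */
    static BaseId toBase(MessageIdAdv id) {
        return new BaseId(id.getLedgerId(), id.getEntryId(), id.getPartitionIndex());
    }

    public static void main(String[] args) {
        BaseId base = toBase(new TopicId("my-topic", new BatchId(5L, 9L, 2, 4)));
        System.out.println(base.getLedgerId() + ":" + base.getEntryId() + ":" + base.getPartitionIndex());
    }
}
```

Because the wrapper delegates its accessors, the conversion does not need to know that a wrapper exists, which is the property the PR relies on when it removes `TopicMessageIdImpl#getInnerMessageId`.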
### Motivation

apache#19414 does not follow the design of apache#18950

> Since the aimed developers are Pulsar core developers, it's added in
> the pulsar-common module (PulsarApi.proto is also in this module), not
> the pulsar-client-api module.

The reason is that `TopicMessageId#create` now cannot be a `MessageIdAdv` if `MessageIdAdv` is not in the `pulsar-client-api` module.

### Modifications

- Move the `MessageIdAdv` class to the `pulsar-common` module.
- To handle the instance created by `TopicMessageId` well, add `MessageIdAdvUtils#convert` to add an extra deserialization and serialization of `MessageId`.

### Motivation

apache#19414 does not follow the design of apache#18950

> Since the aimed developers are Pulsar core developers, it's added in
> the pulsar-common module (PulsarApi.proto is also in this module), not
> the pulsar-client-api module.

The reason is that `TopicMessageId#create` now cannot be a `MessageIdAdv` if `MessageIdAdv` is not in the `pulsar-client-api` module.

### Modifications

- Move the `MessageIdAdv` class to the `pulsar-common` module.
- Create a `TopicMessageIdImpl` instance for `TopicMessageId#create` via the `DefaultImplementation` class with the overhead of reflection.

### Motivation

apache#19414 does not follow the design of apache#18950

> Since the aimed developers are Pulsar core developers, it's added in
> the pulsar-common module (PulsarApi.proto is also in this module), not
> the pulsar-client-api module.

The reason is that `TopicMessageId#create` now cannot be a `MessageIdAdv` if `MessageIdAdv` is not in the `pulsar-client-api` module.

### Modifications

- Move the `MessageIdAdv` class to the `pulsar-common` module.
- Implement the `MessageIdAdv` interface in `TopicMessageIdImpl` instead of `TopicMessageId.Impl`.
- Create a `TopicMessageIdImpl` instance for `TopicMessageId#create` via the `DefaultImplementation` class with the overhead of reflection.
There is a race condition when acknowledging batch messages; see #22352 for details.
Master issue: #18950

### Motivation

We need a common interface to get fields of the MessageIdData. After that, we won't need to assert a MessageId implementation is an instance of a specific class. And we can pass our customized MessageId implementation to APIs like `acknowledge` and `seek`.

### Modifications

- Add `MessageIdAdv` to get fields of `MessageIdData`, make all MessageId implementations inherit it (except `MultiMessageIdImpl`).
- Add `MessageIdAdvUtils` for the most common used methods.
- Replace `BatchMessageAcker` with the `BitSet` for ACK.
- Remove `TopicMessageIdImpl#getInnerMessageId` since a `TopicMessageIdImpl` can be treated as its underlying `MessageId` implementation now.
- Remove `instanceof BatchMessageIdImpl` checks in `pulsar-client` module by casting to `MessageIdAdv`.

After this refactoring, the 3rd party library will no longer need to cast a `MessageId` to a specific implementation. It only needs to cast `MessageId` to `MessageIdAdv`. Users can also implement their own util class so the methods of `MessageIdAdvUtils` are all not public.

### Verifications

Add `CustomMessageIdTest` to verify a simple MessageIdAdv implementation that only has the (ledger id, entry id, batch idx, batch size) fields also works for seek and acknowledgment.

### Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: BewareMyPower#19
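
The Modifications list in the description replaces `BatchMessageAcker` with a `BitSet` for ACK. As a rough illustration of that idea (not the PR's actual implementation), a batch can be tracked with one bit per message: all bits start set, an acknowledgment clears bits, and the batch is fully acknowledged once the set is empty. The class and method names below are hypothetical. The `synchronized` keyword is an assumption of this sketch, added because `java.util.BitSet` is not safe for concurrent use without external synchronization.

```java
import java.util.BitSet;

public class BatchAckSketch {
    // One bit per message in the batch; a set bit means "not yet acknowledged".
    private final BitSet pending;

    BatchAckSketch(int batchSize) {
        pending = new BitSet(batchSize);
        pending.set(0, batchSize);
    }

    /** Acknowledges a single batch index; returns true once the whole batch is acknowledged. */
    synchronized boolean ackIndividual(int batchIndex) {
        pending.clear(batchIndex);
        return pending.isEmpty();
    }

    /** Acknowledges every index up to and including batchIndex. */
    synchronized boolean ackCumulative(int batchIndex) {
        pending.clear(0, batchIndex + 1); // the upper bound of clear(from, to) is exclusive
        return pending.isEmpty();
    }

    /** Number of messages in the batch that are still unacknowledged. */
    synchronized int outstanding() {
        return pending.cardinality();
    }

    public static void main(String[] args) {
        BatchAckSketch batch = new BatchAckSketch(3);
        System.out.println(batch.ackIndividual(0)); // false
        System.out.println(batch.ackIndividual(2)); // false
        System.out.println(batch.ackIndividual(1)); // true
    }
}
```

Only when a method returns `true` would the caller acknowledge the underlying entry as a whole; how partial acknowledgments are reported to the broker is outside the scope of this sketch.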