
Issue 191: Add at-least-once stream processing example #192

Open · wants to merge 82 commits into dev from issue-191-streamprocessing

Conversation

@claudiofahey commented Mar 14, 2019

These examples illustrate how at-least-once semantics can be achieved with Pravega alone; in particular, they do not use Apache Flink.


@claudiofahey requested a review from tkaitchuck March 17, 2019 04:25
@tkaitchuck (Member) left a comment


Quoting @claudiofahey:

But consider what happens if the application crashes after commit() but before the reader group is updated with readNextEvent. Upon recovery, the reader group will resume at the previous checkpoint and duplicate records will be created. To avoid this, we still need to use the two-phase commit process of flush, persist transaction IDs, commit, update reader group. I may be able to do this with another Pravega stream instead of a shared file system but I don't see how to completely eliminate this need.

Yes. This is the reason the readerOffline() call takes a Position object. It allows you to tell Pravega exactly what data was processed and what data was not when the worker dies.

So yes, the two-phase scheme you are using works, but you could make it better by doing it per reader as opposed to per reader group.
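For illustration, a minimal sketch of the per-reader recovery path (the recoverPersistedPosition helper and deadReaderId are hypothetical; it assumes the Position was persisted atomically with the transaction commit):

import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.Position;
import io.pravega.client.stream.ReaderGroup;

// On recovery, declare the dead reader offline with its last persisted Position,
// so the reader group resumes exactly after the committed data instead of
// replaying from the last group-wide checkpoint.
try (ReaderGroupManager manager = ReaderGroupManager.withScope(scope, controllerURI)) {
    final ReaderGroup readerGroup = manager.getReaderGroup(readerGroupName);
    final Position lastPosition = recoverPersistedPosition(deadReaderId); // hypothetical helper
    readerGroup.readerOffline(deadReaderId, lastPosition);
}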

@EronWright removed their request for review May 2, 2019 01:02
@RaulGracia (Contributor)

@claudiofahey due to issues we had in previous releases merging develop into master, we have changed the release approach and the current development branch is dev. We plan to delete develop to avoid confusion, so could you please re-open this PR against dev? Thanks!

@claudiofahey force-pushed the issue-191-streamprocessing branch from a6f9340 to f4e5b94 on August 13, 2019 22:24
@claudiofahey changed the base branch from develop to dev August 13, 2019 22:26
@claudiofahey (Author)

This PR is ready for re-review.

@claudiofahey force-pushed the issue-191-streamprocessing branch from f4e5b94 to 0e57327 on June 1, 2020 00:11
@claudiofahey (Author)

The latest commit should provide at-least-once semantics in a clean way using only Pravega. I believe the code is complete but it does not have any automated tests yet. We also need to update the README to describe how this works.

@claudiofahey requested review from tkaitchuck and fpj and removed the request for elizabethbain June 21, 2020 05:56
@claudiofahey removed the request for review from vijikarthi August 4, 2020 04:24
@claudiofahey (Author)

This PR is ready for review. For a description of this PR, see https://github.com/claudiofahey/pravega-samples/blob/issue-191-streamprocessing/pravega-client-examples/src/main/java/io/pravega/example/streamprocessing/README.md.
Unit and integration tests can be run with ./gradlew pravega-client-examples:build.

@fpj (Contributor) left a comment


It looks good; I only have a couple of small comments.

return getEnvVar("INSTANCE_ID", UUID.randomUUID().toString());
}

public String getStream1Name() {
@fpj (Contributor):

Minor comment, but could we use input stream and output stream rather than stream 1 and 2?

@claudiofahey (Author) replied Aug 21, 2020:

AppConfiguration is shared by EventGenerator, AtLeastOnceApp, and EventDebugSink. Stream 1 is the output for EventGenerator and the input for AtLeastOnceApp. Stream 2 is the output for AtLeastOnceApp and the input for EventDebugSink. Because each stream is an input from one application's point of view and an output from another's, I chose the neutral names Stream 1 and Stream 2. It is a little odd, I admit. I added comments to the code to clarify this.

// We must ensure that we add this reader to the membership synchronizer before the reader group.
membershipSynchronizer.startAsync();
membershipSynchronizer.awaitRunning();
task = executor.scheduleAtFixedRate(new PruneRunner(), heartbeatIntervalMillis, heartbeatIntervalMillis, TimeUnit.MILLISECONDS);
@fpj (Contributor):

I'm wondering if it would be best to add some randomization to the period between runs of the prune runner. If distinct readers run it with the same period and close in phase, then they would call readerOffline unnecessarily for the same offline readers.

I don't feel very strongly about the comment because:

  • reader offline is idempotent
  • crashes are rare
  • it should only matter for larger reader groups

In any case, if you feel that you want to introduce it here, then I think it is a small improvement.

@claudiofahey (Author):

I have randomized the initial delay and the period as you have suggested.
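For reference, a sketch of what the randomized schedule might look like (the jitter bounds here are illustrative, not the actual values in the commit):

// Jitter the initial delay and period so distinct readers do not run the
// prune task in lockstep; bounds are illustrative.
final Random random = new Random();
final long initialDelay = random.nextInt((int) heartbeatIntervalMillis);
final long period = heartbeatIntervalMillis / 2 + random.nextInt((int) heartbeatIntervalMillis);
task = executor.scheduleAtFixedRate(new PruneRunner(), initialDelay, period, TimeUnit.MILLISECONDS);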

Claudio Fahey added 4 commits August 21, 2020 04:54
- Removed deleteStream calls.
- No longer waiting for transactions to go into COMMIT state after commit() call.
- Now using Callable instead of Runnable.
- Improved shutdown cleanup.

Signed-off-by: Claudio Fahey <[email protected]>
…aging state.

- Added additional startup scripts.
- Updated default parameters.
- Updated documentation.

Signed-off-by: Claudio Fahey <[email protected]>
Claudio Fahey added 3 commits March 19, 2021 17:30
@claudiofahey (Author)

@RaulGracia, I have addressed all of your comments. However, it appears that there is an intermittent timeout failure in the test kill5of6Test. I will continue to investigate.

@RaulGracia (Contributor)

@claudiofahey maybe the failures you are seeing are related to the sporadic failures we see in the Pravega core repo in tests involving Pravega standalone: pravega/pravega#5864

@RaulGracia (Contributor)

@claudiofahey I tested the sample and it works. I also compiled it with JDK 11 and JDK 8 (./gradlew installDist) and it works. However, I also observed the failure in the kill5of6Test test. Please, let's try to get this test passing so we can merge this one soon.

Claudio Fahey added 3 commits March 25, 2021 19:10
@claudiofahey requested a review from RaulGracia March 25, 2021 19:19
@claudiofahey (Author)

FYI, I ran the integration tests 20 times and all runs passed:

for i in {1..20}; do ./gradlew clean test; done |& tee -a /tmp/test.log
grep "SUCCESSFUL" /tmp/test.log
BUILD SUCCESSFUL in 3m 42s
BUILD SUCCESSFUL in 3m 31s
BUILD SUCCESSFUL in 3m 56s
BUILD SUCCESSFUL in 3m 41s
BUILD SUCCESSFUL in 3m 47s
BUILD SUCCESSFUL in 6m 25s
BUILD SUCCESSFUL in 4m 18s
BUILD SUCCESSFUL in 4m 34s
BUILD SUCCESSFUL in 4m 25s
BUILD SUCCESSFUL in 4m 32s
BUILD SUCCESSFUL in 4m 16s
BUILD SUCCESSFUL in 4m 16s
BUILD SUCCESSFUL in 6m 29s
BUILD SUCCESSFUL in 4m 5s
BUILD SUCCESSFUL in 4m 8s
BUILD SUCCESSFUL in 4m 6s
BUILD SUCCESSFUL in 4m 26s
BUILD SUCCESSFUL in 4m 29s
BUILD SUCCESSFUL in 4m 2s
BUILD SUCCESSFUL in 3m 37s

@RaulGracia (Contributor)

@claudiofahey tests passed for me. I had to upgrade the Scala version of Spark in the Hadoop sample, as the build was broken by one of the previous PRs. But now it looks good, thanks!

@RaulGracia (Contributor)

@tkaitchuck please, would you mind doing another review to either approve it or request more changes?

processor.startAsync();

// Add shutdown hook for graceful shutdown.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
@tkaitchuck (Member):

Shouldn't the processor do this itself, if this is needed?

@claudiofahey (Author):

It gets messy to do this in the AtLeastOnceProcessor because then it would need to deal with removing the shutdown hook, but only when the JVM is not actually shutting down. The point of this shutdown hook is to shut down the whole application gracefully, which in this case happens to consist of only one AtLeastOnceProcessor service. But in general, an application may consist of many services that all need to be shut down in a particular order. So it seems appropriate that AtLeastOnceApp should coordinate the shutdown.
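A minimal sketch of that coordination (assuming the processor is a Guava Service, as the startAsync call above suggests):

// The application, not the processor, installs the hook and decides the
// shutdown order of its services.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    processor.stopAsync();       // ask the service to stop reading
    processor.awaitTerminated(); // wait until in-flight events are flushed
}));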


public void run() throws Exception {
final ClientConfig clientConfig = ClientConfig.builder().controllerURI(getConfig().getControllerURI()).build();
try (StreamManager streamManager = StreamManager.create(getConfig().getControllerURI())) {
@tkaitchuck (Member):

I don't see the change

}
}
} finally {
readerGroupManager.deleteReaderGroup(readerGroup);
@tkaitchuck (Member):

I don't think we necessarily want to delete the group because one reader shut down.

@claudiofahey (Author):

In this case we do want to delete the reader group. Each instance of EventDebugSink will read the entire stream so it creates a random reader group for itself and then it cleans up when it is done. There's no need for load balancing between EventDebugSink instances (in contrast to AtLeastOnceApp).
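A sketch of that pattern (names are hypothetical):

// Each EventDebugSink instance uses its own throwaway reader group and
// deletes it when done; no coordination with other instances is needed.
final String readerGroup = "debug-" + UUID.randomUUID().toString().replace("-", "");
readerGroupManager.createReaderGroup(readerGroup, readerGroupConfig);
try {
    // read the entire stream from the beginning ...
} finally {
    readerGroupManager.deleteReaderGroup(readerGroup);
}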

@claudiofahey (Author):

I also added a createStream private method to this class.
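For reference, such a helper might look like this (the fixed(1) scaling policy is an assumption for illustration):

private void createStream(StreamManager streamManager, String streamName) {
    // Idempotent: createScope and createStream return false if the scope or
    // stream already exists.
    streamManager.createScope(getConfig().getScope());
    streamManager.createStream(
            getConfig().getScope(),
            streamName,
            StreamConfiguration.builder().scalingPolicy(ScalingPolicy.fixed(1)).build());
}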


public void run() throws Exception {
final ClientConfig clientConfig = ClientConfig.builder().controllerURI(getConfig().getControllerURI()).build();
try (StreamManager streamManager = StreamManager.create(getConfig().getControllerURI())) {
@tkaitchuck (Member):

"Same here" referred to making this a private method.

@claudiofahey requested a review from tkaitchuck March 27, 2021 01:56
createStreams();
final Random rand = new Random(42);
try (final EventStreamClientFactory clientFactory = EventStreamClientFactory.withScope(getConfig().getScope(), clientConfig)) {
try (final EventStreamWriter<SampleEvent> writer = clientFactory.createEventWriter(
@tkaitchuck (Member):

These don't need to nest. These two items can be in the same try block.
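For example, both resources can be declared in a single try-with-resources statement (the serializer and writer config here are assumptions):

// One try block; resources are closed in reverse order of declaration.
try (final EventStreamClientFactory clientFactory =
             EventStreamClientFactory.withScope(getConfig().getScope(), clientConfig);
     final EventStreamWriter<SampleEvent> writer = clientFactory.createEventWriter(
             getConfig().getStream1Name(),
             new JavaSerializer<SampleEvent>(),   // serializer is an assumption
             EventWriterConfig.builder().build())) {
    // write events ...
}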

long sum = 0;
for (; ; ) {
sequenceNumber++;
final String routingKey = String.format("%3d", rand.nextInt(1000));
@tkaitchuck (Member):

Random locals like this don't need to be declared final.

public class StreamProcessingTest {
static final Logger log = LoggerFactory.getLogger(StreamProcessingTest.class);

protected static final AtomicReference<SetupUtils> SETUP_UTILS = new AtomicReference<>();
@tkaitchuck (Member):

This is bad form. Please use a resource to avoid this pattern.
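One way to express this as a resource is a JUnit 4 ExternalResource rule (a sketch; the SetupUtils start/stop method names are assumptions):

// Manage SetupUtils with a @ClassRule instead of a static AtomicReference.
@ClassRule
public static final ExternalResource SETUP = new ExternalResource() {
    private final SetupUtils setupUtils = new SetupUtils();

    @Override
    protected void before() throws Throwable {
        setupUtils.startAllServices(); // assumed method name
    }

    @Override
    protected void after() {
        try {
            setupUtils.stopAllServices(); // assumed method name
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
};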

@RaulGracia (Contributor)

@claudiofahey can you address @tkaitchuck's comments so we can close this one? I think we are close.

Position lastFlushedPosition = null;
try {
while (isRunning()) {
final EventRead<T> eventRead = reader.readNextEvent(readTimeoutMillis);
Reviewer:

readNextEvent can throw a TruncatedDataException when stream retention kicks in. We would need to retry reading the event in such a scenario, right?
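A sketch of the retry the comment suggests (assuming the loop variables from the snippet above; after a TruncatedDataException, the next readNextEvent call resumes at the first event that still exists):

// Tolerate truncation caused by the retention policy.
EventRead<T> eventRead = null;
while (eventRead == null) {
    try {
        eventRead = reader.readNextEvent(readTimeoutMillis);
    } catch (TruncatedDataException e) {
        // Data before the truncation point has expired; retrying continues
        // from the next available event.
        log.warn("Data truncated; continuing past the truncation point", e);
    }
}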

@RaulGracia (Contributor)

@claudiofahey any updates on addressing @tkaitchuck's feedback? This one actually looks close to being merged; wondering if it is worth doing a last push for it.
