Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

msglist: Throttle fetchOlder retries #1050

Merged
merged 3 commits into from
Dec 11, 2024
Merged

msglist: Throttle fetchOlder retries #1050

merged 3 commits into from
Dec 11, 2024

Conversation

PIG208
Copy link
Member

@PIG208 PIG208 commented Nov 6, 2024

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with connection.takeRequests
without checking, as we are only interested in verifying that no request
was sent.

Fixes: #945

@PIG208 PIG208 force-pushed the pr-storm branch 4 times, most recently from 567be0d to 4bf2482 Compare November 6, 2024 23:38
@PIG208 PIG208 added the maintainer review PR ready for review by Zulip maintainers label Nov 6, 2024
@chrisbobbe
Copy link
Collaborator

(Rerunning CI.)

@PIG208 PIG208 force-pushed the pr-storm branch 2 times, most recently from 8b14db4 to 62e334c Compare November 7, 2024 22:41
@PIG208 PIG208 requested a review from chrisbobbe November 7, 2024 22:55
Copy link
Collaborator

@chrisbobbe chrisbobbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good to get this bug fixed. Comments below.

Also, this change breaks an invariant we had before: whenever the top of the message list is scrolled into view, we would show a "start marker", which was either a loading indicator or some text saying there aren't any older messages to load.

We currently only show the loading indicator when fetchingOlder is true, and that's false when we're in the new "cooldown" period after a failed fetch-older request. How about also showing the loading indicator during the cooldown period? When a request is failing over and over, the cooldown period will quickly become multiple seconds long, with fetchingOlder flickering to true in between for perhaps milliseconds at a time; possibly not long enough for a frame.

Then, either here or as a followup, we could give the UI an "error"/"problem" state that's distinct from the loading state. If the request has failed, and especially if it's failed several times, that's an important sign that maybe the user should stop expecting it to succeed. Maybe MessageListLoadingItem could get another param like bool problem, for which we pass true when the BackoffMachine.waitsCompleted exceeds a certain value, like 4. And in that case make the loading indicator look like

image

or something, instead of just

image

.

Comment on lines 499 to 500
bool _waitBeforeRetry = false;
BackoffMachine? _backoffMachine;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can find a more fitting name than _waitBeforeRetry. I think that name might make it harder to see that no code is actually scheduling a retry after waiting for anything. It's also not clear from the name that it's about fetching the next batch of older messages, i.e., fetchOlder.

What if we:

  • Move this up near fetchingOlder, which is related and has a similar role
  • Make this public, with a name like fetchOlderCoolingDown
  • Give this a dartdoc and expand on fetchingOlder's dartdoc to make it clearer that they have similar roles

So, for example:

--- lib/model/message_list.dart
+++ lib/model/message_list.dart
@@ -92,9 +92,30 @@ mixin _MessageSequence {
   bool _haveOldest = false;
 
   /// Whether we are currently fetching the next batch of older messages.
+  ///
+  /// When this is true, [fetchOlder] is a no-op.
+  /// That method is called frequently by Flutter's scrolling logic,
+  /// and this field helps us avoid spamming the same request just to get
+  /// the same response each time.
+  ///
+  /// See also [fetchOlderCoolingDown].
   bool get fetchingOlder => _fetchingOlder;
   bool _fetchingOlder = false;
 
+  /// Whether [fetchOlder] had a request error recently.
+  ///
+  /// When this is true, [fetchOlder] is a no-op.
+  /// That method is called frequently by Flutter's scrolling logic,
+  /// and this field mitigates spamming the same request and getting
+  /// the same error each time.
+  ///
+  /// "Recently" is decided by a [BackoffMachine] that resets
+  /// when a [fetchOlder] request succeeds.
+  ///
+  /// See also [fetchingOlder].
+  bool get fetchOlderCoolingDown => _fetchOlderCoolingDown;
+  bool _fetchOlderCoolingDown = false;
+
   /// The parsed message contents, as a list parallel to [messages].
   ///
   /// The i'th element is the result of parsing the i'th element of [messages].

@@ -528,6 +545,7 @@ class MessageListView with ChangeNotifier, _MessageSequence {

_insertAllMessages(0, fetchedMessages);
_haveOldest = result.foundOldest;
_backoffMachine = null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also do this in _reset?

@PIG208

This comment was marked as outdated.

@PIG208
Copy link
Member Author

PIG208 commented Nov 12, 2024

The PR has been updated with the proposed changes, and additionally, a check for this.generation == generation to avoid potential races after adding _updateEndMarkers and notifyListeners to the backoff callback.

I have also moved the backoff machine variable, renamed to _fetchOlderCooldownBackoffMachine with Cooldown as a noun, to _MessageSequence right next to fetchOlderCoolingDown, because they are relevant to each other. The backoff machine does not have a public getter and we keep it private.

The loading indicator change might be out of scope for this PR, and we should work on that as a follow-up.

@chrisbobbe
Copy link
Collaborator

Looks like CI is failing, could you take a look?

@PIG208
Copy link
Member Author

PIG208 commented Nov 13, 2024

Updated the PR. Thanks!

Copy link
Collaborator

@chrisbobbe chrisbobbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Small comments below.

Comment on lines 534 to 579
// ignore: control_flow_in_finally
return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When disabling a lint rule (especially one that says it's meant to prevent bugs or unexpected behavior), please explain in a code comment why it's OK.

checkNotified(count: 2);
check(model).fetchOlderCoolingDown.isTrue();

connection.takeRequests();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msglist: Throttle fetchOlder retries

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: https://github.com/zulip/zulip-flutter/issues/945

Signed-off-by: Zixuan James Li <[email protected]>

The paragraph about the tests is a pretty small implementation detail that I think doesn't need a special mention in the commit message. On the other hand, it seems maybe helpful as an implementation comment directly on the code it's about.

Comment on lines 1032 to 1045
await store.handleEvent(eg.updateMessageEventMoveTo(
origTopic: movedMessages[0].topic,
origStreamId: otherStream.streamId,
newMessages: movedMessages,
));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently respond to this event by calling _reset, but there's a comment suggesting we might stop doing that in some cases:

  void _messagesMovedIntoNarrow() {
    // If there are some messages we don't have in [MessageStore], and they
    // occur later than the messages we have here, then we just have to
    // re-fetch from scratch.  That's always valid, so just do that always.
    // TODO in cases where we do have data to do better, do better.
    _reset();
    notifyListeners();
    fetchInitial();
  }

The test is meant to exercise the _reset case (as it says in its description); can we make it fail if it doesn't actually exercise that? Maybe:

        // check that _reset was called
        check(model).fetched.isFalse();

@PIG208
Copy link
Member Author

PIG208 commented Dec 6, 2024

Thanks for the review! This has been updated.

Copy link
Collaborator

@chrisbobbe chrisbobbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! One comment below.

Comment on lines 573 to 579
// We need the finally block always clean up regardless of errors
// occured in the try block, and returning early here is necessary
// if such cleanup must be skipped, as the fetch is considered stale.
// ignore: control_flow_in_finally
return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

      if (this.generation != generation) {
        // We need the finally block always clean up regardless of errors
        // occured in the try block, and returning early here is necessary
        // if such cleanup must be skipped, as the fetch is considered stale.
        // ignore: control_flow_in_finally
        return;

This explanation doesn't make sense to me. The cleanup "always" needs to be done…but there are times when it "must be skipped"? Your reasoning might be correct but I don't understand it yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reminder, the task is to explain why the problems flagged by control_flow_in_finally either don't exist or are acceptable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

        // This lint rule is more helpful for flagging confusing uses of
        // control flow keywords, such as `break` in a block nested in a loop,
        // or `return` a value in a finally block when a value would have been
        // returned in the `try` block.  Returning early to skip the cleanup
        // is none of these.
        // ignore: control_flow_in_finally

I was also tempted to refactor the if block into if (this.generation == generation) instead of having an early return, but the code will become more nested.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, if (this.generation == generation) might actually be nicer. I guess it does basically the same thing, but without making the reader think through a lint rule and whether it's helpful here or not. I'd like to here @gnprice's thoughts; I'll mark for his review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The version in main uses if (this.generation == generation) here instead of our usual early return, and I think this lint rule was the main reason I went for that.

One extra layer of nesting is pretty tolerable, particularly as the indentation size is just 2 spaces.

@chrisbobbe chrisbobbe added integration review Added by maintainers when PR may be ready for integration and removed maintainer review PR ready for review by Zulip maintainers labels Dec 10, 2024
@chrisbobbe chrisbobbe assigned gnprice and unassigned chrisbobbe Dec 10, 2024
@chrisbobbe chrisbobbe requested a review from gnprice December 10, 2024 03:09
@chrisbobbe
Copy link
Collaborator

Marking for Greg's review; Greg, please see in particular a question above: #1050 (comment)

@@ -528,9 +567,21 @@ class MessageListView with ChangeNotifier, _MessageSequence {

_insertAllMessages(0, fetchedMessages);
_haveOldest = result.foundOldest;
_fetchOlderCooldownBackoffMachine = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's put this line that ends a backoff series right next to the logic that starts a backoff — as an else to the if (hasFetchError).

I believe that's equivalent to this way. And that way it's somewhat easier to see how this whole cooldown state machine operates.

Comment on lines 322 to 325
final startMarker = switch ((fetchingOlder, haveOldest, fetchOlderCoolingDown)) {
(true, _, _) => const MessageListLoadingItem(MessageListDirection.older),
(_, true, _) => const MessageListHistoryStartItem(),
(_, _, true) => const MessageListLoadingItem(MessageListDirection.older),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the fact these two fetchOlder-related conditions produce the same result isn't a coincidence; so let's handle them together.

… Oh hmm but is the point that the order of these cases is significant? Is it possible to have haveOldest and fetchOlderCoolingDown both true?

If that is possible then I guess let's make that explicit. In the existing version in main, the assert(!(haveOldest && fetchingOlder)) is there precisely to reassure the reader that the order of the existing two cases doesn't matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in fact it's the case in this revision that haveOldest and fetchOlderCoolingDown can't both be true. Which is good — it seems like that'd be a confused state to be in.

So let's (a) assert that, and (b) combine the two fetchOlder-related cases here. Could have a line like

  final effectiveFetchingOlder = fetchingOlder || fetchOlderCoolingDown;

and then the same switch cases as in main.

@@ -1793,7 +1869,7 @@ void checkInvariants(MessageListView model) {
if (model.haveOldest) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apropos of my previous comment #1050 (comment) : there's a new invariant, so let's have this checkInvariants function check that invariant.

Comment on lines 573 to +579
_fetchingOlder = false;
if (hasFetchError) {
assert(!fetchOlderCoolingDown);
_fetchOlderCoolingDown = true;
unawaited((_fetchOlderCooldownBackoffMachine ??= BackoffMachine())
.wait().then((_) {
if (this.generation != generation) return;
_fetchOlderCoolingDown = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps in fact we can avoid introducing more flags in the first place. How about we fuse the new flag with the old one — or put another way, instead of adding a new flag, we adjust the semantics of the existing fetchingOlder? Like this:

Suggested change
_fetchingOlder = false;
if (hasFetchError) {
assert(!fetchOlderCoolingDown);
_fetchOlderCoolingDown = true;
unawaited((_fetchOlderCooldownBackoffMachine ??= BackoffMachine())
.wait().then((_) {
if (this.generation != generation) return;
_fetchOlderCoolingDown = false;
if (!hasFetchError) {
_fetchingOlder = false;
} else {
unawaited((_fetchOlderCooldownBackoffMachine ??= BackoffMachine())
.wait().then((_) {
if (this.generation != generation) return;
_fetchingOlder = false;

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to you both! Here's the rest of a full review.

checkNotified(count: 2);
check(model).fetchOlderCoolingDown.isTrue();

// Drop irrelevant requests with without checking,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// Drop irrelevant requests with without checking,
// Drop irrelevant requests without checking,

Comment on lines 255 to 257
// Drop irrelevant requests with without checking,
// as we are only interested in verifying if any request was sent.
connection.takeRequests();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line and its comment are a bit puzzling. We're "interested in verifying if any request was sent" — but does this verify that? It looks like it just discards that information.

// The first backoff is expected to be short enough to complete.
async.elapse(const Duration(seconds: 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The first backoff is expected to be short enough to complete.
async.elapse(const Duration(seconds: 1));
// Wait long enough that a first backoff is sure to finish.
async.elapse(const Duration(seconds: 1));

"short enough to complete" sounded to me like saying it's possible to complete it. But the point here is that this specific line definitely does complete it.

check(connection.lastRequest).isNotNull();
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this is really the line that comment is most about.

It's fine to not inspect the details of the request. Let's include a check on model.messages.length, though — that gives the reader of the test a nice quick legible confirmation that we ended up getting all the messages, and therefore must have made the usual request and handled the response in the usual way.

connection.prepare(httpStatus: 400, json: {
'result': 'error', 'code': 'BAD_REQUEST', 'msg': 'Bad request'});
await check(model.fetchOlder()).throws<ZulipApiException>();
final backoffTimer = async.pendingTimers.single;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, clever!

// check that _reset was caleed
check(model).fetched.isFalse();
check(model).fetchOlderCoolingDown.isFalse();
check(async.pendingTimers).contains(backoffTimer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
check(async.pendingTimers).contains(backoffTimer);
check(backoffTimer.isActive()).isTrue();

Comment on lines 1054 to 1057
async.elapse(const Duration(seconds: 1));
check(model).fetchOlderCoolingDown.isFalse();
check(async.pendingTimers).not((x) => x.contains(backoffTimer));
checkNotNotified();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When writing a test case, ideally we think through what scenarios would really stress the logic, and in particular would show clearly buggy symptoms if we had various plausible bugs.

So for example in the existing test just above this one, we let the fetch complete, and then check the model has no messages in its list:

        await fetchFuture;
        checkHasMessages([]);
        checkNotNotified();

(and then call checkNotNotified, mainly to ensure that checkInvariants gets called). For that test case that's good because if the fetchOlder call finished and then carried on, forgetting to check generation, then the symptom would be that it puts the fetched messages into the list.

Here, the current revision of this test checks that fetchOlderCoolingDown is still false. But that's not very informative, because although false is the value that flag had before this step, it's also exactly what the code under test would set the flag to if it did suffer from the race.

So to make a version of this test that really stresses the logic as it should, we should arrange a scenario where by the time the backoff completes, the flag has been set to true. Then we get to check here that it's still true — which confirms we don't have the bug where the code cheerfully sets it to false heedless of the new generation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a new test, "fetchOlder backoff A starts, _reset, move fetch finishes, fetchOlder backoff B starts, fetchOlder backoff A ends", for this. This also needed a refactor to make it easier to test with BackoffMachine, by offering a way to override the wait duration.

@@ -985,6 +1022,45 @@ void main() {
checkNotifiedOnce();
}));

test('fetchOlder backoff start, _reset, fetchOlder backoff ends, move fetch finishes', () => awaitFakeAsync((async) async {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's put this after both the existing "fetchOlder, _reset" test cases, instead of between the two of them. Those two are very similar and I feel like I want to read them together, comparing them.

@chrisbobbe
Copy link
Collaborator

Then, either here or as a followup, we could give the UI an "error"/"problem" state that's distinct from the loading state. If the request has failed, and especially if it's failed several times, that's an important sign that maybe the user should stop expecting it to succeed.

(Filed #1126 for this.)

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revision! All looks good now except three small comments.

_fetchOlderCoolingDown = true;
unawaited((_fetchOlderCooldownBackoffMachine ??= BackoffMachine())
.wait().then((_) {
if (this.generation != generation) return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if (this.generation != generation) return;
if (this.generation > generation) return;

(matching the similar checks elsewhere)

Comment on lines +272 to +275
await model.fetchOlder();
checkNotified(count: 2);
check(connection.takeRequests()).single;
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1107 to +1121
// When `backoffTimerA` ends, `fetchOlderCoolingDown` remains `true`
// because the backoff was from a previous generation.
async.elapse(const Duration(seconds: 1));
check(model).fetchOlderCoolingDown.isTrue();
check(backoffTimerA.isActive).isFalse();
check(backoffTimerB.isActive).isTrue();
checkNotNotified();

// When `backoffTimerB` ends, `fetchOlderCoolingDown` gets reset.
async.elapse(const Duration(seconds: 1));
check(model).fetchOlderCoolingDown.isFalse();
check(backoffTimerA.isActive).isFalse();
check(backoffTimerB.isActive).isFalse();
checkNotifiedOnce();
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new test looks good!

Comment on lines +88 to +89
final duration = debugDuration ?? _maxDuration(const Duration(microseconds: 1),
bound * Random().nextDouble());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: line too long (and in particular the value of microseconds is past 80 columns, just)

@PIG208
Copy link
Member Author

PIG208 commented Dec 11, 2024

Updated the PR. Thanks for the review!

@gnprice
Copy link
Member

gnprice commented Dec 11, 2024

Thanks! Looks good — rebased, and will watch CI.

This already holds for the existing callers.  Updating end markers
should only happen after the initial fetch.  During the initial fetch,
we have a separate loading indicator and no end markers.

Signed-off-by: Zixuan James Li <[email protected]>
This will be used for testing.

Signed-off-by: Zixuan James Li <[email protected]>
This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
@gnprice gnprice merged commit a7689da into zulip:main Dec 11, 2024
@PIG208 PIG208 deleted the pr-storm branch December 11, 2024 23:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration review Added by maintainers when PR may be ready for integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry storm on fetchOlder in message list
3 participants