Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENG-2031: Stuck messages fix #5

Merged
merged 14 commits into from
Oct 18, 2024
Merged

Conversation

CuriousCurmudgeon
Copy link
Collaborator

@CuriousCurmudgeon CuriousCurmudgeon commented Oct 15, 2024

Jira Ticket Link

ENG-2031

What is this?

Fixes a bug in our Broadway weighted rate limiting. If the next message in the queue was too large to process with the remaining rate limit in the interval, but the rate limit was not 0 yet, Broadway would stop processing messages. The code was assuming that the remaining rate limit would always get to 0. That was true before weights were added.

The fix was to keep track of the buffer length when no messages were emittable. That means there is a message in the buffer, but it's too large to process right now. In that case, we zero out the remaining rate limit. This causes Broadway to close the rate limit like normal and messages continue to process.

There is a lot of extra logging in here. I'm leaving it for now just in case we see more issues in prod. Once we're comfortable that this is working properly, we can circle back and remove the logging.

Check for completion

I've considered these items and added them to my PR when necessary:

  • Specs
  • Docs
  • Test coverage
  • Security
  • Works across restarts/deploys

Steps For QA / Manual Testing

Screenshots/Notes

elliotekj and others added 5 commits May 24, 2024 10:27
There was one conflict in the CI workflow. I went with upstream's
changes. This means we are building a much wider range of versions than
we care about, but that's fine for now. We can targe back to our specific
versions if necessary later.
@CuriousCurmudgeon CuriousCurmudgeon self-assigned this Oct 15, 2024
This has helped narrow down the problem. Logging confirms that the
processor does process all messages. My current hunch is that rate
limiting is getting close incorrectly in some cases when we still have
messages. Those messages eventually timeout and the channel gets closed.

The point against this is that I don't think it matches what I'm seeing
in traces in Datadog. It could be that I'm misinterpreting the traces
though.
@CuriousCurmudgeon CuriousCurmudgeon force-pushed the ENG-2031-stuck-messages-fix branch from 2b8f69d to 1eaa4ce Compare October 15, 2024 20:31
This does not work as expected yet. I'm not seeing any logs. I probably
have something wrong in the pattern match. The shape of the data
is different in the producer stage.
I can now see that only one message is being pulled off each time
rate_limit_and_buffer_messages is called. That's surprising. The buffer
length is only 1 in the scenarios I'm debugging. It does emit 10 messages
when the buffer length is longer though.
This is an ugly fix that I'm going to try and simplify. The blast
radius for these changes is larger than necessary, so I want to try
a simpler fix.
When the next message is too large to send, this version attempts to
set subtract the remaining demand from the rate limit to set it to 0.
This triggers the rate limit to close correctly.

I'm still seeing the Bandwidth SMS queue stuck locally, but the T-Mobile
queue is flowing correctly.
Because there could be left over demand that we want to zero out even
if no messages are being processed, the no-op function head needs
removed.

I also noted that we should get rid of the utility module and all of
it's logging ASAP. We just need to be confident our fix is working first
@CuriousCurmudgeon
Copy link
Collaborator Author

There is a broken unit test in the latest build. This is because of the rate limit behavior change. We'll want to fix this and add more tests for the new behavior when the next message is too large to use the entire limit.

@mus0u
Copy link
Contributor

mus0u commented Oct 17, 2024

i think this test fix i just pushed up should be OK.

@CuriousCurmudgeon CuriousCurmudgeon marked this pull request as ready for review October 18, 2024 12:57
@CuriousCurmudgeon CuriousCurmudgeon merged commit 13b2730 into main Oct 18, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants