ENG-2031: Stuck messages fix #5
Merged
Conversation
There was one conflict in the CI workflow. I went with upstream's changes. This means we are building a much wider range of versions than we care about, but that's fine for now. We can target our specific versions again later if necessary.
This has helped narrow down the problem. Logging confirms that the processor does process all messages. My current hunch is that the rate limit is getting closed incorrectly in some cases while we still have messages. Those messages eventually time out and the channel gets closed. The point against this is that I don't think it matches what I'm seeing in the Datadog traces. It could be that I'm misinterpreting the traces, though.
CuriousCurmudgeon force-pushed the ENG-2031-stuck-messages-fix branch from 2b8f69d to 1eaa4ce on October 15, 2024 at 20:31.
This does not work as expected yet. I'm not seeing any logs. I probably have something wrong in the pattern match. The shape of the data is different in the producer stage.
I can now see that only one message is being pulled off each time rate_limit_and_buffer_messages is called. That's surprising. The buffer length is only 1 in the scenarios I'm debugging. It does emit 10 messages when the buffer length is longer though.
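A minimal sketch of the kind of instrumentation used here, assuming hypothetical names; this is not Broadway's internals, just the shape of the logging that surfaced the buffer length and emitted count per rate-limit pass:

```elixir
# Hypothetical debug helper (module and function names are illustrative).
defmodule RateLimitDebug do
  require Logger

  # `emitted` and `buffer` are treated as plain lists for illustration;
  # logs how many messages came off the buffer in a single pass.
  def log_pass(emitted, buffer) do
    Logger.debug(
      "rate limit pass: emitted=#{length(emitted)} buffer_length=#{length(buffer)}"
    )

    {emitted, buffer}
  end
end
```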
This is an ugly fix that I'm going to try to simplify. The blast radius for these changes is larger than necessary, so I want to find a smaller fix.
When the next message is too large to send, this version subtracts the remaining demand from the rate limit to bring it to 0. This triggers the rate limit to close correctly. I'm still seeing the Bandwidth SMS queue stuck locally, but the T-Mobile queue is flowing correctly.
Because there could be leftover demand that we want to zero out even if no messages are being processed, the no-op function head needs to be removed. I also noted that we should get rid of the utility module and all of its logging ASAP. We just need to be confident our fix is working first.
There is a broken unit test in the latest build because of the rate limit behavior change. We'll want to fix it and add more tests for the new behavior when the next message is too large to use the entire limit.
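A hypothetical ExUnit sketch of the behavior we want covered; the module and helper names are illustrative, not the real ones in this repo:

```elixir
defmodule WeightedRateLimitTest do
  use ExUnit.Case, async: true

  test "zeroes the remaining limit when the next message is too heavy" do
    # The fix: rather than stalling with a positive-but-unusable limit,
    # the limit is forced to 0 so the normal "limit closed" path runs.
    assert apply_weighted_limit(3, 5) == 0

    # When the message fits, the weighted amount is simply consumed.
    assert apply_weighted_limit(10, 4) == 6
  end

  defp apply_weighted_limit(remaining, weight) when weight > remaining, do: 0
  defp apply_weighted_limit(remaining, weight), do: remaining - weight
end
```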
mus0u approved these changes on Oct 17, 2024.
I think this test fix I just pushed up should be OK.
Jira Ticket Link
ENG-2031
What is this?
Fixes a bug in our Broadway weighted rate limiting. If the next message in the queue was too large to process with the remaining rate limit in the interval, but the rate limit was not yet 0, Broadway would stop processing messages. The code assumed that the remaining rate limit would always reach 0, which was true before weights were added.
The fix was to keep track of the buffer length when no messages were emittable. That means there is a message in the buffer, but it's too large to process right now. In that case, we zero out the remaining rate limit. This causes Broadway to close the rate limit like normal and messages continue to process.
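A minimal sketch of that idea, not the actual patch; the module and argument names are illustrative:

```elixir
defmodule WeightedRateLimitSketch do
  # emitted_count:   messages emitted in this pass
  # buffer_length:   messages still waiting in the buffer
  # remaining_limit: weighted allowance left in the current interval

  # Nothing was emittable but messages are buffered: the next one is too
  # heavy for what's left, so zero the limit so the rate limit closes and
  # later reopens as usual instead of stalling.
  def adjust_remaining(0, buffer_length, _remaining_limit) when buffer_length > 0, do: 0

  # Otherwise keep whatever allowance is left for the rest of the interval.
  def adjust_remaining(_emitted_count, _buffer_length, remaining_limit), do: remaining_limit
end
```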
There is a lot of extra logging in here. I'm leaving it for now just in case we see more issues in prod. Once we're comfortable that this is working properly, we can circle back and remove the logging.
Check for completion
I've considered these items and added them to my PR when necessary:
Steps For QA / Manual Testing
Screenshots/Notes