ENG-2031: Stuck messages fix #5

CuriousCurmudgeon · 2024-10-15T14:19:47Z

Jira Ticket Link

ENG-2031

What is this?

Fixes a bug in our Broadway weighted rate limiting. If the next message in the queue was too large to process with the rate limit remaining in the interval, but the remaining rate limit was not yet 0, Broadway would stop processing messages. The code assumed that the remaining rate limit would always reach 0, which was true before weights were added.

The fix was to keep track of the buffer length when no messages were emittable. A non-empty buffer in that case means there is a message waiting, but it's too large to process right now. When that happens, we zero out the remaining rate limit. This causes Broadway to close the rate limit as normal, and messages continue to process.
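To make the failure mode concrete, here is a small Python model of the behavior described above. It is a sketch, not Broadway's Elixir code: the names `step`, `drain`, and `zero_out_when_blocked` are hypothetical, and it assumes the limiter only reopens for the next interval after the remaining limit has reached 0.

```python
from collections import deque


def step(buffer, remaining, zero_out_when_blocked):
    """Emit buffered weights that fit in `remaining`; return (emitted, remaining)."""
    emitted = []
    while buffer and buffer[0] <= remaining:
        weight = buffer.popleft()
        remaining -= weight
        emitted.append(weight)
    # Modeled fix: nothing fit, yet the buffer is not empty, so give up
    # the leftover allowance and let the limiter close as normal.
    if zero_out_when_blocked and not emitted and buffer:
        remaining = 0
    return emitted, remaining


def drain(weights, limit, zero_out_when_blocked, max_steps=100):
    """Return the weights processed before the queue empties or gets stuck."""
    buffer = deque(weights)
    remaining = limit
    processed = []
    for _ in range(max_steps):
        emitted, remaining = step(buffer, remaining, zero_out_when_blocked)
        processed.extend(emitted)
        if not buffer:
            break
        if remaining == 0:
            # Assumption: hitting 0 closes the limiter, and the next
            # interval reopens it with a full allowance.
            remaining = limit
        elif not emitted:
            # Stuck: allowance is left over, so the limiter never closes.
            break
    return processed


print(drain([4, 4, 4], limit=10, zero_out_when_blocked=False))  # [4, 4]
print(drain([4, 4, 4], limit=10, zero_out_when_blocked=True))   # [4, 4, 4]
```

With every weight equal to 1, the remaining limit always lands exactly on 0, so both modes drain the queue. That matches the note that the old assumption held before weights were added. The `max_steps` guard is only there because, in this model, a message heavier than the whole limit would otherwise loop forever.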

There is a lot of extra logging in here. I'm leaving it for now just in case we see more issues in prod. Once we're comfortable that this is working properly, we can circle back and remove the logging.

Check for completion

I've considered these items and added them to my PR when necessary:

Steps For QA / Manual Testing

Screenshots/Notes

There was one conflict in the CI workflow. I went with upstream's changes. This means we are building a much wider range of versions than we care about, but that's fine for now. We can target our specific versions again later if necessary.

This has helped narrow down the problem. Logging confirms that the processor does process all messages. My current hunch is that rate limiting is getting closed incorrectly in some cases when we still have messages. Those messages eventually time out and the channel gets closed. The point against this is that I don't think it matches what I'm seeing in traces in Datadog. It could be that I'm misinterpreting the traces though.

This does not work as expected yet. I'm not seeing any logs. I probably have something wrong in the pattern match. The shape of the data is different in the producer stage.

I can now see that only one message is being pulled off each time rate_limit_and_buffer_messages is called. That's surprising. The buffer length is only 1 in the scenarios I'm debugging. It does emit 10 messages when the buffer length is longer though.

This is an ugly fix that I'm going to try and simplify. The blast radius for these changes is larger than necessary, so I want to try a simpler fix.

When the next message is too large to send, this version attempts to subtract the remaining demand from the rate limit to set it to 0. This triggers the rate limit to close correctly. I'm still seeing the Bandwidth SMS queue stuck locally, but the T-Mobile queue is flowing correctly.

Because there could be leftover demand that we want to zero out even if no messages are being processed, the no-op function head needs to be removed. I also noted that we should get rid of the utility module and all of its logging ASAP. We just need to be confident our fix is working first.

CuriousCurmudgeon · 2024-10-17T02:38:59Z

There is a broken unit test in the latest build. This is because of the rate limit behavior change. We'll want to fix this and add more tests for the new behavior when the next message is too large to fit in the remaining limit.

mus0u · 2024-10-17T23:57:08Z

i think this test fix i just pushed up should be OK.

elliotekj and others added 5 commits May 24, 2024 10:27

add off_broadway_memory (#338)

3ca2da9

Update Erlang/Elixir in CI (#339)

90defc8

Support GenStage v1.2.0

83a564b

Release v1.1.0

3698e41

CuriousCurmudgeon self-assigned this Oct 15, 2024

CuriousCurmudgeon force-pushed the ENG-2031-stuck-messages-fix branch from 2b8f69d to 1eaa4ce Compare October 15, 2024 20:31

CuriousCurmudgeon added 7 commits October 15, 2024 16:57

ENG-2031: Attempting to add more logging to the producer stage

ff534c0

This does not work as expected yet. I'm not seeing any logs. I probably have something wrong in the pattern match. The shape of the data is different in the producer stage.

ENG-2031: Typo fix in new logging

1e9dc83

ENG-2031: WIP to try and get rate limiting to close correctly

a5b9576

ENG-2031: Fixed the bug!

0a525bd

This is an ugly fix that I'm going to try and simplify. The blast radius for these changes is larger than necessary, so I want to try a simpler fix.

mus0u approved these changes Oct 17, 2024

allow demand 0 in rate_limit/2

0c97ece

CuriousCurmudgeon marked this pull request as ready for review October 18, 2024 12:57

CuriousCurmudgeon merged commit 13b2730 into main Oct 18, 2024
2 checks passed
