
Process Repeaters, Part 1 #35033

Open · kaapstorm wants to merge 38 commits into master from nh/iter_repeaters_1
Conversation

@kaapstorm (Contributor) commented Aug 24, 2024

Technical Summary

At the moment, repeat records are processed independently, which prevents us from making better decisions about the APIs we are sending data to. For example, if a remote server is offline, we will keep trying to send new payloads to it even though all the previous payloads have failed.

The approach taken in this PR processes repeaters more intelligently, and also shares repeat record queue workers more fairly across domains. It honors rate limiting by skipping repeaters or domains while they are rate-limited.

Context:

This branch implements the feedback given on draft PR #34946, with commits squashed and rebased. 🐟 🐠 🐬

Almost all new functionality is behind a feature flag. The one change that is not: up to once a minute (less often if there are repeaters being processed), we loop through the results of RepeaterManager.all_ready().

Enabling the "process_repeaters" feature flag will switch a domain from the current approach to the new approach. The domain can be switched back to the current approach without consequences.

There are two migrations in this PR. I have kept them separate to make it easier to modify the second one. I am looking for confirmation that the indexes added in the second migration ("Index fields used by RepeaterManager.all_ready()") are correct.

Currently there are no new Datadog metrics in this PR. I expect that as we test this more / switch domains over, we will add metrics in this PR and follow-up PRs.

Feature Flag

process_repeaters

Safety Assurance

Safety story

  • Tested locally and on Staging.
  • Behavior is behind a feature flag.

Automated test coverage

Includes coverage. Please highlight any gaps.

QA Plan

QA requested. Ticket pending.

Migrations

  • The migrations in this code can be safely applied first independently of the code

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@dimagimon dimagimon added the reindex/migration Reindex or migration will be required during or before deploy label Aug 24, 2024
@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch 7 times, most recently from e06a034 to e83e925 Compare August 27, 2024 14:39
@kaapstorm kaapstorm added the product/feature-flag Change will only affect users who have a specific feature flag enabled label Aug 27, 2024
@kaapstorm kaapstorm marked this pull request as ready for review August 27, 2024 16:52
Contributor:
Indexes on boolean fields typically don't get used, but a useful approach can be to use a partial index to match the query you are trying to optimize. In this case I think it may work to create a partial index on next_attempt_at with a condition for is_paused=False.
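
As a sketch, using Django's conditional index support (model and field definitions here are assumptions from this PR; the index name is illustrative):

from django.db import models
from django.db.models import Q

class Repeater(models.Model):
    is_paused = models.BooleanField(default=False)
    next_attempt_at = models.DateTimeField(null=True, blank=True)

    class Meta:
        indexes = [
            # Partial index: only un-paused repeaters are indexed, matching
            # the filter used by RepeaterManager.all_ready().
            models.Index(
                fields=["next_attempt_at"],
                name="next_attempt_at_partial_idx",
                condition=Q(is_paused=False),
            ),
        ]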

@kaapstorm (Author) commented Aug 28, 2024:

Like so? d8d9642

Contributor:
Yes. Still worth testing that to make sure the query uses it.

corehq/motech/repeaters/tasks.py (outdated; review thread resolved)
@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch from e83e925 to db093d0 Compare August 28, 2024 17:01
@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch from db093d0 to 48c3d7c Compare August 29, 2024 00:13
@gherceg (Contributor) left a comment:
First pass, didn't dig into tests much.

corehq/motech/repeaters/models.py (outdated; review thread resolved)
Comment on lines +246 to +247
for domain, repeater_id, lock_token in iter_ready_repeater_ids_forever():
process_repeater.delay(domain, repeater_id, lock_token)
Contributor:
Do we need to be checking how long this task has been running for and exit if we are about to exceed the process_repeater_lock timeout value? I assume it is very unlikely that there are always repeaters that we can iterate over since once we kick them off, they will no longer be returned in this query, but is it safe to still have a check of some sort?

@kaapstorm (Author):
> Do we need to be checking how long this task has been running for and exit if we are about to exceed the process_repeater_lock timeout value?

process_repeater() should take a maximum of five minutes, if that repeater's requests are timing out. The lock timeout is three hours.

I expect we would only exceed the timeout if Celery was killed and the update_repeater() task was never called. So the timeout is intended to prevent a repeater from being locked forever.

I think it could be useful to use Datadog to monitor how long the task takes, but probably only to identify problematic endpoints, not to exit the task.
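
As a hedged sketch of that monitoring idea, using the plain datadog client (CommCare's own metrics helpers may differ; the metric name is illustrative):

from datadog import statsd

# Emits a timing metric for each run, so problematic endpoints show up as
# outliers in Datadog rather than causing the task to exit early.
@statsd.timed("repeaters.process_repeater.duration")
def process_repeater(domain, repeater_id, lock_token):
    ...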

Contributor:
I may not be understanding, but I'm referring to the process_repeaters task and the process_repeater_lock with the 24 hour timeout.

@kaapstorm (Author):
Sorry for my confusion. I didn't read your question properly.

I wondered about that timeout too, because there is no problem if this loop just keeps going for more than 24 hours as long as new repeat records are being sent in a reasonable time.

What do you think of not having a timeout?

corehq/motech/repeaters/tasks.py (review thread resolved)
# repeater can use to send repeat records at the same time. This is a
# guardrail to prevent one repeater from hogging repeat_record_queue
# workers and to ensure that repeaters are iterated fairly.
MAX_REPEATER_WORKERS = 144
Contributor:

This is a function of the # of celery workers, right? It is out of scope of this PR, but it would be pretty cool if we could figure out how many workers celery has dedicated to a specific queue and set this to some percentage of that (looks like that is 50% at the moment).

@kaapstorm (Author):

Yeah, I agree. It would be excellent to be able to calculate this value.

(It is 144 here because that was the number of workers at the time of writing. It was recently doubled.)
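
A rough sketch of how that calculation might look with Celery's inspection API (queue name from this PR; the 50% fraction and the hardcoded fallback are illustrative, and this assumes a prefork pool):

from celery import current_app as app

def max_repeater_workers(queue_name="repeat_record_queue", fraction=0.5):
    inspector = app.control.inspect()
    active_queues = inspector.active_queues() or {}
    stats = inspector.stats() or {}
    total = 0
    for worker, queues in active_queues.items():
        if any(q["name"] == queue_name for q in queues):
            # "max-concurrency" is the worker's pool size under prefork.
            total += stats.get(worker, {}).get("pool", {}).get("max-concurrency", 0)
    return int(total * fraction) or 144  # fall back to the hardcoded value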

@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch from 8da8851 to ceddbaa Compare October 19, 2024 15:49
@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch from ceddbaa to ea918f3 Compare October 19, 2024 16:04
@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch from d35d19f to e3bcd74 Compare October 20, 2024 00:03
Comment on lines 52 to 54
CREATE INDEX CONCURRENTLY "next_attempt_at_partial_idx"
ON "repeaters_repeater" ("next_attempt_at")
WHERE (NOT "is_deleted" AND NOT "is_paused");
@kaapstorm (Author):

I analyzed the query used by Repeater.objects.get_all_ready_ids_by_domain(), and this is the result on Staging:

> EXPLAIN ANALYZE SELECT "repeaters_repeater"."domain", "repeaters_repeater"."id_"
  FROM "repeaters_repeater"
    INNER JOIN "repeaters_repeatrecord"
    ON ("repeaters_repeater"."id_" = "repeaters_repeatrecord"."repeater_id_")
  WHERE (
    NOT "repeaters_repeater"."is_deleted"
    AND NOT "repeaters_repeater"."is_paused"
    AND (
      "repeaters_repeater"."next_attempt_at" IS NULL
      OR "repeaters_repeater"."next_attempt_at" <= '2024-10-19 22:11:36.310082'
    )
    AND "repeaters_repeatrecord"."state" IN (1, 2)
  );

                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=8423.11..141317.33 rows=52115 width=30) (actual time=61.472..1547.920 rows=638862 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=7423.11..135105.83 rows=21715 width=30) (actual time=56.214..1224.831 rows=212954 loops=3)
         Hash Cond: (repeaters_repeatrecord.repeater_id_ = repeaters_repeater.id_)
         ->  Parallel Bitmap Heap Scan on repeaters_repeatrecord  (cost=7289.92..134244.51 rows=276830 width=16) (actual time=52.249..1077.808 rows=223024 loops=3)
               Recheck Cond: (state = ANY ('{1,2}'::integer[]))
               Heap Blocks: exact=16852
               ->  Bitmap Index Scan on repeaters_repeatrecord_state_8055083b  (cost=0.00..7123.82 rows=664393 width=0) (actual time=44.255..44.256 rows=669452 loops=1)
                     Index Cond: (state = ANY ('{1,2}'::integer[]))
         ->  Hash  (cost=131.11..131.11 rows=167 width=30) (actual time=0.713..0.722 rows=163 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 18kB
               ->  Bitmap Heap Scan on repeaters_repeater  (cost=5.84..131.11 rows=167 width=30) (actual time=0.089..0.660 rows=163 loops=3)
                     Filter: ((NOT is_deleted) AND (NOT is_paused) AND ((next_attempt_at IS NULL) OR (next_attempt_at <= '2024-10-19 22:11:36.310082+00'::timestamp with time zone)))
                     Rows Removed by Filter: 40
                     Heap Blocks: exact=67
                     ->  Bitmap Index Scan on repeaters_repeater_is_deleted_08441bf0  (cost=0.00..5.80 rows=203 width=0) (actual time=0.053..0.059 rows=253 loops=3)
                           Index Cond: (is_deleted = false)
 Planning Time: 0.874 ms
 Execution Time: 1623.954 ms
(20 rows)

The "next_attempt_at_partial_idx" was chosen to match the filter in that query plan, but Postgres is not using the index.

Should we drop it, change it, or leave it?

@millerdev (Contributor) commented Oct 21, 2024:

How much data is present on staging? If not much then those two indexes may have had very similar outcomes and the choice of which to use may not have been very meaningful.

I'd check the same or a similar query on prod to see if you get the same query plan.

Do we need both indexes? Would it break other queries if we added the partial condition NOT "is_deleted" AND NOT "is_paused" on repeaters_repeatrecord_state_8055083b and eliminated the next_attempt_at_partial_idx index?

Edit: realized that those indexes are on separate tables, so the question above did not make sense. Crossed out some bits, but I think the question of whether to add a condition to the repeatrecord index still makes sense.

Edit 2: I guess maybe that does not make sense since is_deleted and is_paused are not columns on the repeatrecord table. I think the gist of my question is whether there is a way to only index rows on that table that are active and eliminate the many millions of rows that will never be touched again?

Contributor:

"repeaters_repeater"."next_attempt_at" IS NULL OR "repeaters_repeater"."next_attempt_at" <= '2024-10-19 22:11:36.310082'

This condition could be an issue. The NULL and the range check can't be accomplished with a single index operation which may be why it's not using the index.

You could try split the query into two queries and doing a union of the results. It seems counter intuitive but if the indexes get used the performance will be better:

SELECT "repeaters_repeater"."domain", "repeaters_repeater"."id_"
  FROM "repeaters_repeater"
    INNER JOIN "repeaters_repeatrecord"
    ON ("repeaters_repeater"."id_" = "repeaters_repeatrecord"."repeater_id_")
  WHERE (
    NOT "repeaters_repeater"."is_deleted"
    AND NOT "repeaters_repeater"."is_paused"
    AND "repeaters_repeater"."next_attempt_at" IS NULL
    AND "repeaters_repeatrecord"."state" IN (1, 2)
  )
UNION ALL
SELECT "repeaters_repeater"."domain", "repeaters_repeater"."id_"
  FROM "repeaters_repeater"
    INNER JOIN "repeaters_repeatrecord"
    ON ("repeaters_repeater"."id_" = "repeaters_repeatrecord"."repeater_id_")
  WHERE (
    NOT "repeaters_repeater"."is_deleted"
    AND NOT "repeaters_repeater"."is_paused"
    AND "repeaters_repeater"."next_attempt_at" <= '2024-10-19 22:11:36.310082'
    AND "repeaters_repeatrecord"."state" IN (1, 2)
  );

@kaapstorm (Author):

Interesting! Thank you!

@kaapstorm (Author):

Very nice. It uses the index.

QUERY PLAN
----------
 Gather  (cost=1078.24..298737.78 rows=91675 width=30) (actual time=1860.938..7291.760 rows=952533 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Append  (cost=78.24..288570.28 rows=91675 width=30) (actual time=1103.767..5783.887 rows=317511 loops=3)
         ->  Hash Join  (cost=132.62..143624.77 rows=31897 width=30) (actual time=1.539..2456.512 rows=317502 loops=3)
               Hash Cond: (repeaters_repeatrecord.repeater_id_ = repeaters_repeater.id_)
               ->  Parallel Seq Scan on repeaters_repeatrecord  (cost=0.00..142389.57 rows=419192 width=16) (actual time=1.080..2293.795 rows=327791 loops=3)
                     Filter: (state = ANY ('{1,2}'::integer[]))
                     Rows Removed by Filter: 1776517
               ->  Hash  (cost=130.60..130.60 rows=162 width=30) (actual time=0.357..0.359 rows=153 loops=3)
                     Buckets: 1024  Batches: 1  Memory Usage: 17kB
                     ->  Bitmap Heap Scan on repeaters_repeater  (cost=5.84..130.60 rows=162 width=30) (actual time=0.067..0.295 rows=153 loops=3)
                           Filter: ((NOT is_deleted) AND (NOT is_paused) AND (next_attempt_at IS NULL))
                           Rows Removed by Filter: 50
                           Heap Blocks: exact=67
                           ->  Bitmap Index Scan on repeaters_repeater_is_deleted_08441bf0  (cost=0.00..5.80 rows=203 width=0) (actual time=0.044..0.045 rows=253 loops=3)
                                 Index Cond: (is_deleted = false)
         ->  Hash Join  (cost=78.24..143570.39 rows=6301 width=30) (actual time=1655.248..4904.097 rows=14 loops=2)
               Hash Cond: (repeaters_repeatrecord_1.repeater_id_ = repeaters_repeater_1.id_)
               ->  Parallel Seq Scan on repeaters_repeatrecord repeaters_repeatrecord_1  (cost=0.00..142389.57 rows=419192 width=16) (actual time=0.043..4756.595 rows=491687 loops=2)
                     Filter: (state = ANY ('{1,2}'::integer[]))
                     Rows Removed by Filter: 2664776
               ->  Hash  (cost=77.84..77.84 rows=32 width=30) (actual time=39.663..39.664 rows=10 loops=2)
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                     ->  Bitmap Heap Scan on repeaters_repeater repeaters_repeater_1  (cost=4.39..77.84 rows=32 width=30) (actual time=39.626..39.647 rows=10 loops=2)
                           Recheck Cond: ((next_attempt_at <= '2024-10-19 22:11:36.310082+00'::timestamp with time zone) AND (NOT is_deleted) AND (NOT is_paused))
                           Heap Blocks: exact=7
                           ->  Bitmap Index Scan on next_attempt_at_partial_idx  (cost=0.00..4.38 rows=32 width=0) (actual time=39.577..39.577 rows=10 loops=2)
                                 Index Cond: (next_attempt_at <= '2024-10-19 22:11:36.310082+00'::timestamp with time zone)
 Planning Time: 3.019 ms
 Execution Time: 7411.459 ms
(31 rows)

Out of curiosity, I'll try a composite index and see what difference it makes. Either way, this feels like a big step in the right direction.

@kaapstorm (Author):

With a composite index, planning time is 2.5 ms faster, but execution time is 363 ms slower. The partial index wins.

@kaapstorm (Author):

I found some changes that made a significant difference. efc4dde

The details are in the commit message:

> After some analyzing, it turns out that the query is between two and three times as fast using a partial index on RepeatRecord.repeater_id where state is Pending or Failed. Performance is also improved using a partial index with Repeater.is_deleted instead of a normal B-tree index. When these two indexes are used, the next_attempt_at_partial_idx is not used.

I've dropped the next_attempt_at_partial_idx because of that, and changed the indexes for RepeatRecord.state and Repeater.is_deleted to be partial.
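
At the model level, those two partial indexes could look something like this (a sketch only; index names are illustrative, and the real models use different column names, e.g. repeater_id_):

from django.db import models
from django.db.models import Q

class Repeater(models.Model):
    is_deleted = models.BooleanField(default=False)

    class Meta:
        indexes = [
            # Partial replacement for the plain B-tree index: only live
            # (non-deleted) repeaters are indexed.
            models.Index(
                fields=["is_deleted"],
                name="repeater_is_deleted_partial_idx",
                condition=Q(is_deleted=False),
            ),
        ]

class RepeatRecord(models.Model):
    repeater = models.ForeignKey(Repeater, on_delete=models.CASCADE)
    state = models.PositiveSmallIntegerField()

    class Meta:
        indexes = [
            # Only index records that can still be sent (1=Pending,
            # 2=Failed), keeping the millions of completed rows out of
            # the index.
            models.Index(
                fields=["repeater"],
                name="repeatrecord_repeater_partial_idx",
                condition=Q(state__in=(1, 2)),
            ),
        ]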

Contributor:

I think Parallel Seq Scan on repeaters_repeatrecord is going to be a problem if that query plan is chosen on prod. In comparison, Index Scan on repeaters_repeater_... is inconsequential because there are only ~2000 (tiny number) repeaters whereas there are millions of repeat records.

@kaapstorm (Author):

Just to close the loop here: I tested the RepeatRecord.state index on Prod, and Postgres uses it for both subqueries. I tested on Friday afternoon, and the query took around 27 seconds.

@kaapstorm (Author):

I think the benefits of this PR are big enough for us to iterate on step-wise improvements after it is merged. One improvement I think we should consider is replacing the repeaters_repeatrecord table with nested partitioning / subpartitioning. Something like:

CREATE TABLE repeaters_repeatrecord_p (
  -- ...
) PARTITION BY LIST (state);

CREATE TABLE repeaters_repeatrecord_ready
  PARTITION OF repeaters_repeatrecord_p
  FOR VALUES IN (1, 2);

CREATE TABLE repeaters_repeatrecord_done
  PARTITION OF repeaters_repeatrecord_p
  DEFAULT  -- list partitioning has no "NOT IN"; a DEFAULT partition catches the rest
  PARTITION BY RANGE (registered_on);

CREATE TABLE repeaters_repeatrecord_2010
  PARTITION OF repeaters_repeatrecord_done
  FOR VALUES FROM ('2010-01-01') TO ('2011-01-01');

-- etc.

This will improve queries on repeat records that are ready to be sent, and will also allow us to drop completed repeat records after some expiry (TBD) by detaching or dropping whole partitions instead of bulk-deleting rows.

@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch 5 times, most recently from 4bfb485 to 102d814 Compare October 26, 2024 23:21
@kaapstorm kaapstorm force-pushed the nh/iter_repeaters_1 branch from 102d814 to 74137f9 Compare October 27, 2024 16:42
@dimagimon dimagimon removed the reindex/migration Reindex or migration will be required during or before deploy label Oct 29, 2024
migrations.AddIndex(
model_name="repeatrecord",
index=models.Index(
condition=models.Q(("state__in", (1, 2))),
Contributor:

I wonder what the computational differences are in Postgres for state IN (1, 2) vs state = 1 OR state = 2? I hope they are virtually the same, but it would be a shame if IN (...) did something complicated with a loop.

),
),
],

database_operations=[
# Drop `Repeater.id_deleted` index
Contributor:

Suggested change:
- # Drop `Repeater.id_deleted` index
+ # Drop `Repeater.is_deleted` index

registered_at=five_minutes_ago,
)
wait_duration = _get_wait_duration_seconds(repeat_record)
self.assertEqual(wait_duration, 300)
@millerdev (Contributor) commented Dec 5, 2024:

Could there be a race condition (resulting in a flaky test) if something took an extra second or two (like garbage collection) between when repeat_record is created and when _get_wait_duration_seconds is run? Seems unlikely, but not impossible. Maybe consider using something like freezegun.freeze_time?

The same applies to other tests in this class as well.
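
A sketch of what that could look like with freezegun (the helper names follow the hunk above; the record-creation helper is hypothetical):

from datetime import datetime, timedelta
from freezegun import freeze_time

def test_wait_duration(self):
    with freeze_time("2024-12-05 12:00:00"):
        # The clock cannot drift between creating the record and measuring
        # the wait duration, so the result is exactly 300 seconds.
        five_minutes_ago = datetime.utcnow() - timedelta(minutes=5)
        repeat_record = self.create_repeat_record(  # hypothetical helper
            registered_at=five_minutes_ago,
        )
        wait_duration = _get_wait_duration_seconds(repeat_record)
        self.assertEqual(wait_duration, 300)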

round-robin fashion, cycling through the domains. It does this so that:

* Domains and repeaters are not rate-limited unnecessarily.
* Remote APIs are not unintentionally DDoS-attacked by CommCare HQ.
Contributor:
nit: Technically this would be a "DoS", not a "DDoS".

set per repeater) and spawns tasks to send them in parallel. The results
of all the send attempts are passed to ``update_repeater()``. If all the
send attempts failed, the **repeater** (not the repeat record) is backed
off. If any send attempts succeeded, the backoff is reset.
Contributor:

Imagine a scenario where the first attempt usually succeeds, but subsequent attempts usually fail. I think we would ideally want to reduce the batch size for such a fragile repeater endpoint, but I assume we have no automated way of doing that right now. Do we have any way of detecting that scenario?
