
Process Repeaters, Part 1 #35033

Open · wants to merge 38 commits into master from nh/iter_repeaters_1
Changes from 25 commits (38 commits total)
50c848a
Add `PROCESS_REPEATERS` toggle
kaapstorm Aug 23, 2024
0db6a6c
`process_repeaters()` task
kaapstorm Aug 23, 2024
e36296c
`get_repeater_lock()`
kaapstorm Aug 23, 2024
aeb10ba
`iter_ready_repeater_ids_once()`
kaapstorm Aug 23, 2024
01e4bc7
Skip rate-limited repeaters
kaapstorm Aug 23, 2024
db2fec2
`process_repeater()` task
kaapstorm Aug 23, 2024
85b952e
Add tests
kaapstorm Aug 4, 2024
c28c11b
`Repeater.max_workers` field
kaapstorm Aug 24, 2024
d8d9642
Index fields used by `RepeaterManager.all_ready()`
kaapstorm Aug 28, 2024
48c3d7c
Use quickcache. Prefilter enabled domains.
kaapstorm Aug 28, 2024
418ed3a
Check randomly-enabled domains
kaapstorm Aug 29, 2024
85bbfa3
Forward new records for synchronous case repeaters
kaapstorm Aug 29, 2024
d1119bb
Add explanatory docstrings and comments
kaapstorm Sep 9, 2024
03b26cf
get_redis_lock() ... acquire(): No TypeError ?!
kaapstorm Sep 9, 2024
de27ba0
Drop unnecessary `iter_domain_repeaters()`
kaapstorm Sep 10, 2024
4955ef4
Don't quickcache `domain_can_forward_now()`
kaapstorm Sep 24, 2024
59aae71
Migration to create indexes concurrently
kaapstorm Sep 24, 2024
f40e6f4
Merge branch 'master' into nh/iter_repeaters_1
orangejenny Oct 4, 2024
b70fc52
Add comment
kaapstorm Oct 19, 2024
7e65b3b
Don't squash BaseExceptions
kaapstorm Oct 19, 2024
4c41896
Drop timeout for `process_repeater_lock`.
kaapstorm Oct 19, 2024
30d4a6f
Add metric for monitoring health
kaapstorm Oct 19, 2024
fc0f174
Merge branch 'master' into nh/iter_repeaters_1
kaapstorm Oct 19, 2024
e32b465
Resolve migration conflict, fix index
kaapstorm Oct 19, 2024
e3bcd74
Fix metric
kaapstorm Oct 19, 2024
efc4dde
Change indexes
kaapstorm Oct 22, 2024
bd37a00
Add one more index. Use UNION ALL queries.
kaapstorm Oct 23, 2024
a448b9e
Don't report attempt too soon
kaapstorm Oct 26, 2024
4321fb7
Add metrics
kaapstorm Oct 26, 2024
74137f9
Improve backoff logic
kaapstorm Oct 26, 2024
968a922
Update comments
kaapstorm Oct 28, 2024
4fd14a0
Show "Next attempt at" in Forwarders page
kaapstorm Oct 28, 2024
07320b9
Merge branch 'master' into nh/iter_repeaters_1
kaapstorm Oct 28, 2024
b1eb171
Fixes migration
kaapstorm Oct 28, 2024
2463348
Merge remote-tracking branch 'origin/master' into nh/iter_repeaters_1
kaapstorm Oct 29, 2024
8a7f343
Add comment on other `True` return value
kaapstorm Nov 19, 2024
0f72ba9
Count repeater backoffs
kaapstorm Dec 2, 2024
7f18e52
Add documentation
kaapstorm Dec 2, 2024
31 changes: 31 additions & 0 deletions corehq/ex-submodules/dimagi/utils/couch/tests/test_redis_lock.py
@@ -0,0 +1,31 @@
import uuid

from redis.lock import Lock as RedisLock

from dimagi.utils.couch import get_redis_lock

from corehq.tests.noseplugins.redislocks import TestLock
from corehq.util.metrics.lockmeter import MeteredLock


def test_get_redis_lock_with_token():
    lock_name = 'test-1'
    metered_lock = get_redis_lock(key=lock_name, name=lock_name, timeout=1)
    assert isinstance(metered_lock, MeteredLock)
    # metered_lock.lock is a TestLock instance because of
    # corehq.tests.noseplugins.redislocks.RedisLockTimeoutPlugin
    test_lock = metered_lock.lock
    assert isinstance(test_lock, TestLock)
    redis_lock = test_lock.lock
    assert isinstance(redis_lock, RedisLock)

    token = uuid.uuid1().hex
    acquired = redis_lock.acquire(blocking=False, token=token)
    assert acquired

    # What we want to be able to do in a separate process:
    metered_lock_2 = get_redis_lock(key=lock_name, name=lock_name, timeout=1)
    redis_lock_2 = metered_lock_2.lock.lock
    redis_lock_2.local.token = token
    # Does not raise LockNotOwnedError:
    redis_lock_2.release()
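The cross-process handoff this test exercises can be sketched without a Redis server. The `TokenLock` below is a hypothetical stand-in that only mirrors redis-py's token-ownership rule (whoever presents the owner's token may release the lock); it is not the real `redis.lock.Lock` API:

```python
import uuid

class TokenLock:
    """Toy lock mirroring redis-py's ownership rule: whoever presents
    the owner's token may release the lock."""
    _owners = {}  # class-level dict standing in for the shared Redis server

    def __init__(self, name):
        self.name = name
        self.token = None  # per-instance, like redis.lock.Lock().local.token

    def acquire(self, token=None):
        if self.name in self._owners:
            return False  # already held
        self.token = token or uuid.uuid1().hex
        self._owners[self.name] = self.token
        return True

    def release(self):
        if self._owners.get(self.name) != self.token:
            # redis-py raises LockNotOwnedError here
            raise RuntimeError('lock not owned')
        del self._owners[self.name]

lock_a = TokenLock('test-1')
assert lock_a.acquire(token='abc123')

# "Separate process": a fresh instance can release once given the token
lock_b = TokenLock('test-1')
lock_b.token = 'abc123'
lock_b.release()  # does not raise
```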
2 changes: 2 additions & 0 deletions corehq/motech/repeaters/const.py
@@ -13,6 +13,8 @@
CHECK_REPEATERS_INTERVAL = timedelta(minutes=5)
CHECK_REPEATERS_PARTITION_COUNT = settings.CHECK_REPEATERS_PARTITION_COUNT
CHECK_REPEATERS_KEY = 'check-repeaters-key'
PROCESS_REPEATERS_INTERVAL = timedelta(minutes=1)
PROCESS_REPEATERS_KEY = 'process-repeaters-key'
ENDPOINT_TIMER = 'endpoint_timer'
# Number of attempts to an online endpoint before cancelling payload
MAX_ATTEMPTS = 3
16 changes: 16 additions & 0 deletions corehq/motech/repeaters/migrations/0015_repeater_max_workers.py
@@ -0,0 +1,16 @@
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("repeaters", "0014_alter_repeater_request_method"),
    ]

    operations = [
        migrations.AddField(
            model_name="repeater",
            name="max_workers",
            field=models.IntegerField(default=0),
        ),
    ]
62 changes: 62 additions & 0 deletions corehq/motech/repeaters/migrations/0016_add_indexes.py
@@ -0,0 +1,62 @@
from django.db import migrations, models


class Migration(migrations.Migration):
    atomic = False

    dependencies = [
        ("repeaters", "0015_repeater_max_workers"),
    ]

    operations = [
        migrations.SeparateDatabaseAndState(
            state_operations=[
                migrations.AlterField(
                    model_name="repeatrecord",
                    name="state",
                    field=models.PositiveSmallIntegerField(
                        choices=[
                            (1, "Pending"),
                            (2, "Failed"),
                            (4, "Succeeded"),
                            (8, "Cancelled"),
                            (16, "Empty"),
                            (32, "Invalid Payload"),
                        ],
                        db_index=True,
                        default=1,
                    ),
                ),
                migrations.AddIndex(
                    model_name="repeater",
                    index=models.Index(
                        condition=models.Q(("is_deleted", False), ("is_paused", False)),
                        fields=["next_attempt_at"],
                        name="next_attempt_at_partial_idx",
                    ),
                ),
            ],

            database_operations=[
                migrations.RunSQL(
                    sql="""
                        CREATE INDEX CONCURRENTLY "repeaters_repeatrecord_state_8055083b"
                        ON "repeaters_repeatrecord" ("state");
                    """,
                    reverse_sql="""
                        DROP INDEX CONCURRENTLY "repeaters_repeatrecord_state_8055083b";
                    """
                ),
                migrations.RunSQL(
                    sql="""
                        CREATE INDEX CONCURRENTLY "next_attempt_at_partial_idx"
                        ON "repeaters_repeater" ("next_attempt_at")
                        WHERE (NOT "is_deleted" AND NOT "is_paused");
kaapstorm (Contributor Author) commented:

I analyzed the query used by Repeater.objects.get_all_ready_ids_by_domain(), and this is the result on Staging:

> EXPLAIN ANALYZE SELECT "repeaters_repeater"."domain", "repeaters_repeater"."id_"
  FROM "repeaters_repeater"
    INNER JOIN "repeaters_repeatrecord"
    ON ("repeaters_repeater"."id_" = "repeaters_repeatrecord"."repeater_id_")
  WHERE (
    NOT "repeaters_repeater"."is_deleted"
    AND NOT "repeaters_repeater"."is_paused"
    AND (
      "repeaters_repeater"."next_attempt_at" IS NULL
      OR "repeaters_repeater"."next_attempt_at" <= '2024-10-19 22:11:36.310082'
    )
    AND "repeaters_repeatrecord"."state" IN (1, 2)
  );

                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=8423.11..141317.33 rows=52115 width=30) (actual time=61.472..1547.920 rows=638862 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=7423.11..135105.83 rows=21715 width=30) (actual time=56.214..1224.831 rows=212954 loops=3)
         Hash Cond: (repeaters_repeatrecord.repeater_id_ = repeaters_repeater.id_)
         ->  Parallel Bitmap Heap Scan on repeaters_repeatrecord  (cost=7289.92..134244.51 rows=276830 width=16) (actual time=52.249..1077.808 rows=223024 loops=3)
               Recheck Cond: (state = ANY ('{1,2}'::integer[]))
               Heap Blocks: exact=16852
               ->  Bitmap Index Scan on repeaters_repeatrecord_state_8055083b  (cost=0.00..7123.82 rows=664393 width=0) (actual time=44.255..44.256 rows=669452 loops=1)
                     Index Cond: (state = ANY ('{1,2}'::integer[]))
         ->  Hash  (cost=131.11..131.11 rows=167 width=30) (actual time=0.713..0.722 rows=163 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 18kB
               ->  Bitmap Heap Scan on repeaters_repeater  (cost=5.84..131.11 rows=167 width=30) (actual time=0.089..0.660 rows=163 loops=3)
                     Filter: ((NOT is_deleted) AND (NOT is_paused) AND ((next_attempt_at IS NULL) OR (next_attempt_at <= '2024-10-19 22:11:36.310082+00'::timestamp with time zone)))
                     Rows Removed by Filter: 40
                     Heap Blocks: exact=67
                     ->  Bitmap Index Scan on repeaters_repeater_is_deleted_08441bf0  (cost=0.00..5.80 rows=203 width=0) (actual time=0.053..0.059 rows=253 loops=3)
                           Index Cond: (is_deleted = false)
 Planning Time: 0.874 ms
 Execution Time: 1623.954 ms
(20 rows)

The "next_attempt_at_partial_idx" was chosen to match the filter in that query plan, but Postgres is not using the index.

Should we drop it, change it, or leave it?

millerdev (Contributor) commented on Oct 21, 2024:

How much data is present on staging? If not much then those two indexes may have had very similar outcomes and the choice of which to use may not have been very meaningful.

I'd check the same or a similar query on prod to see if you get the same query plan.

Do we need both indexes? Would it break other queries if we added the partial condition NOT "is_deleted" AND NOT "is_paused" on repeaters_repeatrecord_state_8055083b and eliminated the next_attempt_at_partial_idx index?

Edit: realized that those indexes are on separate tables, so the question above did not make sense. Crossed out some bits, but I think the question of whether to add a condition to the repeatrecord index still makes sense.

Edit 2: I guess maybe that does not make sense since is_deleted and is_paused are not columns on the repeatrecord table. I think the gist of my question is whether there is a way to only index rows on that table that are active and eliminate the many millions of rows that will never be touched again?

Contributor commented:

"repeaters_repeater"."next_attempt_at" IS NULL OR "repeaters_repeater"."next_attempt_at" <= '2024-10-19 22:11:36.310082'

This condition could be an issue. The NULL check and the range check can't be accomplished with a single index operation, which may be why it's not using the index.

You could try splitting the query into two queries and doing a union of the results. It seems counterintuitive, but if the indexes get used, the performance will be better:

SELECT "repeaters_repeater"."domain", "repeaters_repeater"."id_"
  FROM "repeaters_repeater"
    INNER JOIN "repeaters_repeatrecord"
    ON ("repeaters_repeater"."id_" = "repeaters_repeatrecord"."repeater_id_")
  WHERE (
    NOT "repeaters_repeater"."is_deleted"
    AND NOT "repeaters_repeater"."is_paused"
    AND "repeaters_repeater"."next_attempt_at" IS NULL
    AND "repeaters_repeatrecord"."state" IN (1, 2)
  )
UNION ALL
SELECT "repeaters_repeater"."domain", "repeaters_repeater"."id_"
  FROM "repeaters_repeater"
    INNER JOIN "repeaters_repeatrecord"
    ON ("repeaters_repeater"."id_" = "repeaters_repeatrecord"."repeater_id_")
  WHERE (
    NOT "repeaters_repeater"."is_deleted"
    AND NOT "repeaters_repeater"."is_paused"
    AND "repeaters_repeater"."next_attempt_at" <= '2024-10-19 22:11:36.310082'
    AND "repeaters_repeatrecord"."state" IN (1, 2)
  );
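The rewrite works because the two subqueries split the original OR condition into disjoint cases (`next_attempt_at IS NULL` vs. `next_attempt_at <= now`), so `UNION ALL` reproduces exactly the original rows with no duplicates. A toy Python illustration of that equivalence (the data is made up):

```python
from datetime import datetime

now = datetime(2024, 10, 19, 22, 11, 36)
repeaters = [
    {'id_': 'a', 'next_attempt_at': None},
    {'id_': 'b', 'next_attempt_at': datetime(2024, 10, 1)},
    {'id_': 'c', 'next_attempt_at': datetime(2025, 1, 1)},  # not ready
]

# Original predicate: IS NULL OR <= now
or_result = [r['id_'] for r in repeaters
             if r['next_attempt_at'] is None or r['next_attempt_at'] <= now]

# The two disjoint cases of the UNION ALL
is_null = [r['id_'] for r in repeaters if r['next_attempt_at'] is None]
in_past = [r['id_'] for r in repeaters
           if r['next_attempt_at'] is not None and r['next_attempt_at'] <= now]

# No row can satisfy both cases, so concatenation introduces no duplicates
assert sorted(or_result) == sorted(is_null + in_past)
```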

kaapstorm (Contributor Author) commented:

Interesting! Thank you!

kaapstorm (Contributor Author) commented:

Very nice. It uses the index.

QUERY PLAN
----------
 Gather  (cost=1078.24..298737.78 rows=91675 width=30) (actual time=1860.938..7291.760 rows=952533 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Append  (cost=78.24..288570.28 rows=91675 width=30) (actual time=1103.767..5783.887 rows=317511 loops=3)
         ->  Hash Join  (cost=132.62..143624.77 rows=31897 width=30) (actual time=1.539..2456.512 rows=317502 loops=3)
               Hash Cond: (repeaters_repeatrecord.repeater_id_ = repeaters_repeater.id_)
               ->  Parallel Seq Scan on repeaters_repeatrecord  (cost=0.00..142389.57 rows=419192 width=16) (actual time=1.080..2293.795 rows=327791 loops=3)
                     Filter: (state = ANY ('{1,2}'::integer[]))
                     Rows Removed by Filter: 1776517
               ->  Hash  (cost=130.60..130.60 rows=162 width=30) (actual time=0.357..0.359 rows=153 loops=3)
                     Buckets: 1024  Batches: 1  Memory Usage: 17kB
                     ->  Bitmap Heap Scan on repeaters_repeater  (cost=5.84..130.60 rows=162 width=30) (actual time=0.067..0.295 rows=153 loops=3)
                           Filter: ((NOT is_deleted) AND (NOT is_paused) AND (next_attempt_at IS NULL))
                           Rows Removed by Filter: 50
                           Heap Blocks: exact=67
                           ->  Bitmap Index Scan on repeaters_repeater_is_deleted_08441bf0  (cost=0.00..5.80 rows=203 width=0) (actual time=0.044..0.045 rows=253 loops=3)
                                 Index Cond: (is_deleted = false)
         ->  Hash Join  (cost=78.24..143570.39 rows=6301 width=30) (actual time=1655.248..4904.097 rows=14 loops=2)
               Hash Cond: (repeaters_repeatrecord_1.repeater_id_ = repeaters_repeater_1.id_)
               ->  Parallel Seq Scan on repeaters_repeatrecord repeaters_repeatrecord_1  (cost=0.00..142389.57 rows=419192 width=16) (actual time=0.043..4756.595 rows=491687 loops=2)
                     Filter: (state = ANY ('{1,2}'::integer[]))
                     Rows Removed by Filter: 2664776
               ->  Hash  (cost=77.84..77.84 rows=32 width=30) (actual time=39.663..39.664 rows=10 loops=2)
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB
                     ->  Bitmap Heap Scan on repeaters_repeater repeaters_repeater_1  (cost=4.39..77.84 rows=32 width=30) (actual time=39.626..39.647 rows=10 loops=2)
                           Recheck Cond: ((next_attempt_at <= '2024-10-19 22:11:36.310082+00'::timestamp with time zone) AND (NOT is_deleted) AND (NOT is_paused))
                           Heap Blocks: exact=7
                           ->  Bitmap Index Scan on next_attempt_at_partial_idx  (cost=0.00..4.38 rows=32 width=0) (actual time=39.577..39.577 rows=10 loops=2)
                                 Index Cond: (next_attempt_at <= '2024-10-19 22:11:36.310082+00'::timestamp with time zone)
 Planning Time: 3.019 ms
 Execution Time: 7411.459 ms
(31 rows)

Out of curiosity, I'll try a composite index and see what difference it makes. Either way, this feels like a big step in the right direction.

kaapstorm (Contributor Author) commented:

With a composite index, planning time is 2.5 ms faster, but execution time is 363 ms slower. The partial index wins.

kaapstorm (Contributor Author) commented:

I found some changes that made a significant difference. efc4dde

The details are in the commit message:

After some analysis, it turns out that the query is between two and three times as fast using a partial index on RepeatRecord.repeater_id where state is Pending or Failed. Performance is also improved by using a partial index on Repeater.is_deleted instead of a normal B-tree index. When these two indexes are used, the next_attempt_at_partial_idx is not used.

I've dropped the next_attempt_at_partial_idx because of that, and changed the indexes for Repeater.state and Repeater.is_deleted to be partial.

Contributor commented:

I think Parallel Seq Scan on repeaters_repeatrecord is going to be a problem if that query plan is chosen on prod. In comparison, Index Scan on repeaters_repeater_... is inconsequential because there are only ~2000 (tiny number) repeaters whereas there are millions of repeat records.

kaapstorm (Contributor Author) commented:

Just to close the loop here: I tested the RepeatRecord.state index on Prod, and Postgres uses it for both subqueries. I tested on Friday afternoon, and the query took around 27 seconds.

kaapstorm (Contributor Author) commented:

I think the benefits of this PR are big enough for us to iterate on step-wise improvements after it is merged. One improvement I think we should consider is replacing the repeaters_repeatrecord table with nested partitioning / subpartitioning. Something like:

CREATE TABLE repeaters_repeatrecord_p (
  -- ...
) PARTITION BY LIST (state);

CREATE TABLE repeaters_repeatrecord_ready
  PARTITION OF repeaters_repeatrecord_p
  FOR VALUES IN (1, 2);

CREATE TABLE repeaters_repeatrecord_done
  PARTITION OF repeaters_repeatrecord_p
  DEFAULT  -- i.e. every state not in (1, 2); list partitions have no "NOT IN"
  PARTITION BY RANGE (registered_on);

CREATE TABLE repeaters_repeatrecord_2010
  PARTITION OF repeaters_repeatrecord_done
  FOR VALUES FROM ('2010-01-01') TO ('2011-01-01');

-- etc.

This will improve queries on repeat records ready to be sent, and also allow us to drop completed repeat records after some expiry (TBD).

                    """,
                    reverse_sql="""
                        DROP INDEX CONCURRENTLY "next_attempt_at_partial_idx";
                    """
                ),
            ]
        )
    ]
102 changes: 89 additions & 13 deletions corehq/motech/repeaters/models.py
@@ -73,6 +73,7 @@
from http import HTTPStatus
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

from django.conf import settings
from django.db import models, router
from django.db.models.base import Deferred
from django.dispatch import receiver
@@ -245,10 +246,19 @@ def all_ready(self):
        repeat_records_ready_to_send = models.Q(
            repeat_records__state__in=(State.Pending, State.Fail)
        )
        return (self.get_queryset()
                .filter(not_paused)
                .filter(next_attempt_not_in_the_future)
                .filter(repeat_records_ready_to_send))
        return (
            self.get_queryset()
            .filter(not_paused)
            .filter(next_attempt_not_in_the_future)
            .filter(repeat_records_ready_to_send)
        )

    def get_all_ready_ids_by_domain(self):
        results = defaultdict(list)
        query = self.all_ready().values_list('domain', 'id')
        for (domain, id_uuid) in query.all():
            results[domain].append(id_uuid.hex)
        return results
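For reference, the grouping in `get_all_ready_ids_by_domain()` collapses (domain, id) pairs into a dict of hex ID lists; a standalone sketch with made-up UUIDs:

```python
import uuid
from collections import defaultdict

# Stand-in for the ('domain', 'id') values_list() rows
rows = [
    ('alpha', uuid.UUID('00000000-0000-0000-0000-000000000001')),
    ('alpha', uuid.UUID('00000000-0000-0000-0000-000000000002')),
    ('beta',  uuid.UUID('00000000-0000-0000-0000-000000000003')),
]

results = defaultdict(list)
for domain, id_uuid in rows:
    # .hex drops the dashes, matching how repeater IDs are passed around
    results[domain].append(id_uuid.hex)

assert results['alpha'] == [
    '00000000000000000000000000000001',
    '00000000000000000000000000000002',
]
```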

    def get_queryset(self):
        repeater_obj = self.model()
@@ -275,6 +285,7 @@ class Repeater(RepeaterSuperProxy):
    is_paused = models.BooleanField(default=False)
    next_attempt_at = models.DateTimeField(null=True, blank=True)
    last_attempt_at = models.DateTimeField(null=True, blank=True)
    max_workers = models.IntegerField(default=0)
Contributor commented:

What do you think of spinning this feature off in a separate PR (if we decided it's needed)? We have a lot of knobs to turn already, and I wonder if we'll need this one?

kaapstorm (Contributor Author) commented:

We needed this one years ago. OpenMRS integrations involve Events and Observations, and it is really useful if we can send repeat records in the order in which their forms were submitted. More generally, setting max_workers to 1 allows us to ensure that repeat records are sent chronologically.

And setting it to [all of the workers] is an insurance policy for this PR to ensure that we can still handle the highest volume for repeaters that need it. I'd sleep easier if this is included.
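The max_workers semantics described above reduce to a single clamp: 0 falls back to a project default, 1 forces chronological sending, and nothing may exceed a hard cap. A sketch with illustrative values (the real numbers live in `settings.DEFAULT_REPEATER_WORKERS` and `settings.MAX_REPEATER_WORKERS`):

```python
DEFAULT_REPEATER_WORKERS = 7   # illustrative values, not the real settings
MAX_REPEATER_WORKERS = 144

def num_workers(max_workers):
    # 0 (the field default) falls back to the project-wide default;
    # no repeater may exceed the hard cap.
    return min(max_workers or DEFAULT_REPEATER_WORKERS, MAX_REPEATER_WORKERS)

assert num_workers(0) == 7       # unset: use the default
assert num_workers(1) == 1       # chronological ordering: a single worker
assert num_workers(1000) == 144  # capped
```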

    options = JSONField(default=dict)
    connection_settings_id = models.IntegerField(db_index=True)
    is_deleted = models.BooleanField(default=False, db_index=True)
@@ -286,6 +297,13 @@ class Repeater(RepeaterSuperProxy):

class Meta:
db_table = 'repeaters_repeater'
indexes = [
models.Index(
fields=['next_attempt_at'],
condition=models.Q(("is_deleted", False), ("is_paused", False)),
name='next_attempt_at_partial_idx',
),
]

    payload_generator_classes = ()

@@ -365,9 +383,24 @@ def _repeater_type(cls):

    @property
    def repeat_records_ready(self):
        return self.repeat_records.filter(state__in=(State.Pending, State.Fail))
        """
        A QuerySet of repeat records in the Pending or Fail state in the
        order in which they were registered
        """
        return (
            self.repeat_records
            .filter(state__in=(State.Pending, State.Fail))
            .order_by('registered_at')
        )

    def set_next_attempt(self):
    @property
    def num_workers(self):
        # If num_workers is 1, repeat records are sent in the order in
        # which they were registered.
        num_workers = self.max_workers or settings.DEFAULT_REPEATER_WORKERS
        return min(num_workers, settings.MAX_REPEATER_WORKERS)

    def set_backoff(self):
        now = datetime.utcnow()
        interval = _get_retry_interval(self.last_attempt_at, now)
        self.last_attempt_at = now
@@ -380,8 +413,12 @@ def set_next_attempt(self):
            next_attempt_at=now + interval,
        )

    def reset_next_attempt(self):
    def reset_backoff(self):
        if self.last_attempt_at or self.next_attempt_at:
            # `_get_retry_interval()` implements exponential backoff by
            # multiplying the previous interval by 3. Set last_attempt_at
            # to None so that the next time we need to back off, we
            # know it is the first interval.
            self.last_attempt_at = None
kaapstorm marked this conversation as resolved.
            self.next_attempt_at = None
        # Avoid a possible race condition with self.pause(), etc.
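A sketch of the backoff arithmetic that comment describes: with `last_attempt_at` reset to None the interval restarts at a minimum; otherwise the previous interval (`now - last_attempt_at`) is tripled, up to a cap. The constants here are illustrative, not the real `_get_retry_interval()` values:

```python
from datetime import datetime, timedelta

MIN_RETRY_WAIT = timedelta(minutes=5)  # illustrative, not the real constant
MAX_RETRY_WAIT = timedelta(days=7)     # illustrative, not the real constant

def get_retry_interval(last_attempt_at, now):
    if last_attempt_at is None:
        # First backoff after reset_backoff(): start at the minimum
        return MIN_RETRY_WAIT
    # Triple the previous interval, clamped to [MIN, MAX]
    interval = (now - last_attempt_at) * 3
    return max(MIN_RETRY_WAIT, min(interval, MAX_RETRY_WAIT))

now = datetime(2024, 1, 1, 12, 0)
assert get_retry_interval(None, now) == timedelta(minutes=5)
assert get_retry_interval(now - timedelta(hours=1), now) == timedelta(hours=3)
assert get_retry_interval(now - timedelta(days=10), now) == timedelta(days=7)
```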
@@ -991,11 +1028,17 @@ def get_repeat_record_ids(self, domain, repeater_id=None, state=None, payload_id
class RepeatRecord(models.Model):
    domain = models.CharField(max_length=126)
    payload_id = models.CharField(max_length=255)
    repeater = models.ForeignKey(Repeater,
                                 on_delete=DB_CASCADE,
                                 db_column="repeater_id_",
                                 related_name='repeat_records')
    state = models.PositiveSmallIntegerField(choices=State.choices, default=State.Pending)
    repeater = models.ForeignKey(
        Repeater,
        on_delete=DB_CASCADE,
        db_column="repeater_id_",
        related_name='repeat_records',
    )
    state = models.PositiveSmallIntegerField(
        choices=State.choices,
        default=State.Pending,
        db_index=True,
    )
    registered_at = models.DateTimeField()
    next_check = models.DateTimeField(null=True, default=None)
    max_possible_tries = models.IntegerField(default=MAX_BACKOFF_ATTEMPTS)
@@ -1175,7 +1218,8 @@ def fire(self, force_send=False, timing_context=None):
            self.repeater.fire_for_record(self, timing_context=timing_context)
        except Exception as e:
            self.handle_payload_error(str(e), traceback_str=traceback.format_exc())
            raise
            return self.state
        return None

    def attempt_forward_now(self, *, is_retry=False, fire_synchronously=False):
        from corehq.motech.repeaters.tasks import (
@@ -1185,6 +1229,19 @@ def attempt_forward_now(self, *, is_retry=False, fire_synchronously=False):
            retry_process_datasource_repeat_record,
        )

        def is_new_synchronous_case_repeater_record():
            """
            Repeat record is a new record for a synchronous case repeater.
            See corehq.motech.repeaters.signals.fire_synchronous_case_repeaters
            """
            return fire_synchronously and self.state == State.Pending

        if (
            toggles.PROCESS_REPEATERS.enabled(self.domain, toggles.NAMESPACE_DOMAIN)
            and not is_new_synchronous_case_repeater_record()
        ):
            return
Contributor commented:

Consider the scenario where this toggle is enabled and then later disabled. Will next_check discrepancies between RepeatRecord and Repeater be handled gracefully?

kaapstorm (Contributor Author) commented:

If by "gracefully" you mean "will all repeat records be processed when the toggle is disabled?", then yes.

But if by "gracefully" you mean "will Datadog show the repeat records as not overdue?", then no. Datadog will show the repeat records as overdue.

Alternatively, when we apply a back-off to a repeater, we could also update all its pending and failed repeat records too. I considered this, but it felt like a lot of churn on the repeat records table. I thought a better approach would be to use a new metric for Datadog to gauge when the repeat records queue is getting backed up.

        if self.next_check is None or self.next_check > datetime.utcnow():
            return
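The gating at the top of `attempt_forward_now()` can be restated as a small decision function: when the PROCESS_REPEATERS toggle is on, records are left to the `process_repeaters()` loop, except new records fired synchronously; otherwise the legacy `next_check` schedule decides. A hedged restatement (`STATE_PENDING` stands in for the real `State` enum):

```python
from datetime import datetime

STATE_PENDING = 1  # stands in for State.Pending

def should_enqueue_now(toggle_enabled, fire_synchronously, state,
                       next_check, now):
    is_new_sync_record = fire_synchronously and state == STATE_PENDING
    if toggle_enabled and not is_new_sync_record:
        return False  # left for the process_repeaters() task
    if next_check is None or next_check > now:
        return False  # nothing due yet
    return True

now = datetime(2024, 1, 1)
# Toggle on: the process_repeaters() loop owns the record...
assert not should_enqueue_now(True, False, STATE_PENDING, now, now)
# ...except a brand-new synchronous case repeater record.
assert should_enqueue_now(True, True, STATE_PENDING, now, now)
# Toggle off: legacy behaviour, fire when next_check is due.
assert should_enqueue_now(False, False, STATE_PENDING, now, now)
```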

@@ -1337,7 +1394,26 @@ def is_response(duck):


def domain_can_forward(domain):
    """
    Returns whether ``domain`` has data forwarding or Zapier integration
    privileges.

    Used for determining whether to register a repeat record.
    """
    return domain and (
        domain_has_privilege(domain, ZAPIER_INTEGRATION)
        or domain_has_privilege(domain, DATA_FORWARDING)
    )


def domain_can_forward_now(domain):
millerdev marked this conversation as resolved.
    """
    Returns ``True`` if ``domain`` has the requisite privileges and data
    forwarding is not paused.

    Used for determining whether to send a repeat record now.
    """
    return (
        domain_can_forward(domain)
        and not toggles.PAUSE_DATA_FORWARDING.enabled(domain)
    )