
Remove all time inefficient loops from mlbf #22861

Open · KevinMind wants to merge 1 commit into master from remove-all-quadratics
Conversation

KevinMind
Contributor

@KevinMind KevinMind commented Nov 15, 2024

Fixes: mozilla/addons#15170

Description

  • extract the database querying logic from MLBFDatabaseLoader into module-scope functions for easier mocking
  • reduce the number of loops required to build the initial dataset in MLBFDatabaseLoader to one
  • filter not_blocked_items with an O(1) set lookup instead of a list
  • refactor generate_diffs to accept a block_type and only diff one set of data at a time
  • add an lru_cache to generate_diffs so the computation is executed only once per instance + block_type + previous_mlbf
  • collect all mlbf tests under the test_mlbf mark
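
The set-based filtering mentioned above can be sketched roughly as follows (illustrative only; `filter_not_blocked` and its arguments are stand-ins, not the actual names in mlbf.py):

```python
# Illustrative sketch of filtering with an O(1) set lookup instead of a list.
# The names here are hypothetical stand-ins for the real mlbf.py data.
def filter_not_blocked(all_items, blocked_pairs):
    # Membership tests against a set are O(1) on average; against a list
    # they are O(m), which would make the whole filter O(n * m).
    blocked_set = set(blocked_pairs)
    return [item for item in all_items if item not in blocked_set]

all_items = [('guid@a', '1.0'), ('guid@b', '2.0'), ('guid@c', '3.0')]
blocked = [('guid@b', '2.0')]
print(filter_not_blocked(all_items, blocked))
# → [('guid@a', '1.0'), ('guid@c', '3.0')]
```

The list comprehension keeps the original ordering of `all_items`, which matters later in the discussion about order preservation.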

Context

This patch includes logic that should reduce the number of long loops required to process our mlbf data sets and determine the cache and diff of blocks across different scenarios.

Testing

Run the mlbf suite

make test_mlbf

This will run all of the mlbf-related tests. Expect all of them to pass.

Test the timeout for the slow test

  1. modify the pytest.mark.timeout(120) mark, reducing the time to 3 (or some other low number)
  2. run the test
pytest ./src/olympia/blocklist/tests/test_cron.py::TestUploadToRemoteSettings::test_large_data_set
  3. expect the test to fail due to the timeout
FAILED src/olympia/blocklist/tests/test_cron.py::TestUploadToRemoteSettings::test_large_data_set - Failed: Timeout >10.0s

Test in local environment

If you want to test with a real data set, you will have to create a large number of blocks and execute the cron job.

Code snippet

Execute this code. You will need to execute it many times, or bump `1_000` up to between 1 and 2 million. Grab a coffee; this will take a while.

After that is finished, execute `./manage.py cron upload_mlbf_to_remote_settings` and expect it to take up to 2 minutes to complete.

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

@KevinMind KevinMind force-pushed the remove-all-quadratics branch 2 times, most recently from dfd3c9c to 42483e3 Compare November 15, 2024 20:14
@KevinMind KevinMind marked this pull request as ready for review November 15, 2024 20:42
@KevinMind KevinMind requested a review from willdurand November 15, 2024 20:42
@KevinMind KevinMind force-pushed the remove-all-quadratics branch 2 times, most recently from b5495d2 to 64858dc Compare November 18, 2024 12:29
@KevinMind KevinMind force-pushed the remove-all-quadratics branch from 64858dc to fe76d06 Compare November 18, 2024 13:42
@KevinMind KevinMind marked this pull request as draft November 18, 2024 14:01
@KevinMind KevinMind force-pushed the remove-all-quadratics branch from fe76d06 to 372b9e4 Compare November 18, 2024 14:05
@KevinMind KevinMind marked this pull request as ready for review November 18, 2024 14:06
@KevinMind KevinMind force-pushed the remove-all-quadratics branch from 372b9e4 to 5c3819e Compare November 18, 2024 14:07
@willdurand willdurand requested review from a team and eviljeff and removed request for willdurand and a team November 19, 2024 09:59
@willdurand
Member

willdurand commented Nov 19, 2024

Thanks for adding the new tests, that looks good at first glance. I'm going to focus on #22828 so I'll let someone else take care of this PR.

Fixes: mozilla/addons#15170

I'd suggest filing a new issue for this, it could be about adding tests (and fixing more things revealed by the tests)

Member

@eviljeff eviljeff left a comment

wdyt?

@@ -21,15 +22,47 @@


def ordered_diff_lists(
Member

Can we just use sets instead? They're much more efficient for comparison, and, to me, sticking to sets everywhere would be more straightforward.

We only need a specific order when we're writing to JSON (because it doesn't support sets), so we could just sort the data right before writing.
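
The sort-at-write-time idea amounts to something like this (a minimal sketch; the payload shape is invented for illustration):

```python
import json

# Sets are unordered and not JSON-serializable; sorting at write time
# produces a stable list without keeping order during computation.
blocked = {'guid@b:2.0', 'guid@a:1.0'}
payload = json.dumps({'blocked': sorted(blocked)})
print(payload)
# → {"blocked": ["guid@a:1.0", "guid@b:2.0"]}
```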

Contributor Author

I'm afraid ChatGPT and I disagree with you on this one. Here's the breakdown.

TL;DR: we save no time (we actually lose a little) using set difference. On top of that, it is less space efficient and, as you said, does not preserve order. I see no reason to prefer it on merit.

In the current implementation we have:

  • Building previous_set: This takes O(len(previous)) time.
  • List comprehension:
    • Iterates over current, which is O(len(current)).
    • The x not in previous_set check is O(1) per element due to the constant-time lookup in a set.

Total Time Complexity: O(len(current) + len(previous))
Space Complexity: O(len(previous)) for previous_set.

In the set only option:

  • Building set(current): O(len(current))
  • Building set(previous): O(len(previous))
  • Set difference set(current) - set(previous):
    • Iterates over set(current), which is O(len(set(current))).
    • For each element, checks if it's in set(previous), which is O(1) per element.
  • Converting the set difference to a list: O(len(result))

Total Time Complexity: O(2 × len(current) + len(previous))
Space Complexity: O(len(current) + len(previous)) for both sets.

Comparison:

  • Time Complexity: Both functions have the same time complexity of O(len(current) + len(previous)).
  • Space Complexity: Function 1 uses less additional space because it only builds one set (previous_set), whereas Function 2 builds two sets (set(current) and set(previous)).
  • Order Preservation: Function 1 preserves the order of elements in current, while Function 2 does not, because sets are unordered collections.
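
For reference, the two options under discussion look roughly like this (a simplified sketch; `ordered_diff_lists` here is a stand-in for the real function in mlbf.py):

```python
def ordered_diff_lists(previous, current):
    # Option 1: one set for O(1) membership tests; the list comprehension
    # preserves the order of `current`. O(len(current) + len(previous)).
    previous_set = set(previous)
    return [x for x in current if x not in previous_set]

def set_diff_lists(previous, current):
    # Option 2: pure set arithmetic. Same asymptotic time, but it builds
    # two sets and loses the original ordering of `current`.
    return list(set(current) - set(previous))

previous = ['a', 'b', 'c']
current = ['d', 'b', 'e', 'a']
print(ordered_diff_lists(previous, current))         # → ['d', 'e']
print(sorted(set_diff_lists(previous, current)))     # → ['d', 'e'] (only stable after sorting)
```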

Member

If we're just considering this function chatGPT may have a point, but I meant use sets throughout the code in this file; then we can use more efficient set arithmetic and drop this function. (And do sorted(...) only when we're writing json)

Contributor Author

FWIW, I wasn't considering just this function, and it doesn't make a difference in any case. Using set arithmetic will result in more space usage and the same amount of time. It simply does not benefit us here, outside of perhaps looking nicer.

Additionally, one of the issues we had with the approach of sorting before writing is that the built-in sort method didn't seem to work predictably, causing really flaky tests. I think we'd have to figure that out if we wanted to switch to sets.

src/olympia/blocklist/mlbf.py Outdated Show resolved Hide resolved
src/olympia/blocklist/mlbf.py Outdated Show resolved Hide resolved
Comment on lines 180 to 182
for guid, version, id, block_type in get_all_blocked_items():
_blocked_version_ids.append(id)
_blocked[block_type].append((guid, version))
Member

Suggested change
for guid, version, id, block_type in get_all_blocked_items():
_blocked_version_ids.append(id)
_blocked[block_type].append((guid, version))
for guid, version_string, version_id, block_type in get_all_blocked_items():
_blocked_version_ids.append(version_id)
_blocked[block_type].append((guid, version_string))

Member

Though I do wonder if it's more efficient to generate _blocked_version_ids and _blocked with list/set generator syntax, rather than loop through with a for loop and append one at a time.

Contributor Author

Well, this loop is O(n) over all blocked items, and there are two O(1) appends per iteration, so it's still O(n).

Do you have an example of something that would be more efficient? I think this doesn't look as pretty but is actually the efficient way to go.

Member

@eviljeff eviljeff Dec 2, 2024

I was thinking more that, in python anyway, it's generally more efficient to build a list with a generator than to build it by iterating and repeatedly appending items (because it needs resizing and reallocating more). I don't have stats to back-up the difference between generating two lists vs. iterating over the list once though.

Contributor Author

Yeah, IDK about the perf difference there, but I imagine it would not be significant, since we are comparing a slightly slower single iteration against a slightly faster double iteration. We could give it a try and see if there is a big diff.
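
A quick way to try it would be a `timeit` comparison of the two shapes, with synthetic rows standing in for `get_all_blocked_items()` (illustrative only; the real data and row shape will differ):

```python
import timeit

# Synthetic rows shaped like (guid, version, version_id, block_type).
rows = [(f'guid{i}', f'{i}.0', i, i % 2) for i in range(10_000)]

def append_loop():
    # Build the id list by repeated .append() in a plain for loop.
    ids = []
    for _guid, _version, version_id, _block_type in rows:
        ids.append(version_id)
    return ids

def comprehension():
    # Build the same list with a single list comprehension.
    return [version_id for _, _, version_id, _ in rows]

assert append_loop() == comprehension()
print('append loop:  ', timeit.timeit(append_loop, number=100))
print('comprehension:', timeit.timeit(comprehension, number=100))
```

The comprehension tends to win per pass, but the PR's single loop fills two collections in one pass, so the overall difference would need measuring on real data.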

src/olympia/blocklist/mlbf.py Outdated Show resolved Hide resolved
src/olympia/blocklist/mlbf.py Outdated Show resolved Hide resolved
src/olympia/blocklist/tests/test_cron.py Show resolved Hide resolved
@KevinMind KevinMind force-pushed the remove-all-quadratics branch from 5c3819e to 26f0cf2 Compare November 21, 2024 16:00
@KevinMind KevinMind requested a review from eviljeff November 21, 2024 16:06
Remove test

Remove all time inefficient loops from mlbf
@KevinMind KevinMind force-pushed the remove-all-quadratics branch from 26f0cf2 to 0406b1a Compare November 21, 2024 16:11
Member

@eviljeff eviljeff left a comment

... yeah? I remain in the pro-set camp, though. And I'd like someone who's been more involved with the intricacies of the current soft-blocking work (e.g. @willdurand) to give this a once-over too.

(I understand this may not be merged for a while anyway, if we're soft-freezing blocklisting changes for QA)


@@ -233,16 +262,17 @@ def generate_and_write_filter(self):

log.info(json.dumps(stats))

@lru_cache(maxsize=128) # noqa: B019
Member

nit: I don't think we're ever going to call the function 128 times 🙃 - it's typically like, 3 times, isn't it?

Contributor Author

I think it can be more than three, but yeah, probably not 128. I think I copy-pasted that from the docs; maybe there is a sensible default.
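
For context, the pattern being discussed looks roughly like this (a sketch with invented names; `functools.lru_cache` keys the cache on the arguments, including `self`, which is also why the B019 lint warning fires on methods):

```python
from functools import lru_cache

class DiffHolder:
    # Sketch of caching an expensive per-(instance, block_type) computation.
    # maxsize only needs to cover the distinct argument combinations we
    # expect, so a small value like 8 would already fit a few block types.
    def __init__(self, data):
        self.data = data
        self.calls = 0

    @lru_cache(maxsize=8)  # noqa: B019 - the cache keeps a reference to self
    def generate_diffs(self, block_type):
        self.calls += 1
        return tuple(x for x in self.data if x[1] == block_type)

holder = DiffHolder([('guid@a', 'hard'), ('guid@b', 'soft')])
holder.generate_diffs('hard')
holder.generate_diffs('hard')  # served from the cache; body not re-run
print(holder.calls)
# → 1
```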

Comment on lines +383 to +384
# This test is very slow, so we increase the timeout to 180 seconds.
@pytest.mark.timeout(120)
Member

120 or 180?

@diox
Member

diox commented Dec 2, 2024

Just to make it clear: because QA is testing things right now and this is the last push of the year, we'd like to avoid big changes to blocklisting, so this should NOT land in this sprint, even if approved.

Successfully merging this pull request may close these issues.

[Bug]: upload_mlbf_to_remote_settings hangs on the diff