Rebase oss-fuzz onto master #4593

Open
wants to merge 78 commits into master

Conversation

jonathanmetzman
Collaborator

No description provided.

jonathanmetzman and others added 30 commits January 8, 2025 17:08
From now on I will commit oss-fuzz work specifically to this branch.

---------

Co-authored-by: Vitor Guidi <[email protected]>
Co-authored-by: Ali HIJAZI <[email protected]>
Also increase the number of utasks we can get from the queue
Use randomness to "load balance" for now.

---------

Co-authored-by: Vitor Guidi <[email protected]>
Co-authored-by: Ali HIJAZI <[email protected]>
1. Don't defer the tasks; they've already been deferred.
2. Improve scheduling time by 20% by batching calls so that we only
need to read YAML and query the database once per queue read.

---------

Co-authored-by: Vitor Guidi <[email protected]>
Co-authored-by: Ali HIJAZI <[email protected]>
Use imap_unordered for these improvements.
1. Async signing: we can start signing the arbitrary URLs before we are
done listing and signing the existing URLs.
2. Ignoring the order means we can sign/list even faster.
3. Using iterables means we don't hold everything in memory unnecessarily.
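
The pattern is roughly the following (a minimal sketch with hypothetical helper names, not the actual ClusterFuzz storage code):

```python
# Minimal sketch: sign URLs as workers finish, in any order, while consuming
# the listing lazily so nothing is held in memory unnecessarily.
from multiprocessing.pool import ThreadPool


def _sign_url(url):
  # Placeholder for the real signed-URL call.
  return url + '?signature=...'


def sign_urls(url_iterable, num_workers=16):
  with ThreadPool(num_workers) as pool:
    # imap_unordered starts yielding results before the input iterable is
    # exhausted and ignores completion order, so listing and signing overlap.
    yield from pool.imap_unordered(_sign_url, url_iterable)
```
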
We're experiencing big delays in our batch scheduling (meaning batch is
an hour behind). We may have too many jobs for batch.
Start trying to bunch the tasks more efficiently so we have fewer jobs.
The experiment failed because batch is throttling job creation for some
reason.
Lower the number of tasks we schedule so we don't create a backlog that
prevents non-fuzz tasks from being run.
It's not used anymore.

Related: b/381528482
We don't use it anymore.
Related: b/381528482
Allow utask_main scheduling
Also delete some example code.
Instead of trying to schedule a small number of fuzz tasks every 2
minutes and hoping this leads to fuzzing at full capacity, just schedule
almost the full amount at once.
vitorguidi and others added 30 commits January 8, 2025 17:08
…estcase count (#4494)

### Motivation

#4364 implemented a metric to track the percentile distributions of
untriaged testcase age. It was overcounting testcases:
* Testcases for which a bug was already filed
* Testcases for which the crash was unimportant

This PR solves the issue and adds the UNTRIAGED_TESTCASE_COUNT metric,
drilled down by job and platform, so we can also know how many testcases
are stuck, not only their age distribution.
The metric for untriaged testcase age was not considering bugs that were
being filed legitimately, so there was no metric emission at all.

Also, removes granularity in the stuck testcase count metric.
…COUNT (#4500)

### Motivation

The UNTRIAGED_TESTCASE_COUNT metric worked as expected; however, it is
hard to figure out which job the testcase belongs to, and why the
testcase is stuck. This PR adds the job label, and a status one, which
can be one of four values:

PENDING_CRITICAL_TASKS = 'pending_critical_tasks'
PENDING_PROGRESSION = 'pending_progression'
PENDING_GROUPING = 'pending_grouping'
PENDING_FILING = 'pending_filing'

Since (job, status) tuples might disappear from one triage run to the
next, the period of time the gauge is valid for should be, at most, the
interval at which the triage cronjob runs; otherwise it might overcount
things.

Part of #4271
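
For illustration, the metric could be declared roughly like this, in the style of monitoring_metrics.py (the metric name, field_spec and the exact GaugeMetric signature are assumptions, not the real definition):

```python
# Illustrative declaration only; names and signatures are assumed.
from clusterfuzz._internal.metrics import monitor

UNTRIAGED_TESTCASE_COUNT = monitor.GaugeMetric(
    'issues/untriaged_testcase_count',
    description='Number of testcases that are stuck in triage.',
    field_spec=[
        monitor.StringField('job'),
        monitor.StringField('status'),
    ])

# Emitted once per triage cron run, e.g.:
#   UNTRIAGED_TESTCASE_COUNT.set(count, {'job': job, 'status': status})
# so the gauge is never considered valid for longer than the cron interval.
```
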
Fixes the following error:

```
AttributeError
'ProtoType' object has no attribute 'DESCRIPTOR'
Failed to flush metrics: 'ProtoType' object has no attribute 'DESCRIPTOR'
Traceback (most recent call last):
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 118, in _flush_metrics
    metric.monitoring_v3_time_series(series, labels, start_time, end_time,
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 350, in monitoring_v3_time_series
    self._set_value(point.value, value)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 404, in _set_value
    point.int64_value = value
    ^^^^^^^^^^^^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/third_party/proto/message.py", line 935, in __setattr__
    pb_value = marshal.to_proto(pb_type, value)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/third_party/proto/marshal/marshal.py", line 229, in to_proto
    proto_type.DESCRIPTOR.has_options
    ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ProtoType' object has no attribute 'DESCRIPTOR'
```
…ss, maybe_retry and failure outcomes (#4499)

### Motivation

#4458 implemented a task outcome metric, so we can track error rates in
utasks, by job/task/subtask.

As failures are expected in ClusterFuzz, initially only unhandled
exceptions were considered actual errors. Chrome folks asked for
a better partitioning of error codes, which is implemented here as the
following outcomes:

* success: the task has unequivocally succeeded, producing a sane result
* maybe_retry: some transient error happened, and the task is
potentially being retried. This might capture some unretriable failure
condition, but it is a compromise we are willing to make in order to
decrease false positives.
* failure: the task has unequivocally failed.

Part of #4271
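
A minimal sketch of how such a three-way partition could look (the enum values follow the outcomes above, but the error-code names and the mapping are hypothetical, not uworker_output.ErrorType itself):

```python
import enum


class Outcome(enum.Enum):
  SUCCESS = 'success'          # task unequivocally succeeded, sane result
  MAYBE_RETRY = 'maybe_retry'  # transient error; task may be retried
  FAILURE = 'failure'          # task unequivocally failed


# Hypothetical set of error codes considered transient/retriable.
_MAYBE_RETRY_ERRORS = frozenset(['BOT_DIED', 'QUEUE_TIMEOUT'])


def classify_outcome(error_code):
  """Maps a task's error code (None on success) to an outcome label."""
  if error_code is None:
    return Outcome.SUCCESS
  if error_code in _MAYBE_RETRY_ERRORS:
    return Outcome.MAYBE_RETRY
  return Outcome.FAILURE
```
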
…zzer generated test cases (#4481)

### Motivation

[#4364](#4364) implemented tracking for the time it takes
ClusterFuzz to complete several steps of the manually uploaded testcase
lifecycle.

As per Chrome's request, the metric will now contain an 'origin' label,
which indicates if the testcase was 'manually_uploaded' or generated by
a 'fuzzer'.

The code was also simplified by reusing the get_age_in_seconds method
from the TestCase entity.

Also, it adds the 'stuck_in_triage' boolean field to the testcase
entity, to make it easier to figure out which testcases are stuck, so
follow-up actions can be taken.

Part of #4271
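
A rough sketch of the emission with the new 'origin' label (the metric API and the uploader check are assumptions; the function name, get_age_in_seconds and the label values come from the description above):

```python
# Sketch only; the exact call sites are in the PR diff.
def emit_testcase_triage_duration_metric(testcase):
  origin = 'manually_uploaded' if testcase.uploader_email else 'fuzzer'
  TESTCASE_UPLOAD_TRIAGE_DURATION.add(
      testcase.get_age_in_seconds(),
      {'origin': origin, 'job': testcase.job_type})
```
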
There is no clear way to partition ErrorType enums under the proposed
criteria, so this reverts it.

Part of #4271
This adds 9 archives from Fuzzilli to the general test-input archives
used by fuzzers on ClusterFuzz. The Fuzzilli-side archives are refreshed
every few days. We'll add a freshness metric in a follow-up.

This was tested locally with butler.

See bug: https://crbug.com/362963886
Allow instances to specify the GCS bucket location for data bundle
buckets in `project.yaml` as a new key: `data_bundle_bucket_location`.

This will allow creating regional buckets instead of using the default
`US` multi-region which results in high data transfer costs in Chrome's
instance.
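
For illustration, a deployment could set the new key in project.yaml roughly like this (the location value is only an example):

```yaml
# Hypothetical project.yaml excerpt: regional location for data bundle buckets.
data_bundle_bucket_location: us-central1
```
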
Filters mean that pagination isn't handled for you.
Custom binaries are often used with emulators (i.e. the binary being used by
the host).

This code follows the same strategy used for the ADB binary above in
`get_adb_path()`.
This was requested by Chrome in b/384493595.
With more team members joining, I think it's best to enforce this
policy rather than expect everyone to know it.
This enables a job or fuzzer config to disable minimization with an
environment variable. We can specify this in the additional environment
variables of particular fuzzers, like Binaryen, that don't work well with
ClusterFuzz's built-in minimization attempts.

Bug: https://crbug.com/380264190
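
A hedged sketch of the kind of check this enables (the variable name below is hypothetical; only the mechanism, an environment variable set per job or fuzzer, comes from the description):

```python
# Sketch only: the real variable name and call site are in the PR diff.
from clusterfuzz._internal.system import environment


def should_skip_minimization():
  # Hypothetical flag set via a job's or fuzzer's additional environment
  # variables to turn off ClusterFuzz's built-in minimization attempts.
  return bool(environment.get_value('SKIP_MINIMIZATION', False))
```
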
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499, outcomes were split between success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking on this technical debt so
we can deliver #4271.

A refactor of uworker main will be performed later, so we can split the
success and failure conditions, both of which are mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
…stuck in analyze (#4547)

### Motivation

We currently have no way to tell whether analyze task was successfully
executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would
only track duration for tasks that did finish.

An analyze_pending field is added to the Testcase entity in datastore;
it is set to False by default, to True for manually uploaded
testcases, and back to False once analyze task postprocess runs.

It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a
status label, so we can know at what step the testcase is stuck, thus
allowing us to alert if analyze is taking longer to finish than
expected.

The alert itself could be, for instance, P50 age of untriaged testcase
(status=analyze_pending) > 3h.
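
A rough sketch of the analyze_pending lifecycle described above (field and step names follow the text; the surrounding functions are assumed, not actual ClusterFuzz code):

```python
# Sketch only; the real updates happen in the upload handler and in
# analyze task's postprocess step.

def on_testcase_uploaded(testcase):
  # Manually uploaded testcases start out waiting for analyze.
  testcase.analyze_pending = True
  testcase.put()


def on_analyze_postprocess(testcase):
  # Once analyze postprocess runs, the testcase is no longer pending,
  # so the triage cron stops counting it as stuck in analyze.
  testcase.analyze_pending = False
  testcase.put()
```
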

Also, this retroactively addresses comments from #4481:

* Fixes docstring for emit_testcase_triage_duration_metric
* Removes assertions
* Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION,
since it now accounts for fuzzer generated testcases
* Uses a boolean "from_fuzzer" field, instead of an "origin" string, in
TESTCASE_TRIAGE_DURATION
ClusterFuzz will either confirm bugs are reproducible, mark them as
non-reproducible and close them, or say nothing. In the last case, we want to
auto-close any bugs that have been open long enough for it to have
tried reproducing and failed.
### Motivation

As a final step for the monitoring project, this PR creates a dashboard
for overall system health.

The reasoning behind it is to have:
* Business level metrics (fuzzing hours, generated testcases, issues
filed, issues closed, testcases for which processing is pending)
* Testcase metrics (untriaged testcase age and count)
* SQS metrics (queue size, and published messages, per topic)
* Datastore/GCS metrics (number of requests, error rate, and latencies)
* Utask level metrics (duration, number of executions, error rate,
latency)

These are sufficient to apply the [RED
methodology](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/)
(rate, error and duration), provide high level metrics we can alert on,
and aid in troubleshooting outages with a well defined methodology.

There were two options to commit this to version control: terraform, or
butler definitions. The first was chosen, since it is the preferred long-term
solution, and it is also simpler to implement, since it supports
copy-pasting the JSON definition from GCP.

### Attention points

This should be automatically imported from main.tf, so it should be
sufficient to just place the .tf file in the same folder and have
butler deploy handle the terraform apply step.

### How to review

Head over to go/cf-chrome-health-beta, internally. It is not expected
that the actual dashboard definition is reviewed; it is just a dump of
what was manually created in GCP.

Part of #4271
Reverting #4497 due to terraform apply erroring out; will reland once it
is fixed.
This brings back #4497 . The issue was that, in promql definitions, the
'{}' characters were lacking a double escape (\\\\{ and \\\\}).
Also, the only parameter to the terraform object should be
dashboard_json.

Testing: terraform validate


![image](https://github.com/user-attachments/assets/f67434d0-c8e4-4bc0-be38-27188c98ea49)

Part of #4271
Chrome is split in two projects internally, and the current way
terraform is organized will create the dashboard in the wrong project.

This PR makes the dashboard reference the secondary project id, thus
solving the problem.

Part of #4271
App Engine crons are currently not instrumented, so we cannot get
testcase metrics for the google3 deployment. This PR fixes that.

Part of #4271
This solves the following error:

```
bad argument type for built-in operation
Traceback (most recent call last):
  File "/srv/handlers/base_handler.py", line 278, in dispatch_request
    return super().dispatch_request(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/views.py", line 188, in dispatch_request
    return current_app.ensure_sync(meth)(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/libs/handler.py", line 100, in wrapper
    with monitor.wrap_with_monitoring():
  File "/layers/google.python.runtime/python/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 147, in wrap_with_monitoring
    initialize()
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 605, in initialize
    _initialize_monitored_resource()
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 574, in _initialize_monitored_resource
    _monitored_resource.labels['instance_id'] = instance_name
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
TypeError: bad argument type for built-in operation
```

The instance ID was being improperly fetched.

Docs for the env var can be found at
https://cloud.google.com/appengine/docs/standard/python3/runtime
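
Per the linked runtime docs, the instance ID is exposed through the GAE_INSTANCE environment variable, so the intended lookup is presumably along these lines (a sketch; the exact ClusterFuzz fix is not shown in this message):

```python
import os


def _get_instance_name():
  # GAE_INSTANCE holds the instance ID in the App Engine standard runtime.
  # Fall back to an empty string so the monitored-resource label is always
  # a string, avoiding the TypeError above.
  return os.environ.get('GAE_INSTANCE', '')
```
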
This makes sure that NOTREACHED()s are classified as CHECK failures and
handled similarly. Before this change, ClusterFuzz filed a lot of issues
as "Abrt in logging::NotReachedLogMessage::~NotReachedLogMessage".
…4588)

This addresses the errors in b/384820142

The main issue is that, in _MetricsRecorder, trimmed_labels was a
shallow copy of self._labels with job removed, causing utask durations
with different jobs to collapse onto the same label set
(deduplication seems to only take into account the labels declared in
monitoring_metrics.py).

The other error is related to saturation, where time series would be
written too often. This probably happens due to the loss of the job
label, causing many requests to collapse onto a subset of the original
labels. It may go away once this lands; otherwise, more investigation is
needed.
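
The labels pitfall described above looks roughly like this (illustrative only; the actual fix is in the diff):

```python
# Buggy pattern: 'trimmed' aliases the shared labels dict, so popping 'job'
# here also mutates the dict the recorder keeps, and different jobs end up
# reporting identical label sets:
#   trimmed = labels
#   trimmed.pop('job', None)

def trimmed_labels(labels):
  trimmed = dict(labels)    # copy first, so the caller's dict is untouched
  trimmed.pop('job', None)  # then drop the label excluded from this metric
  return trimmed
```
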
Updated triage.py to file bugs for fuzzer crashes by lowering the
threshold limit. These fuzzers run on the Android platform and the
libFuzzer engine.
The last terraform definition was wrongly escaping the curly braces with
"\\". The actual intended way to do it is to double the dollar sign.

Stack overflow reference:
https://stackoverflow.com/questions/55796685/terraform-escaping-bash-command-substitution

The concrete problem this causes is that every promql query would
fail in the dashboard.

Testing strategy: terraform validate ran successfully


![image](https://github.com/user-attachments/assets/5fb95015-6f60-42d4-8b25-dfb89f9bcbf8)
…ng (#4592)

This fixes the following error in App Engine crons:

```
Failed to flush metrics: expected string or bytes-like object, got 'NoneType'
Traceback (most recent call last):
  File "/layers/google.python.runtime/python/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
  File "/layers/google.python.runtime/python/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/layers/google.python.runtime/python/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 171, in _flush_loop
    self._flush_function()
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 134, in _flush_metrics
    logs.error(f'Failed to flush metrics: {e}')
LogError: Failed to flush metrics: expected string or bytes-like object, got 'NoneType'
Traceback (most recent call last):
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 118, in _flush_metrics
    metric.monitoring_v3_time_series(series, labels, start_time, end_time,
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 343, in monitoring_v3_time_series
    self.monitoring_v3_metric(time_series.metric, labels)
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 321, in monitoring_v3_metric
    metric.labels['region'] = _get_region(bot_name)
                              ^^^^^^^^^^^^^^^^^^^^^
  File "/srv/clusterfuzz/_internal/metrics/monitor.py", line 637, in _get_region
    if re.match(pattern['pattern'], bot_name):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/google.python.runtime/python/lib/python3.11/re/__init__.py", line 166, in match
    return _compile(pattern, flags).match(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or bytes-like object, got 'NoneType'
```
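
One plausible guard for this (a sketch, assuming the pattern-list shape visible in the traceback; the real fix may differ):

```python
import re

# Assumed shape of the configured region patterns; the real list comes from
# the monitoring configuration.
REGION_PATTERNS = [
    {'pattern': r'.*-us-central1-.*', 'region': 'us-central1'},
]


def _get_region(bot_name):
  # bot_name is None for App Engine crons, which triggered the TypeError
  # above, so guard before handing it to re.match.
  if not bot_name:
    return 'unknown'
  for pattern in REGION_PATTERNS:
    if re.match(pattern['pattern'], bot_name):
      return pattern['region']
  return 'unknown'
```
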