C#/Java: Content based model generation improvements. #17363

michaelnebel · 2024-09-03T12:50:56Z

If we apply the content based model generator to

- the Java SDK, we get around 80k models.
- the .NET 8 Runtime, we get around 500k models.

Many of these models are undesirable, primarily for one of the following reasons:

- Due to the lack of type constraints, we get lots of models, as "data" could have originated from lots of different implementations (and conversely for stores).
- A model reads from a synthetic field that no other model writes to (and conversely).

With the changes proposed in this PR, applying the content based model generator to

- the Java SDK, we get around 4k models.
- the .NET 8 Runtime, we get around 5k models.

The issues are addressed in the following way:

- We apply a rough filter to the APIs: content based model generation is only considered for an API if the formula below holds for that particular API. The reasoning behind this is also listed as a code comment, and the magic numbers 2 and 3 in the formula are definitely up for debate. If another approach should be taken (e.g. filtering APIs based on input and return types), then we can discuss this as well. Applying this constraint to the target APIs from the Java SDK removes approximately 2% of the possible target APIs for summary generation, which indicates that this is not an especially invasive constraint.
  - # summaries <= 2 * # parameters + 3
- We only include a summary for an API that reads from or stores into a synthetic field if there exists a "chain" of APIs with summaries such that data originates from non-synthetic content and targets non-synthetic content. The canonical example of this is a pair of get and set methods that use a private backing field X. In this case we would like to generate the summaries set : Argument[0] -> Argument[this].SyntheticField[X] and get : Argument[this].SyntheticField[X] -> ReturnValue, but we only want to do this if both methods exist. If only the get method exists, then it retrieves information from a "dead" synthetic field. The synthetic chaining is currently based on access path equality. That restriction could potentially be loosened a bit, but I suspect that this would only have a limited impact on the generated summaries.
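To make the first filter concrete, here is a minimal Python sketch of the count-based check. It is not the QL from this PR: the `Api` record, its field names and the example APIs are made up for illustration, and only the formula with the magic numbers 2 and 3 comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Api:
    """Hypothetical stand-in for a summary target API."""
    name: str
    parameter_count: int
    summary_count: int  # number of content based summaries generated for the API


def is_content_based_target(api: Api, factor: int = 2, offset: int = 3) -> bool:
    """Rough filter: # summaries <= factor * # parameters + offset."""
    return api.summary_count <= factor * api.parameter_count + offset


# Made-up APIs: two small accessors and one method with many summaries.
apis = [
    Api("Box.set", parameter_count=1, summary_count=1),
    Api("Box.get", parameter_count=0, summary_count=1),
    Api("Graph.merge", parameter_count=2, summary_count=12),
]

# Only APIs that pass the filter are considered for content based generation.
kept = [api.name for api in apis if is_content_based_target(api)]
```

In this sketch `Graph.merge` is dropped because 12 summaries exceed the bound of 2 * 2 + 3 = 7, while both accessors stay well within their bounds.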

Note that the changes in this PR don't change the production version of the model generation and only impact models generated with the recently introduced --with-contentbased-summaries option.
Further changes are still required for this to be production ready, but the changes in this PR are self-contained.


michaelnebel · 2024-09-12T08:12:02Z

DCA looks good!

hvitved

Some initial comments.

hvitved · 2024-09-17T11:53:22Z

csharp/ql/src/utils/modelgenerator/internal/CaptureModels.qll

+ * detection instead, as reads and stores would use a significant
+ * part of an objects internal state.
+ */
+private class ContentDataFlowSummaryTargetApi extends DataFlowSummaryTargetApi {


Alternatively, we could make the restriction per parameter. I.e., if a method has only one summary for Argument[0], then include it, but if it has more than, say, 3 summaries for Argument[1], then exclude those.

Yes, that could also be interesting.
Would it be acceptable if I do that experiment as a follow-up?
As it is now, we "only" exclude approximately 2% of all the possible target APIs with this limitation.
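A per-parameter variant of the restriction could look roughly like the Python sketch below. This is only one reading of the suggestion and not code from this PR: representing summaries as (input, output) string pairs, grouping by the leading `Argument[...]` component, and the function name are all assumptions, and the threshold of 3 is the "say, 3" from the comment above.

```python
from collections import defaultdict


def filter_per_parameter(summaries, max_per_input=3):
    """Keep the summaries of one API, dropping every input argument
    that has more than `max_per_input` summaries.

    `summaries` is a list of (input, output) access path strings.
    """
    by_root = defaultdict(list)
    for inp, out in summaries:
        # The root is the leading component, e.g. "Argument[1]".
        root = inp.split(".", 1)[0]
        by_root[root].append((inp, out))

    kept = []
    for group in by_root.values():
        if len(group) <= max_per_input:
            kept.extend(group)
    return kept


# Made-up summaries for a single method with two parameters.
summaries = [
    ("Argument[0]", "ReturnValue"),
    ("Argument[1].Field[A]", "ReturnValue.Field[A]"),
    ("Argument[1].Field[B]", "ReturnValue.Field[B]"),
    ("Argument[1].Field[C]", "ReturnValue.Field[C]"),
    ("Argument[1].Field[D]", "ReturnValue.Field[D]"),
]
kept = filter_per_parameter(summaries)
```

Compared to the per-API formula, this keeps the single summary for `Argument[0]` even though the method as a whole produces many summaries.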

hvitved · 2024-09-17T11:56:11Z

csharp/ql/src/utils/modelgenerator/internal/CaptureModels.qll

+    hasSyntheticContent(read) and
+    hasSyntheticContent(store) and
+    (
+      step(t1, read, t2, store)


Shouldn't we restrict to syntPathEntry here? Otherwise it looks like we will compute O(n^2) paths.

Uh - this was intended as a helper predicate, but it looks like we need to fold it into the declarations where it is used to avoid the O(n^2).

hvitved · 2024-09-17T12:03:35Z

csharp/ql/src/utils/modelgenerator/internal/CaptureModels.qll

+   * if both of these methods exist.
+   */
+  pragma[nomagic]
+  predicate acceptReadStore(


I wonder if it is simply enough to say that we will include a synthetic field if it is part of some input specification and part of some output specification. That should also handle cases such as

setAll // input: Argument[0], output: ReturnValue.SyntheticField[Foo].Element
get // input: Argument[0].SyntheticField[Foo], output: ReturnValue

list = setAll("taint");
x = list[0];
sink(get(x));

I also thought about that as a possible "improvement": instead of using "path" equality, we could "hash" paths with synthetics (basically just print the synthetics in the order they appear in the path and then compare that). This would allow path continuations such as the one you mention.
Is it acceptable that we attempt this as a follow-up?

I was actually thinking of something even simpler where the order is not taken into account, but yes, a follow-up is fine.

We could also try without order. However, I am quite sure that we need the "chaining" logic. It turned out that it is not uncommon that we produce summaries like SyntheticField[A] -> SyntheticField[A] (without other mentions of the synthetic field A) or SyntheticField[A] -> SyntheticField[B] and SyntheticField[B] -> SyntheticField[A], if we don't restrict the use of synthetics.
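The difference between the two ideas can be illustrated with a small Python sketch. It is a simplification and not the QL in this PR: each summary is reduced to at most one synthetic field on the input side and one on the output side (the PR compares whole access paths), and the `Summary` type, the function names and the example data are hypothetical.

```python
from typing import NamedTuple, Optional


class Summary(NamedTuple):
    api: str
    reads: Optional[str]   # synthetic field read on the input side, or None
    stores: Optional[str]  # synthetic field written on the output side, or None


SRC, SINK = "<non-synthetic input>", "<non-synthetic output>"


def _closure(start, edges):
    """All nodes reachable from `start` by following `edges` (pairs u -> v)."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                todo.append(v)
    return seen


def accept_chained(summaries):
    """Keep a summary only if its input is reachable from non-synthetic
    content and its output can reach non-synthetic content."""
    edges = [(s.reads or SRC, s.stores or SINK) for s in summaries]
    forward = _closure(SRC, edges)
    backward = _closure(SINK, [(v, u) for u, v in edges])
    return [s.api for s, (u, v) in zip(summaries, edges)
            if u in forward and v in backward]


def accept_unordered(summaries):
    """Simpler check: a synthetic field is fine if it occurs in some
    input and in some output, regardless of order or chaining."""
    read = {s.reads for s in summaries if s.reads}
    stored = {s.stores for s in summaries if s.stores}
    live = read & stored
    return [s.api for s in summaries
            if all(f is None or f in live for f in (s.reads, s.stores))]


summaries = [
    Summary("set", None, "X"),      # Argument[0] -> SyntheticField[X]
    Summary("get", "X", None),      # SyntheticField[X] -> ReturnValue
    Summary("getOnly", "Y", None),  # reads a field nobody writes
    Summary("loop", "A", "A"),      # SyntheticField[A] -> SyntheticField[A]
    Summary("swapBC", "B", "C"),    # SyntheticField[B] -> SyntheticField[C]
    Summary("swapCB", "C", "B"),    # SyntheticField[C] -> SyntheticField[B]
]
chained = accept_chained(summaries)
unordered = accept_unordered(summaries)
```

Under these assumptions the chained check keeps only `set` and `get`, whereas the unordered check also keeps `loop`, `swapBC` and `swapCB`, which is the kind of purely internal summary described in the comment above.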

hvitved · 2024-09-19T08:18:15Z

csharp/ql/src/utils/modelgenerator/internal/CaptureModels.qll

+   * be translated into a synthetic field.
+   *
+   * This is needed because we don't want to include summaries that reads from or
+   * stores into a "dead" synthetic field.


"internal" instead of "dead"?

Sure! I will add a commit to the re-factor PR that changes this.

github-actions bot added C# Java labels Sep 3, 2024

github-advanced-security bot found potential problems Sep 3, 2024

java/ql/src/utils/modelgenerator/internal/CaptureModels.qll Fixed

michaelnebel force-pushed the modelgen/fieldbasedimprovements branch from 76b9cdd to 8473b8c Compare September 5, 2024 11:14

github-advanced-security bot found potential problems Sep 5, 2024

java/ql/src/utils/modelgenerator/internal/CaptureModels.qll Fixed

michaelnebel force-pushed the modelgen/fieldbasedimprovements branch from 8473b8c to 412551f Compare September 6, 2024 09:44

github-actions bot added the DataFlow Library label Sep 6, 2024

michaelnebel force-pushed the modelgen/fieldbasedimprovements branch 3 times, most recently from b5456a3 to b7d00a4 Compare September 9, 2024 11:11

github-advanced-security bot found potential problems Sep 9, 2024

csharp/ql/src/utils/modelgenerator/internal/CaptureModels.qll Fixed

java/ql/src/utils/modelgenerator/internal/CaptureModels.qll Fixed

michaelnebel force-pushed the modelgen/fieldbasedimprovements branch 4 times, most recently from 4521dfd to 30f3923 Compare September 10, 2024 09:38

michaelnebel added 9 commits September 10, 2024 15:23

Shared: Add some helper predicates to the AccessPath class in content flow.

7c0101a

Java: Improve content based model generation.

d2c98c8

Java: Update some model generator test cases.

d7e61d0

Java: Only keep the best generated model in terms of taint/value.

9149a17

Java: Add content based example with multiple paths.

0fbeca1

C#: Sync changes and make language specific parts.

e948902

C#: Add the capture content summary models query.

da012a7

C#: Adjust existing model generator tests and update expected output.

b94940b

C#: Add some synthetic field content based examples.

0abc08c

michaelnebel force-pushed the modelgen/fieldbasedimprovements branch from 30f3923 to 0abc08c Compare September 10, 2024 13:26

michaelnebel added the no-change-note-required This PR does not need a change note label Sep 10, 2024

michaelnebel marked this pull request as ready for review September 10, 2024 14:16

michaelnebel requested review from a team as code owners September 10, 2024 14:16

michaelnebel requested a review from hvitved September 10, 2024 14:16

hvitved reviewed Sep 17, 2024

C#/Java: Address review comments.

68165bb

michaelnebel requested a review from hvitved September 18, 2024 07:08

hvitved approved these changes Sep 19, 2024

michaelnebel merged commit 4a9e3ee into github:main Sep 19, 2024
52 checks passed

michaelnebel deleted the modelgen/fieldbasedimprovements branch September 19, 2024 08:49
