Python: Add type tracking for content #15711

RasmusWL · 2024-02-23T14:29:29Z

Adds support for tracking more than just attribute content in type-trackers.

Adding the store/read steps is the most important part.

The tests I wrote along the way were against the call-graph. This was convenient since I knew them already, and it's the motivation for adding this functionality... It might have been more prudent to expand the core type-tracking tests first instead, but that only got done as the very last commit.

The fact that we don't handle my_dict.get("some_key") was very surprising. I don't want to handle that aspect in this PR though, but will examine it in a followup PR. For now it lives in my own fork of the repo: RasmusWL#113
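To make the gap concrete, here is a small Python sketch of the two lookup forms. The names `handler`, `via_subscript` and `via_get` are made up for illustration, and the comments paraphrase the description above rather than actual analysis output.

```python
def handler():
    return "handled"


my_dict = {"some_key": handler}

# Subscript lookup with a constant key: presumably the kind of read
# that dictionary content tracking is meant to follow.
via_subscript = my_dict["some_key"]

# Lookup through dict.get: per the description above, not handled in this PR.
via_get = my_dict.get("some_key")

# At runtime both forms yield the same function object.
print(via_subscript is via_get)
```

At runtime the two forms are interchangeable here, which is why it is surprising for an analysis to resolve one but not the other.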

We do this to remove the inconsistencies, and to be ready for a future where type-tracking supports content tracking of depth > 1. It works because targets of load steps need to be LocalSourceNodes:

predicate loadStep(Node nodeFrom, LocalSourceNode nodeTo, Content content)

RasmusWL · 2024-03-19T10:22:16Z

DCA evaluation looks fine, new results look like TPs 👍

yoff

Nice development and good tests. It will be great to get this in! I have left some questions.

yoff · 2024-03-19T10:57:09Z

python/ql/lib/semmle/python/dataflow/new/TypeTracking.qll

@@ -5,6 +5,7 @@

 private import internal.TypeTrackingImpl as Impl
 import Impl::Shared::TypeTracking<Impl::TypeTrackingInput>
+private import semmle.python.dataflow.new.internal.DataFlowPublic as DataFlowPublic

 /** A string that may appear as the name of an attribute or access path. */
 class AttributeName = Impl::TypeTrackingInput::Content;


Maybe deprecate this?

yes, very good point 👍

yoff · 2024-03-19T11:03:15Z

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

+/**
+ * Subset of `storeStep` that should be shared with type-tracking.
+ */
+predicate storeStepCommon(Node nodeFrom, ContentSet c, Node nodeTo) {
+  tupleStoreStep(nodeFrom, c, nodeTo)
+  or
+  dictStoreStep(nodeFrom, c, nodeTo)
+  or
+  moreDictStoreSteps(nodeFrom, c, nodeTo)
+  or
+  iterableUnpackingStoreStep(nodeFrom, c, nodeTo)
+}


Why does this not contain attributeStoreStep?

Also, can we characterise the steps that should go here? Are they the ones where c is precise content? (This question is partly about maintenance of this predicate.)

Good point 👍 let me write a comment 👍

About the attributeStoreStep, as you already found out, it's already covered in type-tracking with slightly different logic, so it would be a mistake to also include it here.
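For orientation, the following Python sketch shows the kinds of source constructs that the predicate names in `storeStepCommon` appear to correspond to. The mapping is inferred from the predicate names and the commit titles in this PR; it has not been checked against the QL implementation, and all identifiers are illustrative.

```python
def f():
    return "f"


# Tuple construction: `f` is stored at index 0.
as_tuple = (f, None)

# Dictionary construction: `f` is stored under the constant key "key".
as_dict = {"key": f}

# Dictionary update after construction (what the "dict-updates" commit
# appears to target).
updated = {}
updated["key"] = f

# Iterable unpacking: elements of the right-hand side end up in `a` and `b`.
a, b = (f, f)

print(as_tuple[0] is f, as_dict["key"] is f, updated["key"] is f, a is f)
```

Attribute writes such as `obj.attr = f` are deliberately absent, matching the reply above that type-tracking already covers them with its own logic.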

yoff · 2024-03-19T11:06:24Z

python/ql/lib/semmle/python/dataflow/new/internal/LocalSources.qll

+    or
+    this instanceof IterableSequenceNode


I wonder what the consequences of this are, did you test it in isolation? A new LocalSourceNode feels like potentially a big thing, since many predicates rely on this definition. Also, it should fit with the semantics that a value is introduced in this place, but I guess that is arguably the case as it pops out of some container...

I wonder what the consequences of this are, did you test it in isolation?

I did not. I think I did that part rather hastily, and it deserved more attention.

yoff · 2024-03-19T11:14:55Z

python/ql/lib/semmle/python/dataflow/new/internal/TypeTrackingImpl.qll

+    exists(DataFlowPublic::AttrWrite a, string attrName |
+      content.(DataFlowPublic::AttributeContent).getAttribute() = attrName and
+      a.mayHaveAttributeName(attrName) and
      nodeFrom = a.getValue() and
      nodeTo = a.getObject()
    )


This looks like it could be subsumed by attributeStoreStep (which could be part of storeStepCommon), if we are willing to replace mayHaveAttributeName with the slightly stricter getAttributeName. Do we remember why we are strict in dataflow but not in type tracking? Is it simply that we do not yet (as in evaluation order) have access to this small local flow reasoning during dataflow?

I think it would be interesting to be able to share this part as well between dataflow and type-tracking, but for this PR I didn't want to deal with that complexity, so I just kept the existing behavior.
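As an illustration of the difference being discussed, the Python sketch below contrasts an attribute name written literally with one that only reaches `setattr`/`getattr` through a local variable. That `mayHaveAttributeName` resolves the second form via local flow while `getAttributeName` does not is an assumption based on the "small local flow reasoning" mentioned above; the class and variable names are hypothetical.

```python
class Box:
    pass


def payload():
    return "payload"


box = Box()

# Attribute name is a literal in the syntax.
box.direct = payload

# Attribute name reaches setattr/getattr only through a local variable,
# so resolving it needs a small amount of local flow reasoning.
name = "indirect"
setattr(box, name, payload)
read_back = getattr(box, name)

print(box.direct is read_back)
```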

yoff · 2024-03-19T11:17:38Z

python/ql/lib/semmle/python/dataflow/new/internal/TypeTrackingImpl.qll

+  class Content extends DataFlowPublic::Content {
+    Content() {
+      // TODO: for now, it's not 100% clear if should support non-precise content in
+      // type-tracking, or if it will lead to bad results. We start with only allowing
+      // precise content, which should always be a good improvement! It also simplifies
+      // the process of examining new results from non-precise content steps in the
+      // future, since you will _only_ have to look over the results from the new
+      // non-precise steps.
+      this instanceof DataFlowPublic::AttributeContent
+      or
+      this instanceof DataFlowPublic::DictionaryElementContent
+      or
+      this instanceof DataFlowPublic::TupleElementContent
+    }


I wonder if it is better to define a subset of Content called PreciseContent and refer to it here. I guess it depends on whether we expect to update content and want to remember also updating the list of precise content before we start editing this definition to not refer to precise content anymore. And on whether the concept of precise content might have value on its own...

Also, did you consider using ContentSet here like Ruby does (and like we do above for SummaryTypeTracker::Input)? It seems like, since we are in the area, we might as well be future proof...

If we start defining the same set of "precise content" again in the future, I'm all for it 👍

I think it would be nice to move to `ContentSet` like Ruby, but let's keep that for another PR 👍
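A rough Python-side picture of "precise" content, as a sketch only: the first two containers pin a value to a known key or index, matching the dictionary and tuple content kinds in the characteristic predicate above, while the list is shown as an example of a container whose content kind is not among the three listed. All names are illustrative.

```python
def callback():
    return "called"


# Constant dictionary key: the value's position is known exactly.
by_key = {"on_event": callback}

# Fixed tuple index: likewise a known position.
by_index = (callback,)

# List element: lists are not covered by the three content kinds
# listed in the characteristic predicate above.
in_list = [callback]

print(by_key["on_event"] is by_index[0])
```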

yoff · 2024-03-19T11:23:16Z

python/ql/lib/semmle/python/dataflow/new/internal/TypeTrackingImpl.qll

+    exists(DataFlowPublic::AttrRead a, string attrName |
+      content.(DataFlowPublic::AttributeContent).getAttribute() = attrName and
+      a.mayHaveAttributeName(attrName) and
      nodeFrom = a.getObject() and
      nodeTo = a
    )


Similar comment as to the store step.

When looking things over a bit more, we could actually exclude the steps that would never be used instead. A much more involved solution, but more performance oriented and clear in terms of what is supported (at least until we start supporting type-tracking with more than depth 1 access-path, if that ever happens)

RasmusWL · 2024-04-02T14:53:11Z

Did a new performance test just to ensure the new step exclusions are not causing really bad join orders 🤔

EDIT: Performance looks comparable with the new commits 👍

yoff

Interesting approach, excluding impossible steps. I guess that if we revamp our use of content, simplifying to fewer kinds, and if that then makes simplification of iterable unpacking possible (because we can get rid of conversion steps) we might be able to delete these exclusions again...

github-actions bot added the Python label Feb 23, 2024

RasmusWL force-pushed the tt-content branch from 935182c to e9c5197 Compare March 1, 2024 11:18

RasmusWL force-pushed the tt-content branch from e9c5197 to f9f5775 Compare March 12, 2024 16:39

github-actions bot added the documentation label Mar 12, 2024

RasmusWL added 17 commits March 14, 2024 10:42

Python: Prepare for general content in type-tracker

fc8caa6

Due to the char-pred of Content, this change should keep exactly the same behavior as before.

Python: Allow general content in type-tracker

636cf61

This should not result in many changes, since store/load steps are still only implemented for attributes.

Python: Setup shared read/store steps

7721fb3

Python: Expand function reference in content test

a95bb7c

Python: type-track through tuple content

ece8245

Python: type-tracking through dictionary construction

73fe596

Python: type-track through dict-updates

dac2b57

Python: Expand dict update tests

0cf3fe4

Python: Support iterable unpacking in type-tracking

92729db

Python: Accept consistency failure

8a7ffac

Python: Ignore consistency failure

4d78762

Python: Expand dict-content tt test even more

fa0c4e1

While it might be useful to track content to any lookup, it's not something we do right now.

Python: Add proper type-tracking tests for content

7de304b

Instead of just relying on the call-graph tests

Python: Add change-note

2b09b08

Python: Fixup deprecated type-tracker API

af8cef5

Python: Expand type-tracking tests with nested tuples

6ffaad1

I was initially surprised to see that this didn't work, until I remembered that type-tracking only works with content of depth 1.
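A minimal Python sketch of the nested-tuple situation, under the assumption that "depth 1" means a tracked value can sit inside at most one level of content at a time. The function and variable names are hypothetical.

```python
def outer():
    return "outer"


def inner():
    return "inner"


nested = (outer, (inner,))

# One level of content: a single read recovers `outer`.
first = nested[0]

# Two levels of content: `inner` sits in a tuple inside a tuple, so
# recovering it would need an access path of depth 2.
second = nested[1][0]

print(first(), second())
```

Both calls work at runtime; the point is only that the second lookup requires remembering two nested content layers, which depth-1 tracking cannot represent.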

RasmusWL force-pushed the tt-content branch from 6811d73 to 7a3ee0f Compare March 14, 2024 09:47

RasmusWL added 3 commits March 15, 2024 10:14

Python: Update ssa-compute test expectations

00f2a6a

Python: Accept .expected for typetracking-summaries

6babb2f

Python: Restrict type-tracking content to only be precise

7eb4419

At least for now :)

RasmusWL marked this pull request as ready for review March 15, 2024 13:15

RasmusWL requested a review from a team as a code owner March 15, 2024 13:15

yoff reviewed Mar 19, 2024

Python: Deprecate AttributeName

20202ab

RasmusWL added 2 commits April 2, 2024 13:26

Python: Add comments around storeStepCommon

8707a63

RasmusWL requested a review from yoff April 2, 2024 14:51

yoff approved these changes Apr 8, 2024

yoff merged commit 1048cf7 into github:main Apr 9, 2024
13 checks passed
