-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Add type tracking for content #15711
Conversation
Due to the char-pred of Content, this change should keep exactly the same behavior as before.
This should not result in many changes, since store/load steps are still only implemented for attributes.
While it might be useful to track content to any lookup, it's not something we do right now.
Instead of just relying on the call-graph tests
I was initially surprised to see that this didn't work, until I remembered that type-tracking only works with content of depth 1.
We do this to remove the inconsistencies, and to be ready for a future where type-tracking support content tracker of depth > 1. It works because targets of loadSteps needs to be LocalSourceNodes predicate loadStep(Node nodeFrom, LocalSourceNode nodeTo, Content content) {
DCA evaluation looks fine, new results looks like TPs 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice development and good tests. It will be great to get this in! I have left some questions.
@@ -5,6 +5,7 @@ | |||
|
|||
private import internal.TypeTrackingImpl as Impl | |||
import Impl::Shared::TypeTracking<Impl::TypeTrackingInput> | |||
private import semmle.python.dataflow.new.internal.DataFlowPublic as DataFlowPublic | |||
|
|||
/** A string that may appear as the name of an attribute or access path. */ | |||
class AttributeName = Impl::TypeTrackingInput::Content; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe deprecate this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, very good point 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 👍
/** | ||
* Subset of `storeStep` that should be shared with type-tracking. | ||
*/ | ||
predicate storeStepCommon(Node nodeFrom, ContentSet c, Node nodeTo) { | ||
tupleStoreStep(nodeFrom, c, nodeTo) | ||
or | ||
dictStoreStep(nodeFrom, c, nodeTo) | ||
or | ||
moreDictStoreSteps(nodeFrom, c, nodeTo) | ||
or | ||
iterableUnpackingStoreStep(nodeFrom, c, nodeTo) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this not contain attributeStoreStep
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can we characterise the steps that should go here? Are they the ones where c
is precise content? (This question is partly about maintenance of this predicate.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point 👍 let me write a comment 👍
About the attributeStoreStep, as you already found out, it's already covered in type-tracking with slightly different logic, so it would be a mistake to also include it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 👍
or | ||
this instanceof IterableSequenceNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what the consequences of this is, did you test it in isolation? A new LocalSourceNode
feels like potentially a big thing, since many predicates rely on this definition. Also, it should fit with the semantics, that a value is introduced in this place, but I guess that is arguably the case as it pops out of some container...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what the consequences of this is, did you test it in isolation?
I did not. I think I did that part rather hastily, and it should have deserved more attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 👍
exists(DataFlowPublic::AttrWrite a, string attrName | | ||
content.(DataFlowPublic::AttributeContent).getAttribute() = attrName and | ||
a.mayHaveAttributeName(attrName) and | ||
nodeFrom = a.getValue() and | ||
nodeTo = a.getObject() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it could be subsumed by attributeStoreStep
(which could be part of storeStepCommon
), if we are willing to replace mayHaveAttributeName
with the slightly stricter getAttributeName
. Do we remember why we are strict using dataflow and not using type tracking? Is it simply that we do not yet (as in evaluation order) have access to this small local flow reasoning during dataflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be interesting to be able to share this part as well between dataflow and type-tracking, but for this PR I didn't want to deal with that complexity, so I just kept the existing behavior.
class Content extends DataFlowPublic::Content { | ||
Content() { | ||
// TODO: for now, it's not 100% clear if should support non-precise content in | ||
// type-tracking, or if it will lead to bad results. We start with only allowing | ||
// precise content, which should always be a good improvement! It also simplifies | ||
// the process of examining new results from non-precise content steps in the | ||
// future, since you will _only_ have to look over the results from the new | ||
// non-precise steps. | ||
this instanceof DataFlowPublic::AttributeContent | ||
or | ||
this instanceof DataFlowPublic::DictionaryElementContent | ||
or | ||
this instanceof DataFlowPublic::TupleElementContent | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it is better to define a subset of Content
called PreciseContent
and refer to it here. I guess it depends on whether we expect to update content and want to remember also updating the list of precise content before we start editing this definition to not refer to precise content anymore. And on whether the concept of precise content might have value on its own...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, did you consider using ContentSet
here like Ruby does (and like we do above for SummaryTypeTracker::Input
)? It seems like, since we are in the area, we might as well be future proof...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we start defining the same set of "precise content" again in the future, I'm all for it 👍
I think it would be nice to move to `ContentSet like Ruby, but let's keep that to an other PR 👍
exists(DataFlowPublic::AttrRead a, string attrName | | ||
content.(DataFlowPublic::AttributeContent).getAttribute() = attrName and | ||
a.mayHaveAttributeName(attrName) and | ||
nodeFrom = a.getObject() and | ||
nodeTo = a | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as to the store step.
When looking things over a bit more, we could actually exclude the steps that would never be used instead. A much more involved solution, but more performance oriented and clear in terms of what is supported (at least until we start supporting type-tracking with more than depth 1 access-path, if that ever happens)
Did new performance test just to ensure the new step exclusions are not causing really bad join orders 🤔 EDIT: Performance looks comparable with the new commits 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting approach, excluding impossible steps. I guess that if we revamp our use of content, simplifying to fewer kinds, and if that then makes simplification of iterable unpacking possible (because we can get rid of conversion steps) we might be able to delete these exclusions again...
Adds support for tracking more than just attribute content in type-trackers.
Adding the store/read steps is the most important part.
The tests I wrote along the way were against the call-graph. This was convenient since I knew them already, and I it's the motivation for adding this functionality... it might have been more prudent to expand the core type-tracking tests initially instead, but that got done as the very last commit only.
The fact that we don't handle
my_dict.get("some_key")
was very surprising. I don't want to handle that aspect in this PR though, but will examine it in a followup PR. For now it lives in my own fork of the repo: RasmusWL#113