Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: Enable sound IR #16139

Merged
merged 8 commits into from
Apr 17, 2024
Merged

C++: Enable sound IR #16139

merged 8 commits into from
Apr 17, 2024

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Apr 5, 2024

In #2667 we made the IR unsound by default because it gave better results on IR-based dataflow. However, nowadays we don't use the IR's memory SSA, and thus we don't require unsound IR to get good dataflow results. Instead, unsound IR generates FPs on certain queries that requires the IR to be sound (cpp/uninitialized-local is one extreme example).

This PR basically undoes #2667 by flipping the switch Dave introduced in that PR so that we're back to sound IR.

As earlier DCA experiments has shown, this slows down the IR's alias alias analysis quite considerably (a 60% analysis regression) on https://github.com/nlohmann/json due to a few crazy large functions with lots of escaping variables. However, as QA shows this is the absolute worst offender across any project. In particular:

  • QA showed no new timeouts, and only a 1.3% performance regression.
  • QA confirms that this removes a ton of FPs from many queries. In particular, I suspect we can raise the precision of cpp/uninitialized-local from medium to high after this PR has been merged.

Note that we lose some IR alias tests results on smart pointers. These were added in #5737 back when we depended on the IR alias analysis for dataflow. However, as we're no longer using the IR alias analysis for dataflow this won't have any real effect.

@MathiasVP MathiasVP requested a review from a team as a code owner April 5, 2024 17:27
@github-actions github-actions bot added the C++ label Apr 5, 2024
@MathiasVP MathiasVP marked this pull request as draft April 5, 2024 17:28
@jketema
Copy link
Contributor

jketema commented Apr 8, 2024

My understanding was that there was a more serious performance regression when turning this on, but I don't see anything addressing this?

@MathiasVP
Copy link
Contributor Author

My understanding was that there was a more serious performance regression when turning this on, but I don't see anything addressing this?

That's correct. As I wrote in the top of this PR I ran the PR on QA and the regression turned out to be a lot smaller than anticipated (i.e., no timeouts and only a ~1.3% analysis time regression). So I think it's fine to go ahead with the change as-is.

When I raised this question async I didn't get any feeling that this was controversial. Although, I'm of course happy to take this up in a sync with the rest of the team if you think we should discuss this more before I pull this out of draft.

@jketema
Copy link
Contributor

jketema commented Apr 8, 2024

Thanks for the clarification. For me this is fine as-is.

Note that there are some other test regressions.

@MathiasVP
Copy link
Contributor Author

Note that there are some other test regressions.

Yeah, I forgot we actually use the IR memory SSA edges for one thing in dataflow: taint-flow through iterators :( I'll think about what to do here before pulling this out of draft.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Apr 11, 2024

This PR is good to go 🎉

We lose a few library results:

  • SSA for smart pointers: These aren't really used for anything since dataflow no longer uses the IR's SSA analysis. So even though the IR doesn't understand smart pointers dataflow still gives us all the right dataflow. Eventually, we should figure out what the story for this should be, but I don't think we need to do that now. I think the IR should ideally not need to deal with these somewhat special cases, but for now it does so because of historical reasons.
  • Taint-flow for iterators: Dataflow has basic support for tracking flow from an iterator write to a container read, but this was never very good and relied on unsound IR-based SSA. Now that we're getting sound SSA this breaks more easily (since more things escape, and thus the IR stops tracking it when it's unsure about aliasing), and we should figure out how to do this analysis better at some point. I don't think we need to do that now, though.

DCA shows the same kinds of alert changes as we've seen on earlier runs (and on QA). Lots of FP removed on cpp/uninitialized-local. A single TP on cpp/redundant-null-check-simple has been lost, but I expect that we will automatically regain that one by improving alias analysis in the future, and this is a step in the right direction.

@MathiasVP MathiasVP marked this pull request as ready for review April 11, 2024 11:09
@geoffw0
Copy link
Contributor

geoffw0 commented Apr 16, 2024

Code - changes are rather minimal and LGTM.

Tests - thanks for explaining the lost test results. It's a bit sad we don't have any positive test changes, since this is apparently motivated by fixing false positives.

DCA

  • the cpp/redundant-null-check-simple result lost does indeed look like a TP.
  • I checked a handful of the cpp/uninitialized-local lost results and, like yo, judged all to be FP or likely FP.
  • decrease in IR inconsistencies 👍

I trust you've created follow-up issues, e.g. for the issue with taint-flow for iterators.

I'd like to briefly try this out locally before approving, but LGTM.

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had a look at some results (cpp/uninitialized-local, cpp/redundant-null-check-simple and cpp/command-line-injection on MRVA top 10, with and without this change). We lose all but one result for cpp/uninitialized-local, it looks like the lost results are FPs and the remaining result is ... complicated (it has to do with a pointer-to-a-pointer being updated in a loop).

I'm happy that's an improvement, and I think it's about time we merge this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants