-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/restructure: (Literal) graph isomorphism is not viable device for CFG matching #172
Comments
Thanks a lot for bringing this to my attention! I was aware that the current design would fail to recover a structured control flow for several valid graphs, but was unaware of exactly what these graphs might look like. I will reference back to this issue in the future, when extending the capabilities of the control flow analysis. It is quite likely that subgraph isomorphism cannot satisfy all cases and must therefore be used in synergy with other control flow analysis methods. I really appreciate the mitigations ones you've already mentioned (i.e. graph normalization), and would be very keen to discuss further approaches that may solve other issues related to control flow recovery. |
Ok, I was keen to see how Cifuentes' dcc deals with this stuff. Her structuring algo isn't based on primitive graph isomorphism, but rather loadsome of dark magic, and her thesis doesn't mention jump threading at all. I found https://github.com/nemerle/dcc as a resurrection of dcc, caveat: the guy patched it heavily, so results may represent departure from the original dcc. Real pain was to generate a 16-bit DOS executable ;-). So, good news: her algo is stable regardless of jump threading. Bad news: her algo structures trivial infloop-if construct in weird way with gotos: Input:
Output:
So, she mentions importance of right order of control struct recognition, but order she suggests is subpar either (again, with a disclaimer that @nemerle broke it). |
Per nemerle/dcc#2 (comment) , I confirm that original dcc decompiler produces structuring like above, i.e. sub-ideal. Few more comments on dcc algo: while Cifuentes performs various transformations on CFG, ultimately, how structuring happens is that original CFG is always kept and just annotated based on information discovered/computed. That makes full sense, as otherwise (if you perform reductions on the original graph) it's hard to deal with completely unstructured features, like goto. Suppose for example you have an (non-local in a "local structure sub-graph" sense) edge somewhere inside a loop. Such edge will certainly will be represented by goto, but it shouldn't hamper recognition of the loop, it should be still possible to record destination of goto, even if it's inside a newly created loop node. The easiest way to achieve the latter is to perform transformations on copy of CFG, but back-annotate them to the original CFG. (Sorry if this is obvious, in my implementation I actually reduce the CFG graph.) |
Decompiled by Boomerang (built from https://github.com/nemerle/boomerang):
Can only end this with quote from Van Emmerik's (Boomerang developer) thesis: "The problem of transforming unstructured code into high level language constructs such as conditional and loops is called structuring. This problem is largely solved [reference to Cifuentes' thesis]." |
Reading Van Emmerik's paper is on my TODO list :) C. Cifuentes focused on semantical equivalence as she wanted to use the decompiled output for forensic investigations. For this reason, Cifuentes used gotos when structuring irreducible graphs. In contrast S. Moll focused on functional equivalence and structured irreducible graphs by using node splitting, which is similar to the approach you mentioned earlier. If you haven't read it already, I would recommend reading Decompilation of LLVM IR by Simon Moll. Below is an extract related to node splitting from this paper:
|
We don't talk irreducible graphs. We talk trivial if-inside-while cases. Btw, do you have another paper Van Emmerik refers to in that passage - "Structuring Assembly Programs" by D. Simon? |
I found it on web.archive.org. My PDF reader (evince) was only able to display the last page. Using Gimp however I was able to view any page, but only one at the time. Let me know if you find a reader which is capable of displaying this file. |
Using archive.org for this is good idea, thanks ;-). I see the problem - paper comes without title page, so poorly indexed by various engine. Here's an pdf: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.112.5256 |
Thanks for the PDF. Btw, you gave me the idea a few days back to take a look at how other decompilers solve the control flow recovery. I came across a few projects which may be interesting. Have you seen all of these before? |
Sure, as an open-source developer, I always do my homework to see if there's existing project which can be reused/contributed too. I gave example of my communication with EiNSTeiN-/decompiler's decompiler author in #168 (comment) e.g. I guess generic review of other open-source decompilers should be moved to a separate ticket, I mention dcc/boomerang here because: 1) they are backed by good detailed papers (well, vice-versa, papers have source code attached which allows quickly validate how advanced are proposed formalisms/algorithms); 2) they both are well-known implementation which attempt to cover general problem of decompilation, not just working on toy artificial examples. My next testing subjects in queue are smartdec and its newer incarnation snowman. That's another general decompiler, albeit not backed by papers. |
I just came across a really interesting paper titled No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations. Here is an extract from the paper:
Although not strictly related, (1) reminded me of what Cifuentes refers to as intervals in graphs, which she defines as:
I think it would be possible to locate a wide variety of high-level control flow primitives, simply by locating and working with intervals in the CFG. For instance, it should be trivial to implement multi-level To facilitate this work, I wrote a small tool which locates each interval (with two or more nodes) in a given graph. For anyone who wish to try the tool, run:
And it will produce a set of DOT files, one for each interval in GRAPH.dot. |
This is why there's saying "this is computer science, not real science" ;-). Take a vulgar boolean proliferation idea (ab)used by people for decades, apply to a new domain (decompilation, instead of redrawing flowcharts by hand, like it was used 50 years ago) and look how fresh it is! No, I mean that, paper is brilliant!
Yeah, including those which are not there. Cifuentes looks at such approaches with disapproval. There're few interesting points the paper makes. E.g., that there's (big) difference between conditional nodes and other nodes. Indeed, I had that intuitive hunch either. So, why nobody else makes that distinction? Well, because it doesn't exist! It's all the same in the general case, because conditions are also computed like everything else. I.e. there's big difference between conditions and everything else from HLL perspective, but machine code doesn't care, conclusion to which they (partially/indirectly) arrive in section IV.D Side Effects. But perhaps it's really worth knowing the fact that any CFG node with 2 exits is conditional node. This would be "d'oh" moment for a layman, because that's the definition of a conditional node. But we know that a fact that condition node has 2 exits doesn't imply that any 2-exit node is "if". Mixing up these 2 facts (well, not paying enough attention to it) is what makes dcc output "while" as "if", and then stick "goto" on top. So, properly structuring criteria for structuring CFG is prerequisite for nicely-looking structuring ;-). The key insight is that there are conditional nodes. Conditional nodes have conditions associated, and you need to manipulate these conditions to do graph structuring. I mentioned that in the original description above when presenting "quiz" (which has conditions like "(a3 >= 0x10)" and "(a3 < 0x10)", and for understanding the graph, it's important to understand that those conditions are complementary). Their lago does a lot of condition munging, and goes a far way. The bottom line is in the title of the ticket, from the beginning. Anyway, going next. They call it "Pattern-independent structuring", but come on, they cheat. They structure "if" nodes based on pattern above (if's are 2-exits with no backward edges involved, quite a pattern). And for loops, they don't search for patterns on graphs, but flatten them and search for patterns on these flattened presentation. That doesn't make it more performant (it's already NP), but quite may add another O(N) factor to existing complexity (maybe not, I can't grok such things intuitively). But if it makes resulting code more readable, it's great! I don't see the proof that's a prerequisite for producing more readable code (i.e. for recognizing control structures which are there). But again, if that algo is more understandable than other, very nice. Another point to make is that there's lot of ambiguity in structuring graphs, and it all heuristics how to do that. See Fig. 3, subgraph R1. There're clearly 2 loops in the graph, why do they structure c1-while inside c3-dowhile? It's either guided by association with real memory addresses or by the "size" of loop (if it's smaller, it should be "inside"). Both are heuristic, there's nothing like that in a pure graph. My algorithm would structure it even differently, it looks to have single back-edge, and then of course structures internal nodes as if's, not as extra loops. So, my algo would produce something like:
(my algo doesn't yet grok breaks). While is more readable - with 'if" ladders or with enclosed loops? What would original source likely contain? |
I did some reading too, but I tried to follow not "exciting new" stuff, but various alternative approaches, and more importantly, categorize them. Here's one attempt at classification: http://www.softpanorama.org/Algorithms/Decompilation/decompilation_of_control_structures.shtml And this paper both presents an algo classifies prior art (down to 1966): Using Hammock Graphs to Structure Programs, Fubo Zhang and Erik H. D’Hollander http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.70.6464 . Their algo itself works just on the same view of things as Dream's thing does, except they follow the other dichotomy of that approach: instead of proliferation on non-existing boolean expression, they proliferate existing things to places where they don't exist (node splitting). Old and boring? No, they give "compelling" reasons why it's cool! ;-) Dream's guys show level-up-from trivial graphs and show how well their algo deals with them, and that's why it "clicks" so well. What they don't do is to show cases where their algo produces bloated, sub-readable code. Well, perhaps such cases are rare in modern C-compiler code? Perhaps that's the big contribution of paper - to show that old methods of 1960ies now give non-awful results? Zhang/D’Hollander don't do it that simple, and analyze "unpleasant" cases. But again, even their output in considerably longer than alternatives, they give reasons to not outright reject their output - for their target domain, which is parallelizing compilation. In our target domain, decompilation, longer programs are unlikely good thing - if a single goto has to be worked around by non-existing condition var checked in 10 lines, then it's hard to argue that it makes it "more readable", and it makes sense to just dump that goto. |
(A lot of the above is thoughtstream, but well, I either try to record it, or they will be just lost). |
Nice. So, you read Cifuentes, she argues for interval-based approach, and you went for graph isomorphism search. Now you read paper which argues for another algo, and mentions intervals just for perspective, and you went for intervals ;-). |
Papers from which Cifuentes cited unstructured patterns: http://comjnl.oxfordjournals.org/content/20/1/45.full.pdf , http://comjnl.oxfordjournals.org/content/21/2/161.full.pdf . "This paper examines the basic flow diagram substructrures, the presence of which in a flow diagram causes the flow diagram to be unstructured, and proves that these are the only structures which lead to unstructuredness." Intuitively, if set on unstructured elements is finite an enumerable, then we can add them control structures repertoire and make any graph "structured". Realistically though, that won't help much, as these structure non-intuitive and include non single-entry/single-exit regions, which is, well, what makes them non-intuitive and non-representable in one-dimensional "structured" presentation we use. However, if we can match them in a graph, they may provide important metric of decompiler quality: for example, if such structures are not matches in a graph, but decompiler produced gotos, then the decompiler didn't do well enough. And vice versa, if such structures are matches, that those are the only places which warrant usages of gotos, all other places should be goto-free. Because well, good decompiler should not produce too many gotos, or too few (zero) of them. It should produce as few as would be used in original (well written) source code. |
And that's what our friend C does! "loop multiple exits" are breaks, "loop multiple entry" is (with a bow to Dream paper) switch-in-a-loop (that's why C's switch cases are fall-thru - the C authors did their homework ;-) (well, saying goes that C and Unix were delivered in virtual scriptures from above, but folks to whom they were destined did a lot of mistakes writing them down, and later got filled with hubris and went for their own plan9's an go's, but nobody cared ;-) ) ). |
For the record, it is very much appreciated! If anything, it is quite clear that you have a scientific background. I have learned and know I will continue to learn a great deal from you. What I've done to this point is obviously of no news to you. Hopefully you will still find it worthwhile to hang around, it's nice to have someone to discuss these things with. I especially appreciate that you point out weaknesses when you see them, as this is fundamental if one is every to improve. |
Thanks for this! |
The quality of Cifuentes' paper is what sparked my interested in reading papers. Now, I'd rather read through a paper when travelling by public transport than watching an episode of
My approach may not be structured, but it is quite enjoyable! :) |
I like this metric. Need to give it some further thought, but it resonates well with me. |
Thanks. I'm happy to dump my thoughts as long as I work on that stuff, and I don't want to drop the ball until until I've brushed up what I already have (or it likely will be "lost"). Btw, I've come up with a plan for my code - it will be standalone tool, so people could easily hack on it without dependencies, but also should be usable as a plugin for scratchabit. So, I started a repo: https://github.com/pfalcon/ScratchABlock (little interesting there so far). |
I have been following ("watching" in GitHub terminology) this repo since a few days back :) |
FYI, I flushed my previous developments to https://github.com/pfalcon/ScratchABlock , with tests. See also https://github.com/EiNSTeiN-/decompiler/issues/11#issuecomment-112245316 for another (simple) example where simple graph isomorphism won't work. |
Thanks for the notice! I've been tracking your commits for some time. When I get a chance, I'll take a deeper look and play around with it.
Thank for the example. It is true that the assembly representation of simple while loops often consists of an if statement with a do-while body. Normalization of the CFG through isomorphic transformations may help mitigate this issue. I would still like to continue investigating how intervals may affect control flow analysis, as they should be able to solve the issue of nested breaks and continues in loops. |
By all means, please do. I in the meantime started generic loop recognition based on generalized loop graph structure (surface structure, without calling for "deep" devices such as intervals), it will be interesting to compare results later. |
Wonderful. It will indeed be interesting to compare the results! |
Hi @pfalcon! Now I have the implementation complete. Haha, sure took some time. Would you still be interested in comparing results? I most definitely would. If so let me know. These are the results I get using the Hammock method (subgraph isomorphism) and Cifuentes' Interval method, when running on the test programs of GNU Coreutils (version 8.30) and SQLite (version 3.25.0). The results are to be interpreted as follows.
The Hammock method correctly recovered 42.16% of the control flow primitives of the test programs (true positives). On average, for every 10 control flow primitives of the original source code, the Hammock method recovered 3.5 control flow primitives that were not present of the original source code (false positives). The Interval method correctly recovered 74.28% of the control flow primitives of the test programs (true positives). On average, for every 10 control flow primitives of the original source code, the Interval method recovered 5.9 control flow primitives that were not present of the original source code (false positives). Cheers, |
@mewmew: Congrat on the milestone! Cute graphs ;-). Well, I'm not sure how much it would be possible to compare. I believe that generic loop structuring algo is still in my git stash, unfinished. And my ScratcABlock is definitely not up to decompiling GNU Coreutils. (In the sense that it may produce some output, but I wouldn't be brave enough to wade thru it ;-) ). |
Let's take a simplified (ifelse replaced by if) bar.dot from restructure:
restructure finds both "if" and "while" in it as expected. But in a quick and optimized loop, node
I
will be empty. Then, any half-decent compiler will perform http://en.wikipedia.org/wiki/Jump_threading which leads to a graph:restructure fails to recover structured workflow in this graph. And yet it is isomorphic to the previous graph under "jump threading" transformation!
Now the question is how much this applies to your work. In any LLVM IR produced from a structured control flow, any landing site (join node) apparently will be non-empty because of SSA - it will contain PHI. But we're talking decompiler. Did you try to take real jump-threaded assembly, convert it to SSA and match? But conversion to SSA gotta be "local" (adds PHIs to existing nodes, doesn't restructure graph), so, unless you take a step to "normalize" graph so pattern matching has its chances, you won't get far with decompilation.
For reference, in my Python prototype handling that was pretty simple. The idea is that structure loop should have only one back-edge to loop header. If there're more than 1, a common landing site node needs to be inserted, with all edges going to it, and single edge going to loop header.
So, the above isn't a challenge for you, I'm sure it's pretty simple to implement in Go too.
The challenge, if you like it, is following graph from a real-world case. Pay attention to loop at 400000f3. Intuitively, what happens there is there was a while loop with if/else inside, and compiler put it inside out, with looping happening in if/else branches. I'm leaving proving that to you, as well as specifying a transformation to normalize it to a structured form. (But note that it requires not just graph isomorphism checks, but also edge label checks - I mean, if you have this, like with this nice high-level RISC; with some awful CISC it would be more awful (and here I cheatingly simplify, because it's pure luck that in this case a value was allocated to the same register in both branches, so RISC or CISC, general solution is complex and involves SSA and symbolic computation)).
(Note that with a loop at the bottom of the graph, I kinda groked what it is and how it might be matched (or might be not, due to complexity)).
The text was updated successfully, but these errors were encountered: