Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/restructure: (Literal) graph isomorphism is not viable device for CFG matching #172

Open
pfalcon opened this issue May 25, 2015 · 29 comments

Comments

@pfalcon
Copy link

pfalcon commented May 25, 2015

Let's take a simplified (ifelse replaced by if) bar.dot from restructure:

digraph bar {
    E -> F
    E -> J
    F -> G
    F -> I
    G -> I
    I -> E
    E [label="entry"]
    F
    G
    I
    J [label="exit"]
}

restructure finds both "if" and "while" in it as expected. But in a quick and optimized loop, node I will be empty. Then, any half-decent compiler will perform http://en.wikipedia.org/wiki/Jump_threading which leads to a graph:

digraph bar {
    E -> F
    E -> J
    F -> G
    F -> E
    G -> I
    I -> E
    E [label="entry"]
    F
    G
    I
    J [label="exit"]
}

restructure fails to recover structured workflow in this graph. And yet it is isomorphic to the previous graph under "jump threading" transformation!

Now the question is how much this applies to your work. In any LLVM IR produced from a structured control flow, any landing site (join node) apparently will be non-empty because of SSA - it will contain PHI. But we're talking decompiler. Did you try to take real jump-threaded assembly, convert it to SSA and match? But conversion to SSA gotta be "local" (adds PHIs to existing nodes, doesn't restructure graph), so, unless you take a step to "normalize" graph so pattern matching has its chances, you won't get far with decompilation.

For reference, in my Python prototype handling that was pretty simple. The idea is that structure loop should have only one back-edge to loop header. If there're more than 1, a common landing site node needs to be inserted, with all edges going to it, and single edge going to loop header.

So, the above isn't a challenge for you, I'm sure it's pretty simple to implement in Go too.

The challenge, if you like it, is following graph from a real-world case. Pay attention to loop at 400000f3. Intuitively, what happens there is there was a while loop with if/else inside, and compiler put it inside out, with looping happening in if/else branches. I'm leaving proving that to you, as well as specifying a transformation to normalize it to a structured form. (But note that it requires not just graph isomorphism checks, but also edge label checks - I mean, if you have this, like with this nice high-level RISC; with some awful CISC it would be more awful (and here I cheatingly simplify, because it's pure luck that in this case a value was allocated to the same register in both branches, so RISC or CISC, general solution is complex and involves SSA and symbolic computation)).

(Note that with a loop at the bottom of the graph, I kinda groked what it is and how it might be matched (or might be not, due to complexity)).

    digraph test {
        test_entry [label="test"];
        test_entry -> test_BBlock_400000a4;
        test_BBlock_400000ba [label="BBlock\n400000ba"];
        test_BBlock_400000ba -> test_BBlock_400000c0 [label="(a3 != 0)"];
        test_BBlock_400000ba -> test_BBlock_400000bd [label="else"];
        test_BBlock_40000146 [label="BBlock\n40000146"];
        test_BBlock_400000bd [label="BBlock\n400000bd"];
        test_BBlock_400000bd -> test_BBlock_400000c0 ;
        test_BBlock_400000f9 [label="BBlock\n400000f9"];
        test_BBlock_400000f9 -> test_BBlock_400000f0 [label="(a3 >= 0x10)"];
        test_BBlock_400000f9 -> test_BBlock_40000102 [label="else"];
        test_BBlock_400000f3 [label="BBlock\n400000f3"];
        test_BBlock_400000f3 -> test_BBlock_400000e0 [label="(a3 == a6)"];
        test_BBlock_400000f3 -> test_BBlock_400000f9 [label="else"];
        test_BBlock_40000140 [label="BBlock\n40000140"];
        test_BBlock_40000140 -> test_BBlock_40000124 [label="(a3 != 0)"];
        test_BBlock_40000140 -> test_BBlock_40000143 [label="else"];
        test_BBlock_4000011c [label="BBlock\n4000011c"];
        test_BBlock_4000011c -> test_BBlock_40000146 [label="(a2 == 0)"];
        test_BBlock_4000011c -> test_BBlock_40000124 [label="else"];
        test_BBlock_40000102 [label="BBlock\n40000102"];
        test_BBlock_40000102 -> test_BBlock_40000105 ;
        test_BBlock_4000013b [label="BBlock\n4000013b"];
        test_BBlock_4000013b -> test_BBlock_40000124 ;
        test_BBlock_40000130 [label="BBlock\n40000130"];
        test_BBlock_40000130 -> test_BBlock_40000130 [label="(a3 < a4)"];
        test_BBlock_40000130 -> test_BBlock_4000013b [label="else"];
        test_BBlock_40000105 [label="BBlock\n40000105"];
        test_BBlock_40000105 -> test_BBlock_4000010d ;
        test_BBlock_40000143 [label="BBlock\n40000143"];
        test_BBlock_40000143 -> test_BBlock_40000124 [label="(a5 != 0)"];
        test_BBlock_40000143 -> test_BBlock_40000146 [label="else"];
        test_BBlock_400000c0 [label="BBlock\n400000c0"];
        test_BBlock_400000c0 -> test_BBlock_400000f3 ;
        test_BBlock_40000124 [label="BBlock\n40000124"];
        test_BBlock_40000124 -> test_BBlock_40000140 [label="(a3 >= a4)"];
        test_BBlock_40000124 -> test_BBlock_40000130 [label="else"];
        test_BBlock_400000f0 [label="BBlock\n400000f0"];
        test_BBlock_400000f0 -> test_BBlock_400000f3 ;
        test_BBlock_400000e0 [label="BBlock\n400000e0"];
        test_BBlock_400000e0 -> test_BBlock_40000105 [label="(a3 < 0x10)"];
        test_BBlock_400000e0 -> test_BBlock_400000f0 [label="else"];
        test_BBlock_400000a4 [label="BBlock\n400000a4"];
        test_BBlock_400000a4 -> test_BBlock_400000c0 [label="(a2 == 0)"];
        test_BBlock_400000a4 -> test_BBlock_400000ba [label="else"];
        test_BBlock_4000010d [label="BBlock\n4000010d"];
        test_BBlock_4000010d -> test_BBlock_4000010d [label="(a3 >= 0x10)"];
        test_BBlock_4000010d -> test_BBlock_4000011c [label="else"];
    }
@mewmew
Copy link
Member

mewmew commented May 25, 2015

For reference, in my Python prototype handling that was pretty simple. The idea is that structure loop should have only one back-edge to loop header. If there're more than 1, a common landing site node needs to be inserted, with all edges going to it, and single edge going to loop header.

Thanks a lot for bringing this to my attention! I was aware that the current design would fail to recover a structured control flow for several valid graphs, but was unaware of exactly what these graphs might look like.

I will reference back to this issue in the future, when extending the capabilities of the control flow analysis. It is quite likely that subgraph isomorphism cannot satisfy all cases and must therefore be used in synergy with other control flow analysis methods. I really appreciate the mitigations ones you've already mentioned (i.e. graph normalization), and would be very keen to discuss further approaches that may solve other issues related to control flow recovery.

@pfalcon
Copy link
Author

pfalcon commented May 27, 2015

Ok, I was keen to see how Cifuentes' dcc deals with this stuff. Her structuring algo isn't based on primitive graph isomorphism, but rather loadsome of dark magic, and her thesis doesn't mention jump threading at all. I found https://github.com/nemerle/dcc as a resurrection of dcc, caveat: the guy patched it heavily, so results may represent departure from the original dcc. Real pain was to generate a 16-bit DOS executable ;-).

So, good news: her algo is stable regardless of jump threading. Bad news: her algo structures trivial infloop-if construct in weird way with gotos:

Input:

int a, b, c;

int main() {
    while (c) {
        if (a) {
            b = 3;
        }
    }
}

Output:

int loc1; /* ax */

    while (((var00028 & loc1) != 0)) {
l1:         var00024 = 3;
    }   /* end of while */
    goto L1;

So, she mentions importance of right order of control struct recognition, but order she suggests is subpar either (again, with a disclaimer that @nemerle broke it).

@pfalcon
Copy link
Author

pfalcon commented May 30, 2015

Per nemerle/dcc#2 (comment) , I confirm that original dcc decompiler produces structuring like above, i.e. sub-ideal.

Few more comments on dcc algo: while Cifuentes performs various transformations on CFG, ultimately, how structuring happens is that original CFG is always kept and just annotated based on information discovered/computed. That makes full sense, as otherwise (if you perform reductions on the original graph) it's hard to deal with completely unstructured features, like goto. Suppose for example you have an (non-local in a "local structure sub-graph" sense) edge somewhere inside a loop. Such edge will certainly will be represented by goto, but it shouldn't hamper recognition of the loop, it should be still possible to record destination of goto, even if it's inside a newly created loop node. The easiest way to achieve the latter is to perform transformations on copy of CFG, but back-annotate them to the original CFG. (Sorry if this is obvious, in my implementation I actually reduce the CFG graph.)

@pfalcon
Copy link
Author

pfalcon commented May 30, 2015

int a, b, c;

int main() {
    while (1) {
        if (a) {
            b = 3;
        }
    }
}

Decompiled by Boomerang (built from https://github.com/nemerle/boomerang):

__size32 a;// 4 bytes
__size32 b;// 4 bytes
// address: 0x80483ed
void main() {
L-1094342288:
L-1094342288:
    if (a == 0) {
        b = 3;
        goto L-1094342288;
    }
    goto L-1094342288;
}

Can only end this with quote from Van Emmerik's (Boomerang developer) thesis: "The problem of transforming unstructured code into high level language constructs such as conditional and loops is called structuring. This problem is largely solved [reference to Cifuentes' thesis]."

@mewmew
Copy link
Member

mewmew commented May 31, 2015

Can only end this with quote from Van Emmerik's (Boomerang developer) thesis: "The problem of transforming unstructured code into high level language constructs such as conditional and loops is called structuring. This problem is largely solved [reference to Cifuentes' thesis]."

Reading Van Emmerik's paper is on my TODO list :)

C. Cifuentes focused on semantical equivalence as she wanted to use the decompiled output for forensic investigations. For this reason, Cifuentes used gotos when structuring irreducible graphs.

In contrast S. Moll focused on functional equivalence and structured irreducible graphs by using node splitting, which is similar to the approach you mentioned earlier. If you haven't read it already, I would recommend reading Decompilation of LLVM IR by Simon Moll. Below is an extract related to node splitting from this paper:

Node Splitting is an acknowledged technique for establishing control-flow reducibility. In fact, randomly splitting nodes in a CFG will eventually yield a reducible CFG. However, it is desirable to reduce the amount of copies in the process.

@pfalcon
Copy link
Author

pfalcon commented May 31, 2015

For this reason, Cifuentes used gotos when structuring irreducible graphs.

We don't talk irreducible graphs. We talk trivial if-inside-while cases.

Btw, do you have another paper Van Emmerik refers to in that passage - "Structuring Assembly Programs" by D. Simon?

@mewmew
Copy link
Member

mewmew commented May 31, 2015

I found it on web.archive.org. My PDF reader (evince) was only able to display the last page. Using Gimp however I was able to view any page, but only one at the time. Let me know if you find a reader which is capable of displaying this file.

Structuring Assembly Programs

@pfalcon
Copy link
Author

pfalcon commented Jun 1, 2015

Using archive.org for this is good idea, thanks ;-). I see the problem - paper comes without title page, so poorly indexed by various engine. Here's an pdf: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.112.5256

@mewmew
Copy link
Member

mewmew commented Jun 1, 2015

Thanks for the PDF. Btw, you gave me the idea a few days back to take a look at how other decompilers solve the control flow recovery. I came across a few projects which may be interesting. Have you seen all of these before?

@pfalcon
Copy link
Author

pfalcon commented Jun 1, 2015

Sure, as an open-source developer, I always do my homework to see if there's existing project which can be reused/contributed too. I gave example of my communication with EiNSTeiN-/decompiler's decompiler author in #168 (comment) e.g.

I guess generic review of other open-source decompilers should be moved to a separate ticket, I mention dcc/boomerang here because: 1) they are backed by good detailed papers (well, vice-versa, papers have source code attached which allows quickly validate how advanced are proposed formalisms/algorithms); 2) they both are well-known implementation which attempt to cover general problem of decompilation, not just working on toy artificial examples.

My next testing subjects in queue are smartdec and its newer incarnation snowman. That's another general decompiler, albeit not backed by papers.

@mewmew
Copy link
Member

mewmew commented Jun 1, 2015

I just came across a really interesting paper titled No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantics-Preserving Transformations.

Here is an extract from the paper:

In this paper, we overcome the limitations of structural analysis and improve the state of the art by presenting a novel approach to control-flow structuring that is able to recover
all high-level control constructs and produce structured code without a single goto statement. To the best of our knowledge, this is the first control-flow structuring algorithm to offer a
completely goto -free output.

The key intuition behind our approach is based on two observations: (1) high-level control
constructs have a single entry point and a single successor point, and (2) the type and nesting of high-level control constructs are reflected by the logical conditions that determine
when CFG nodes are reached.

Although not strictly related, (1) reminded me of what Cifuentes refers to as intervals in graphs, which she defines as:

Definition 57: Given a node h, an interval I(h) is the maximal, single-entry subgraph in which h is the only entry node and in which all closed paths contain h. The unique interval node h is called the interval head or simply the header node.

I think it would be possible to locate a wide variety of high-level control flow primitives, simply by locating and working with intervals in the CFG. For instance, it should be trivial to implement multi-level continue and break statements in nested loops, by identifying conditionals within intervals of loops which either branch to the interval header (i.e. continue) or the interval exit (i.e. break). I will play around with this idea during the next couple of days.

To facilitate this work, I wrote a small tool which locates each interval (with two or more nodes) in a given graph. For anyone who wish to try the tool, run:

$ go get decomp.org/x/graphs/cmd/interval
$ interval GRAPH.dot

And it will produce a set of DOT files, one for each interval in GRAPH.dot.

@pfalcon
Copy link
Author

pfalcon commented Jun 3, 2015

To the best of our knowledge, this is the first control-flow structuring algorithm to offer a completely goto-free output.

This is why there's saying "this is computer science, not real science" ;-). Take a vulgar boolean proliferation idea (ab)used by people for decades, apply to a new domain (decompilation, instead of redrawing flowcharts by hand, like it was used 50 years ago) and look how fresh it is! No, I mean that, paper is brilliant!

control-flow structuring that is able to recover all high-level control constructs

Yeah, including those which are not there. Cifuentes looks at such approaches with disapproval.

There're few interesting points the paper makes. E.g., that there's (big) difference between conditional nodes and other nodes. Indeed, I had that intuitive hunch either. So, why nobody else makes that distinction? Well, because it doesn't exist! It's all the same in the general case, because conditions are also computed like everything else. I.e. there's big difference between conditions and everything else from HLL perspective, but machine code doesn't care, conclusion to which they (partially/indirectly) arrive in section IV.D Side Effects.

But perhaps it's really worth knowing the fact that any CFG node with 2 exits is conditional node. This would be "d'oh" moment for a layman, because that's the definition of a conditional node. But we know that a fact that condition node has 2 exits doesn't imply that any 2-exit node is "if". Mixing up these 2 facts (well, not paying enough attention to it) is what makes dcc output "while" as "if", and then stick "goto" on top. So, properly structuring criteria for structuring CFG is prerequisite for nicely-looking structuring ;-). The key insight is that there are conditional nodes. Conditional nodes have conditions associated, and you need to manipulate these conditions to do graph structuring. I mentioned that in the original description above when presenting "quiz" (which has conditions like "(a3 >= 0x10)" and "(a3 < 0x10)", and for understanding the graph, it's important to understand that those conditions are complementary). Their lago does a lot of condition munging, and goes a far way. The bottom line is in the title of the ticket, from the beginning.

Anyway, going next. They call it "Pattern-independent structuring", but come on, they cheat. They structure "if" nodes based on pattern above (if's are 2-exits with no backward edges involved, quite a pattern). And for loops, they don't search for patterns on graphs, but flatten them and search for patterns on these flattened presentation. That doesn't make it more performant (it's already NP), but quite may add another O(N) factor to existing complexity (maybe not, I can't grok such things intuitively). But if it makes resulting code more readable, it's great! I don't see the proof that's a prerequisite for producing more readable code (i.e. for recognizing control structures which are there). But again, if that algo is more understandable than other, very nice.

Another point to make is that there's lot of ambiguity in structuring graphs, and it all heuristics how to do that. See Fig. 3, subgraph R1. There're clearly 2 loops in the graph, why do they structure c1-while inside c3-dowhile? It's either guided by association with real memory addresses or by the "size" of loop (if it's smaller, it should be "inside"). Both are heuristic, there's nothing like that in a pure graph. My algorithm would structure it even differently, it looks to have single back-edge, and then of course structures internal nodes as if's, not as extra loops. So, my algo would produce something like:

do {
  if (c1) {
    n1
  } else {
    if (c2) break
    n3
  }
} while (c3)

(my algo doesn't yet grok breaks). While is more readable - with 'if" ladders or with enclosed loops? What would original source likely contain?

@pfalcon
Copy link
Author

pfalcon commented Jun 3, 2015

I did some reading too, but I tried to follow not "exciting new" stuff, but various alternative approaches, and more importantly, categorize them.

Here's one attempt at classification: http://www.softpanorama.org/Algorithms/Decompilation/decompilation_of_control_structures.shtml

And this paper both presents an algo classifies prior art (down to 1966): Using Hammock Graphs to Structure Programs, Fubo Zhang and Erik H. D’Hollander http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.70.6464 . Their algo itself works just on the same view of things as Dream's thing does, except they follow the other dichotomy of that approach: instead of proliferation on non-existing boolean expression, they proliferate existing things to places where they don't exist (node splitting). Old and boring? No, they give "compelling" reasons why it's cool! ;-)

Dream's guys show level-up-from trivial graphs and show how well their algo deals with them, and that's why it "clicks" so well. What they don't do is to show cases where their algo produces bloated, sub-readable code. Well, perhaps such cases are rare in modern C-compiler code? Perhaps that's the big contribution of paper - to show that old methods of 1960ies now give non-awful results?

Zhang/D’Hollander don't do it that simple, and analyze "unpleasant" cases. But again, even their output in considerably longer than alternatives, they give reasons to not outright reject their output - for their target domain, which is parallelizing compilation. In our target domain, decompilation, longer programs are unlikely good thing - if a single goto has to be worked around by non-existing condition var checked in 10 lines, then it's hard to argue that it makes it "more readable", and it makes sense to just dump that goto.

@pfalcon
Copy link
Author

pfalcon commented Jun 3, 2015

(A lot of the above is thoughtstream, but well, I either try to record it, or they will be just lost).

@pfalcon
Copy link
Author

pfalcon commented Jun 3, 2015

I wrote a small tool which locates each interval

Nice. So, you read Cifuentes, she argues for interval-based approach, and you went for graph isomorphism search. Now you read paper which argues for another algo, and mentions intervals just for perspective, and you went for intervals ;-).

@pfalcon
Copy link
Author

pfalcon commented Jun 3, 2015

Papers from which Cifuentes cited unstructured patterns: http://comjnl.oxfordjournals.org/content/20/1/45.full.pdf , http://comjnl.oxfordjournals.org/content/21/2/161.full.pdf . "This paper examines the basic flow diagram substructrures, the presence of which in a flow diagram causes the flow diagram to be unstructured, and proves that these are the only structures which lead to unstructuredness."

Intuitively, if set on unstructured elements is finite an enumerable, then we can add them control structures repertoire and make any graph "structured". Realistically though, that won't help much, as these structure non-intuitive and include non single-entry/single-exit regions, which is, well, what makes them non-intuitive and non-representable in one-dimensional "structured" presentation we use.

However, if we can match them in a graph, they may provide important metric of decompiler quality: for example, if such structures are not matches in a graph, but decompiler produced gotos, then the decompiler didn't do well enough. And vice versa, if such structures are matches, that those are the only places which warrant usages of gotos, all other places should be goto-free. Because well, good decompiler should not produce too many gotos, or too few (zero) of them. It should produce as few as would be used in original (well written) source code.

@pfalcon
Copy link
Author

pfalcon commented Jun 3, 2015

Intuitively, if set on unstructured elements is finite an enumerable, then we can add them control structures repertoire and make any graph "structured".

And that's what our friend C does! "loop multiple exits" are breaks, "loop multiple entry" is (with a bow to Dream paper) switch-in-a-loop (that's why C's switch cases are fall-thru - the C authors did their homework ;-) (well, saying goes that C and Unix were delivered in virtual scriptures from above, but folks to whom they were destined did a lot of mistakes writing them down, and later got filled with hubris and went for their own plan9's an go's, but nobody cared ;-) ) ).

@mewmew
Copy link
Member

mewmew commented Jun 3, 2015

(A lot of the above is thoughtstream, but well, I either try to record it, or they will be just lost).

For the record, it is very much appreciated! If anything, it is quite clear that you have a scientific background. I have learned and know I will continue to learn a great deal from you. What I've done to this point is obviously of no news to you. Hopefully you will still find it worthwhile to hang around, it's nice to have someone to discuss these things with. I especially appreciate that you point out weaknesses when you see them, as this is fundamental if one is every to improve.

@mewmew
Copy link
Member

mewmew commented Jun 3, 2015

Here's one attempt at classification: http://www.softpanorama.org/Algorithms/Decompilation/decompilation_of_control_structures.shtml

Thanks for this!

@mewmew
Copy link
Member

mewmew commented Jun 3, 2015

Nice. So, you read Cifuentes

The quality of Cifuentes' paper is what sparked my interested in reading papers. Now, I'd rather read through a paper when travelling by public transport than watching an episode of <insert random series here>. It's obviously quite new to me, but well written papers are seductive.

she argues for interval-based approach, and you went for graph isomorphism search. Now you read paper which argues for another algo, and mentions intervals just for perspective, and you went for intervals ;-).

My approach may not be structured, but it is quite enjoyable! :)

@mewmew
Copy link
Member

mewmew commented Jun 3, 2015

However, if we can match them in a graph, they may provide important metric of decompiler quality: for example, if such structures are not matches in a graph, but decompiler produced gotos, then the decompiler didn't do well enough. And vice versa, if such structures are matches, that those are the only places which warrant usages of gotos, all other places should be goto-free. Because well, good decompiler should not produce too many gotos, or too few (zero) of them. It should produce as few as would be used in original (well written) source code.

I like this metric. Need to give it some further thought, but it resonates well with me.

@pfalcon
Copy link
Author

pfalcon commented Jun 4, 2015

For the record, it is very much appreciated!

Thanks. I'm happy to dump my thoughts as long as I work on that stuff, and I don't want to drop the ball until until I've brushed up what I already have (or it likely will be "lost"). Btw, I've come up with a plan for my code - it will be standalone tool, so people could easily hack on it without dependencies, but also should be usable as a plugin for scratchabit. So, I started a repo: https://github.com/pfalcon/ScratchABlock (little interesting there so far).

@mewmew
Copy link
Member

mewmew commented Jun 4, 2015

So, I started a repo: https://github.com/pfalcon/ScratchABlock (little interesting there so far).

I have been following ("watching" in GitHub terminology) this repo since a few days back :)

@pfalcon
Copy link
Author

pfalcon commented Jun 23, 2015

FYI, I flushed my previous developments to https://github.com/pfalcon/ScratchABlock , with tests.

See also https://github.com/EiNSTeiN-/decompiler/issues/11#issuecomment-112245316 for another (simple) example where simple graph isomorphism won't work.

@mewmew
Copy link
Member

mewmew commented Jun 27, 2015

FYI, I flushed my previous developments to https://github.com/pfalcon/ScratchABlock , with tests.

Thanks for the notice! I've been tracking your commits for some time. When I get a chance, I'll take a deeper look and play around with it.

See also EiNSTeiN-/decompiler#11 (comment) for another (simple) example where simple graph isomorphism won't work.

Thank for the example. It is true that the assembly representation of simple while loops often consists of an if statement with a do-while body. Normalization of the CFG through isomorphic transformations may help mitigate this issue. I would still like to continue investigating how intervals may affect control flow analysis, as they should be able to solve the issue of nested breaks and continues in loops.

@pfalcon
Copy link
Author

pfalcon commented Jun 28, 2015

I would still like to continue investigating how intervals may affect control flow analysis, as they should be able to solve the issue of nested breaks and continues in loops.

By all means, please do. I in the meantime started generic loop recognition based on generalized loop graph structure (surface structure, without calling for "deep" devices such as intervals), it will be interesting to compare results later.

@mewmew
Copy link
Member

mewmew commented Jun 29, 2015

By all means, please do. I in the meantime started generic loop recognition based on generalized loop graph structure (surface structure, without calling for "deep" devices such as intervals), it will be interesting to compare results later.

Wonderful. It will indeed be interesting to compare the results!

@mewmew mewmew added this to the v0.2 milestone May 25, 2016
@mewmew mewmew changed the title (Literal) graph isomorphism is not viable device for CFG matching cmd/restructure: (Literal) graph isomorphism is not viable device for CFG matching May 25, 2016
@mewmew mewmew modified the milestones: v0.4, v0.2 Mar 23, 2017
@mewmew mewmew modified the milestones: v0.4, v0.3 Jan 30, 2018
@mewmew
Copy link
Member

mewmew commented Jan 30, 2019

By all means, please do. I in the meantime started generic loop recognition based on generalized loop graph structure (surface structure, without calling for "deep" devices such as intervals), it will be interesting to compare results later.

Hi @pfalcon!

Now I have the implementation complete. Haha, sure took some time. Would you still be interested in comparing results? I most definitely would. If so let me know.

These are the results I get using the Hammock method (subgraph isomorphism) and Cifuentes' Interval method, when running on the test programs of GNU Coreutils (version 8.30) and SQLite (version 3.25.0).

The results are to be interpreted as follows.

  • true positive: control flow primitive recovered and present in the original source code.
  • false positive: control flow primitive recovered but not present in the original source code.
  • false negative: control flow primitive present in the original source code but not recovered.

results_combined

The Hammock method correctly recovered 42.16% of the control flow primitives of the test programs (true positives). On average, for every 10 control flow primitives of the original source code, the Hammock method recovered 3.5 control flow primitives that were not present of the original source code (false positives).

The Interval method correctly recovered 74.28% of the control flow primitives of the test programs (true positives). On average, for every 10 control flow primitives of the original source code, the Interval method recovered 5.9 control flow primitives that were not present of the original source code (false positives).

Cheers,
Robin

@pfalcon
Copy link
Author

pfalcon commented Feb 3, 2019

@mewmew: Congrat on the milestone! Cute graphs ;-).

Well, I'm not sure how much it would be possible to compare. I believe that generic loop structuring algo is still in my git stash, unfinished. And my ScratcABlock is definitely not up to decompiling GNU Coreutils. (In the sense that it may produce some output, but I wouldn't be brave enough to wade thru it ;-) ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants