Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge local variables and parameter declarations when possible #188

Open
frabert opened this issue Oct 19, 2021 · 5 comments
Open

Merge local variables and parameter declarations when possible #188

frabert opened this issue Oct 19, 2021 · 5 comments
Labels
decomp Related to LLVM IR to C decompiler enhancement New feature or request user-story

Comments

@frabert
Copy link
Collaborator

frabert commented Oct 19, 2021

Rellic generates temporary local variables to store parameter values, instead of directly using the parameters themselves:

int foo(int a, int b, int c) {
  a = a * 2;
  return a * b + c;
}

roundtrips to

unsigned int foo(unsigned int a, unsigned int b, unsigned int c);
unsigned int foo(unsigned int a, unsigned int b, unsigned int c) {
    unsigned int a_var0;
    unsigned int b_var1;
    unsigned int c_var2;
    a_var0 = a;
    b_var1 = b;
    c_var2 = c;
    a_var0 = a_var0 * 2U;
    return a_var0 * b_var1 + c_var2;
}

It would be nice if we could detect when variables such as a_var0, b_var1 and c_var2 are redundant and could be substituted with the parameters instead.

@frabert frabert added enhancement New feature or request decomp Related to LLVM IR to C decompiler user-story labels Oct 19, 2021
@surovic
Copy link
Contributor

surovic commented Oct 19, 2021

How does the input LLVM IR look like? We need to remember that we're not in the business of roundtripping C code. Rellic produces C from LLVM IR in a way that captures the semantics of LLVM IR as closely as feasible.

@frabert
Copy link
Collaborator Author

frabert commented Oct 19, 2021

Rellic produces C from LLVM IR in a way that captures the semantics of LLVM IR as closely as feasible.

Uh I think we need to hear @pgoodman about this particular point: I was under the impression he wants to produce the "most beautiful" C code we can from the IR while keeping the semantics the same

@pgoodman
Copy link
Contributor

@surovic we are also in the business of roundtripping C code. That has always been an explicit goal for Rellic. We can talk on Slack about the use-cases. The future of anvill-produced bitcode may also include LLVM debug information metadata, and so gaining first-class support is key.

@surovic
Copy link
Contributor

surovic commented Oct 20, 2021

The future of anvill-produced bitcode may also include LLVM debug information metadata, and so gaining first-class support is key.

Agreed.

My point mainly here is that if we get input like the following:

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @foo(i32 %a, i32 %b, i32 %c) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %c.addr = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  store i32 %c, i32* %c.addr, align 4
  %0 = load i32, i32* %a.addr, align 4
  %mul = mul nsw i32 %0, 2
  store i32 %mul, i32* %a.addr, align 4
  %1 = load i32, i32* %a.addr, align 4
  %2 = load i32, i32* %b.addr, align 4
  %mul1 = mul nsw i32 %1, %2
  %3 = load i32, i32* %c.addr, align 4
  %add = add nsw i32 %mul1, %3
  ret i32 %add
}

I think we should emit C that's true to the bitcode, if we can, instead of doing transformations that the user might not expect or want. I also think that the suggested transformation is more easily done by running an opt LLVM pass, like we do with switch lowering. To illustrate the point:

unsigned int foo(unsigned int a, unsigned int b, unsigned int c);
unsigned int foo(unsigned int a, unsigned int b, unsigned int c) {
    return (a << 1U) * b + c;
}

This is rellic output for the above C source compiled with O1.

@pgoodman
Copy link
Contributor

pgoodman commented Oct 20, 2021

The way I see Rellic is similar but different. Instead of "true to the bitcode," what I want is "if compiled, could be proven equivalent to the bitcode." There is thus a spectrum of of outputs that meets this criteria.

On one end, which would be truest to the bitcode, we have "literal translation of bitcode to stuff with gotos and labels and what not." Somewhere near the middle, we have what Rellic does, which frankly isn't true to the bitcode at all -- it has restructured things in a dramatic way, and via z3, possibly altered the complexity or nature of conditions and other expressions. z3's machinations are opaque. Then further toward the other end of the spectrum, we bring in more and more high-level features, be they names or types or actual language features/syntaxes.

Then there's the bitcode itself. Why is the bitcode the way it is? Sometimes it is full of crap that spills out of LLVM optimization passes, e.g. naming variables with "sroa.blah" prefixes. LLVM variable names are, imo, not a trusted or reliable source of information, but in the absence of something better (debug info) is what we are stuck with. If we have debug info, we ought use it. Other times, bitcode is leaking the myriad order- and target-specific decisions of codegen/instcombine. Why did the bitcode do a big mask here when it could have been a trunc and zext? Well, maybe because it's over-eagerly optimizing for operating on register-sized values because the target architecture will have a pipeline stall if you flip flop between full sized and sub-registers in straight-line code. The consumer of rellic shouldn't be subject to this.

TL;DR: we should accept that we're not true to the bitcode already, and that the bitcode isn't necessarily something that we want to be true to -- just equivalent. Heck, in anvill, we often try to change the bitcode for Rellic's benefit.

I think we should strive for "idiomatic" code, which is itself a wishy-washy target. How do we pin down what idiomatic code is? We look at the output, we make a decision saying "It'd be more comprehensible if it looked this way" and then we make a change. Sometimes those opinions on "what is more comprehensible" are going to raise conflicts -- conflicts that are the same depth as tabs vs. spaces. In those cases, we find a client and they decide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
decomp Related to LLVM IR to C decompiler enhancement New feature or request user-story
Projects
None yet
Development

No branches or pull requests

3 participants