Intro

/cc @endeav0r @flanfly @XVilka @ZhangZhuoSJTU

The primary reason for a Function is to:

Consideration 1: Trait or Struct?

We could represent a Function in our shared framework either as a trait, or as a struct containing the IL:
```rust
pub struct Function<IL> {
    pub name: Cow<'static, str>,
    entry_point: u64,
    instructions: Vec<IL>, // in reverse postorder
    basic_blocks: Vec<BasicBlock>,
    mnemonics: Vec<Mnemonic>,
}
```
There are advantages to both approaches; at a later date, I or others can expand on the advantages of each.
Consideration 2: Disassembler
A function cannot really exist without a disassembler backend.
We should strive to be flexible: it is very unlikely that panopticon or falcon or radare will swap out their disassembler pipelines, and that is fine.
Nevertheless, this means we need to:
- Understand the disassembler pipeline in a robust way
- Represent this ergonomically
- Represent this unobtrusively
I consider this to be one of the major difficulties in the design of the Function, for reasons I, again, won't get into at the moment.
Consideration 3: Efficiency and Ergonomics
There is always the danger of premature optimization; in this case however, I will argue there isn't anything premature at all.
Panopticon (and falcon, and I'm sure radare) have already prototyped function implementations. In the course of doing so, in panopticon at least, after working with it for a while, several things became clear:
- Iterating the cfg directly is painful (ergonomically)
- Allocating basic blocks separately is painful (to the processor)
- A natural representation of an IL is not scalable (to RAM)
I don't see the point of refactoring panopticon's and friends' function representations simply to share them; while that is an interesting goal, it does not justify the amount of effort in and of itself. Sharing an efficient and ergonomic implementation, however, very much would.
Consequently, I argue that considerations 1-3 should be first and foremost when designing Function - our data representation, if successful, will serve as a foundation and building block for the rest of our applications, so it's crucial that we get it right: extensible, maintainable, ergonomic, and efficient.
Ergonomics + Iterators
In my experience, many aspects of Function ergonomics are captured by iterators, as written above - iterators over the collected IL, over the basic blocks, and over the mnemonics. Not only that, the code becomes "elegant" and easier to read.
This is something I do very frequently:
```rust
for bb in function.basic_blocks() {
    for statement in bb.statements() {
        match statement {
            // ... handle each statement kind
            _ => {}
        }
    }
}
```
and it makes me happy.
Using iterators, I believe, is ergonomic and efficient; it allows the client to perform processor-friendly iterations, and to collect the iterator if they need more control, etc.
Lastly, with an eye to the future, once impl Trait lands in Rust, not only will unboxed, complicated iterator chains be possible as a means of extending the Function API, they will also be efficient and easy to write.
So something complicated like this (probably) won't allocate, once we can do impl Trait:
```rust
/// Return a boxed iterator over every statement in this function
pub fn statements<'b>(&'b self) -> Box<Iterator<Item = &'b Statement> + 'b> {
    Box::new(self.basic_blocks().map(|bb| bb.statements()).flat_map(|ss| ss))
}
```
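For illustration, here is a sketch of how that same accessor might look once impl Trait is available, using stand-in Statement/BasicBlock/Function types (the real definitions will differ); the whole chain is unboxed and lives on the stack:

```rust
// Stand-in types, purely for illustration.
struct Statement;

struct BasicBlock {
    statements: Vec<Statement>,
}

struct Function {
    basic_blocks: Vec<BasicBlock>,
}

impl Function {
    fn basic_blocks(&self) -> impl Iterator<Item = &BasicBlock> {
        self.basic_blocks.iter()
    }

    // The boxed version, unboxed: no heap allocation for the iterator chain.
    fn statements(&self) -> impl Iterator<Item = &Statement> {
        self.basic_blocks().flat_map(|bb| bb.statements.iter())
    }
}
```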
Efficiency - contiguous allocations
This is fairly straightforward: an implementation should strive to keep all IL, etc. in a single contiguous vector or boxed container of some sort. A typical implementation (quite naturally) has basic blocks contain their own statement, mnemonic, etc. allocations. While this is easier to deal with, it isn't as cache friendly as we'd like. I also think contiguity will improve general performance because the allocator won't be as taxed, but this is only a hunch; it needs numbers to back it up.
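To make the idea concrete, here is a minimal sketch (all names hypothetical, not a proposed API) where basic blocks are just index ranges into one contiguous statement vector, so reading a block's statements is a cheap slice rather than a per-block allocation:

```rust
// A basic block is a half-open range [start, end) into the shared vector.
struct BasicBlock {
    start: usize,
    end: usize,
}

struct Function<IL> {
    instructions: Vec<IL>, // every statement of the function, one allocation
    basic_blocks: Vec<BasicBlock>,
}

impl<IL> Function<IL> {
    /// Statements of basic block `i`: a slice borrow, no allocation.
    fn block_statements(&self, i: usize) -> &[IL] {
        let bb = &self.basic_blocks[i];
        &self.instructions[bb.start..bb.end]
    }
}
```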
Efficiency - IL statement explosion problem + binary IL encoding
This will likely be more controversial, and definitely more difficult to implement; but however this gets solved or mitigated, it needs to be, because the statement explosion makes analysis of larger binaries essentially unscalable.
To illustrate, here are some stats from panopticon + libc; it boils down to the final equation:
```
sizeof Rvalue:    72
sizeof Lvalue:    64
sizeof Mnemonic:  112
sizeof Operation: 160
sizeof Statement: 224  (panopticon's IL)

avg size in bytes of a function: 1003.6297
total functions: 3208
avg #mnemonics per function: 592.8117
avg #statements per function: 1877.2189

224 * 3000 * 1800 = 1209600000 = 1.2GB
```
So fully analyzing libc (a 2MB binary) takes approximately 1.2GB of memory.
There are several ways to fix this:
1. reduce the number of statements (i.e., increase the complexity of the IL)
2. reduce the sizeof a statement
3. don't analyze all of libc
I believe all 3 are acceptable and possible targets :D
So first, technically speaking, if we do this crate right, 1. actually is a possibility, since Function will let us experiment with different ILs; it could in principle let me define a more complex IL, specify that I want to use it, and then I'm off to the races. (How exactly that would work will be a matter for Design: IL :D)
Currently, I think 2. should be a part of the design considerations immediately when working on Function; I say this because it will imply semi-radical divergences in how we represent and hold the IL. It may end up being infeasible, or require figuring out some other things first, but I believe it is a worthy goal. For reference, here is a very simple and hastily written binary encoding for RREIL: das-labor/panopticon#313 (comment)
This reduces the sizeof to 27 bytes, which makes larger analyses possible (but only postpones the problem for very large binaries).
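I haven't pinned down the exact layout behind that 27-byte figure, but just to show the order of magnitude at stake, here is a hypothetical fixed-width statement (a 1-byte opcode plus three 4-byte operand indices into side tables such as a register table or StringTable):

```rust
// Hypothetical fixed-width encoded statement. `repr(packed)` drops padding,
// so the struct is exactly 1 + 4 + 4 + 4 = 13 bytes, versus panopticon's
// 224-byte Statement.
#[repr(C, packed)]
struct PackedStatement {
    opcode: u8,
    dst: u32,  // index into a side table, not an inline Lvalue
    src1: u32,
    src2: u32,
}
```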
Unfortunately, doing so has some major problems, especially for phi statements and string references. The second (string references) requires that some kind of StringTable now be present. The first (phi statements) is the problem that variable-length instructions/statements pose; this is solvable by moving from a fixed-width encoding to a variable-length one (which I suspect will result in massive amortized space savings, but again, this needs numbers), but I believe it will make a Vec<IL> unrealizable (as the binary encoding is now essentially unsized).
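In the variable-length case a Vec<IL> indeed disappears, but random access can survive via a side table of offsets into one byte stream; a sketch (hypothetical layout, not a proposal):

```rust
// Variable-length encoded statements packed back to back in one byte
// stream; `offsets[i]` gives O(1) access to the start of statement i.
struct EncodedStatements {
    bytes: Vec<u8>,    // all encoded statements, contiguous
    offsets: Vec<u32>, // offsets[i] = start of statement i in `bytes`
}

impl EncodedStatements {
    /// Raw encoded bytes of statement `i`.
    fn statement(&self, i: usize) -> &[u8] {
        let start = self.offsets[i] as usize;
        let end = self
            .offsets
            .get(i + 1)
            .map(|&o| o as usize)
            .unwrap_or(self.bytes.len());
        &self.bytes[start..end]
    }

    fn len(&self) -> usize {
        self.offsets.len()
    }
}
```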
A further problem is the tension this poses with ergonomics; it would force every IL to have some kind of IL: BinaryEncoding bound, and to implement it in order to be contained in Function - which isn't necessarily bad, but it does complicate matters.
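Just to have something concrete to react to, the bound might look roughly like this (a hypothetical shape, not the actual trait), with a toy one-opcode IL showing what an implementor would have to write:

```rust
// Hypothetical bound: an IL statement that round-trips through a byte
// stream. `decode` returns the statement plus the number of bytes consumed,
// to accommodate variable-length encodings.
trait BinaryEncoding: Sized {
    fn encode(&self, out: &mut Vec<u8>);
    fn decode(bytes: &[u8]) -> Option<(Self, usize)>;
}

// Toy IL with a single statement kind, purely for illustration.
#[derive(Debug, PartialEq)]
struct Nop;

impl BinaryEncoding for Nop {
    fn encode(&self, out: &mut Vec<u8>) {
        out.push(0x90); // arbitrary opcode chosen for this toy example
    }
    fn decode(bytes: &[u8]) -> Option<(Self, usize)> {
        match bytes.first() {
            Some(&0x90) => Some((Nop, 1)),
            _ => None,
        }
    }
}
```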
In particular, I'd like to hear what everyone's thoughts are on this matter - whether a binary encoding bound is acceptable, what their IL's encoding might look like, etc.
Having this as a design target is the easiest way to get solid, foundational, shareable data structure(s) we can all use that will permit more sophisticated analyses to run on larger binary targets, which then gives us room to grow :)
Lastly, as for 3.: there is a property I'd also like to see worked into Function if possible - a lazy statement generator.
If the IL isn't constructed eagerly, but only upon first request, this could help relieve some of the memory pressure when running a tool on a large binary.
How this might work, implementation-wise in Rust, is another story, however, and perhaps best left tabled for the moment.
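Tabling the details as suggested, one naive shape just to anchor the idea (names and signatures entirely hypothetical): lift on first request, cache the result, and never pay the IL memory cost for functions nobody looks at.

```rust
// Statements are lifted from the raw bytes on first request only.
struct LazyFunction<IL> {
    raw_bytes: Vec<u8>,
    lifted: Option<Vec<IL>>, // None until someone asks for statements
}

impl<IL> LazyFunction<IL> {
    fn new(raw_bytes: Vec<u8>) -> Self {
        LazyFunction { raw_bytes, lifted: None }
    }

    /// Lift with `lift` on the first call; afterwards, return the cached IL.
    fn statements<F>(&mut self, lift: F) -> &[IL]
    where
        F: FnOnce(&[u8]) -> Vec<IL>,
    {
        if self.lifted.is_none() {
            self.lifted = Some(lift(&self.raw_bytes));
        }
        self.lifted.as_deref().unwrap()
    }
}
```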
Fin
Anyway, this was a lot, sorry! Just trying to get the ball rolling.
Please take your time, don't feel rushed, and let's have some fun!