Detecting Valgrind

2012-09-17 | Ahmed Bougacha

After our static analysis excursion, we worked (and are still working) on dynamic binary analysis through instrumentation.

Valgrind does that, how?

For most people, valgrind detects memory access errors, double frees, etc. It is actually a framework for dynamic binary instrumentation. There are several tools in the official release, even more in the wild, and Memcheck, the default one that does the memory checking, is what people usually mean by valgrind.

Valgrind (and even more so QEMU), when running a program, actually emulates a CPU by doing binary translation:

  • The unit of code is the basic block: other blocks jump into it at the first instruction only, and it has a jump as the last instruction. Basically a block of sequential instructions.
  • When first jumping to a basic block, valgrind translates it into its machine-independent IR, VEX.
  • The IR is in Static Single Assignment form. Each variable (called a "temporary" in VEX) is assigned a value only once, at its creation.
  • During the translation, it calls the current tool's instrumentation function, which could, for example, add checks before each LOAD/STORE in the form of a function call. The instrumentation is done on the flattened IR. Flattened means that all expressions are "atoms", either constants or temporaries. Basically, you don't care about handling nested operations like

    Mul32(Add32(LOAD(0xd00d), 1), 2)
    

    but instead only about simple operations:

    t1 = LOAD(0xd00d)
    t2 = Add32(t1, 1)
    t3 = Mul32(t2, 2)
    
  • The instrumented IR is then translated back into machine code for the host architecture.

  • The emulated CPU's registers (the guest state) are represented as a block of memory, with an offset for each register.
  • The addresses that the original code uses are different from the ones the translated code uses. The translated code is stored in another region of memory, and the scheduler jumps between the different translated code locations. A rough model of these last two points is sketched right after this list.
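
To make those last two points concrete, here is a crude model in C. It is only a sketch: the struct layout, offsets, and function names are ours and purely illustrative, not valgrind's (the real guest state lives in structures like VexGuestAMD64State).

#include <stdio.h>
#include <stdint.h>

/* The emulated CPU's registers live in an ordinary block of memory;
 * each register is just an offset into it. */
struct guest_state {
    uint64_t rax;   /* offset 0  */
    uint64_t rbx;   /* offset 8  */
    uint64_t rip;   /* offset 16 */
};

/* Hand-written stand-in for the translation of "add rax, 1" (4 bytes):
 * the translated code updates the guest state instead of real registers,
 * and lives at a completely different address than the original code. */
static void translated_add_rax_1(struct guest_state *gs)
{
    gs->rax += 1;
    gs->rip += 4;
}

int main(void)
{
    struct guest_state gs = { .rax = 41, .rip = 0x400000 };
    translated_add_rax_1(&gs);
    printf("guest rax = %llu, guest rip = %#llx\n",
           (unsigned long long)gs.rax, (unsigned long long)gs.rip);
    return 0;
}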

There are lots of brilliant tricks in both valgrind and QEMU. An open project involving LLVM that could be fun is replacing valgrind's VEX with LLVM IR. Maybe not replacing it, but adding another stage after instrumentation that uses LLVM's optimization passes and code generation. Someone has already tried doing that for QEMU, with mixed results; the difference is that running fast is QEMU's main objective, not valgrind's.

Now, there aren't a lot of security-oriented valgrind tools. However, dynamic binary instrumentation is really promising.
Nikita Tarakanov explained at HITB2012AMS how taint analysis was used to automatically find vulnerabilities (we have lots of articles in the pipeline; we'll soon talk about taint analysis).
They used BitBlaze, a binary analysis framework focused on security. For our own vulnerability-finding, taint-analyzing tool, we couldn't use BitBlaze. That's a shame, really, because it seems awesome. BitBlaze has three parts:

  • Vine: static analysis. For that we have some Python and a regular x86 disassembler. Vine has the huge advantage of translating to an instruction set that's more easily manageable than x86 (let's not forget about repz ret!). We thought about using VEX for the same purpose, which would make the whole thing quite coherent.
  • TEMU: dynamic analysis, basically QEMU with taint analysis. We use valgrind. QEMU has the (for us, lesser) advantage of also emulating complete systems, making it possible to perform taint analysis on whole operating systems.
  • Rudder: symbolic execution. We are investigating the Z3 theorem prover.

Remember, all this is about searching for vulnerabilities. Around the time we began having results, I attended the LSE Summer Week (French), which had a crackme. And then it came to me: why not use taint analysis and symbolic execution to defeat crackmes?
As it turns out, @delroth_, the crackme's author, wrote about it some time later.

We thought for a while about how this could be made harder. How about adding code in the crackme to detect valgrind, the same way some programs detect virtual machines? There are several things that betray valgrind.

First, there are several instructions that valgrind does not implement.
Some time ago, we had to write a libc using x86-64 assembly only. I tried to have fun with the instruction set. That's how you end up using enter.
The Intel reference says it is used for nested procedure stack frames, in block-structured languages. C doesn't have nested procedures; Algol, Pascal, among others, do.
enter is a mess.
I don't think I ever saw it being generated by a C compiler.

Quick aside: actually, gcc does support nested procedures as an extension. Does it generate enter? Sure doesn't look like it.

I was saying that I used enter in the prologue of some C library functions. I tried running my test program under valgrind. Valgrind didn't like it: SIGILL.

To detect valgrind, we wrote some code that had an (inline assembly) enter and a SIGILL handler. If valgrind tried to translate the enter, it signaled the process, running our handler. The handler then checked the faulting address and printed a snarky message. Otherwise, everything was OK. This is some fine crackme material.
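
Here is a minimal sketch of that idea. It is not the original crackme code: the names are ours, and the handler simply exits instead of fixing up execution past the faulting instruction (which would be messier).

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void sigill_handler(int sig)
{
    (void)sig;
    /* We only get here if something (valgrind) refused to run enter. */
    puts("nice emulator you have there");
    exit(1);
}

int main(void)
{
    signal(SIGILL, sigill_handler);

    /* enter 0, 0 sets up an empty stack frame, leave tears it down.
     * Natively the pair is harmless; under valgrind the enter is not
     * translated and SIGILL is delivered instead. */
    __asm__ volatile ("enter $0, $0\n\t"
                      "leave");

    puts("no valgrind in sight, carry on");
    return 0;
}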

We have valgrind's source, so what about adding some basic enter support? There once was enter support, but it has since been dropped.
Julian Seward, valgrind's creator, closing a bug report complaining about enter:

Why should we add support for an instruction that nobody uses?

Second, valgrind has a feature called client requests. Basically, a program running under valgrind can insert specific multi-instruction no-ops that have a special meaning to valgrind, but otherwise don't do anything.
This is a simple one to circumvent: just remove the relevant call.
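
Ironically, the same mechanism gives the easiest (and easiest to patch out) valgrind check: the header shipped with valgrind wraps one of those no-op sequences in a macro. A sketch, assuming the valgrind development headers are installed:

#include <stdio.h>
#include <valgrind/valgrind.h>

int main(void)
{
    /* RUNNING_ON_VALGRIND expands to the magic no-op sequence and
     * evaluates to non-zero when running under valgrind, 0 natively. */
    if (RUNNING_ON_VALGRIND)
        puts("running under valgrind");
    else
        puts("running natively");
    return 0;
}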

Third, packers, self-modifying code, and similar mechanisms generate code dynamically.
Valgrind marks all pages containing code that has been translated read-only. When a write occurs, it checks whether the write changed already translated code. If it did, it invalidates the stored translated code. The next time something jumps to the modified code, it is translated once again.

On some architectures, the CPU caches have to be handled manually. When code is modified, the instruction cache has to be reloaded. On PowerPC, for example, there are no fewer than six instructions to execute. This makes it easy to detect self-modifying code without resorting to read-only pages.
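
To illustrate, here is a sketch of the kind of thing a packer does: emit machine code into a buffer at run time, then jump to it. The stub bytes and the mmap dance are ours and x86-64 Linux specific; the cache-flush call is a GCC builtin that compiles to nothing on x86 but expands to the kind of sequence described above on PowerPC. Under valgrind, the generated bytes get translated the first time the function pointer is called.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* mov eax, 42 ; ret */
    static const unsigned char stub[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memcpy(buf, stub, sizeof stub);

    /* On architectures whose instruction cache is not coherent with the
     * data cache, the freshly written code must be flushed before use.
     * On x86 this compiles to nothing. */
    __builtin___clear_cache((char *)buf, (char *)(buf + sizeof stub));

    int (*fn)(void) = (int (*)(void))buf;
    printf("generated code returned %d\n", fn());

    munmap(buf, 4096);
    return 0;
}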

Last, another anti-reversing technique is jumping into the middle of instructions.
Due to the way valgrind does the translation, this is not a problem. Actually, valgrind does not even try to be smart about this.
To take a simpler example, if you have the following asm:

jmp bb_start

bb_start:
    xor al, al
target:
    add al, 1
    cmp al, 0x66
    je target
bb_end:

Valgrind will first, when executing the jmp, translate the basic block starting at bb_start and ending at bb_end, and then, when executing je, translate the basic block from target to bb_end.
We see that even with jumps into the middle of basic blocks, valgrind is quite conservative and retranslates the remaining instructions. As for jumps into the middle of instructions, Jurriaan Bremer has posted a great example in the comments; valgrind handles that the same way.

There are a lot of interesting things to say about valgrind's basic block dispatch mechanism; I'll try to write about it some day!