The rep prefix and detecting valgrind

2012-09-20

As a comment to our last post, @elvanderb posted some code that could be used to detect valgrind.

The code is really interesting, but something confused me at first: the comment indicates that the two stored bytes, F3 AA, are rep movsb. Now I don't know the opcodes of each instruction, but what the inline asm does seems to do fits more with a stosb/scasb: put a value in al, and loop over some memory. AA is actually stosb.

For reference, here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/mman.h>

int main()
{
        uint8_t* page;
        uint32_t nb_written = 0xFFFFFFFF;

        page = mmap(NULL, 0x1000, PROT_EXEC | PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (! page)
        {
                fprintf(stderr, "mmap failed\n");
                return 1;
        }
    // 0x803 is in the middle of the cache to ensure that it'll be overwritten by the rep movsb
        page[0x803] = 0xF3; // rep prefix
        page[0x804] = 0xAA; // movsb
        __asm__("movq %1, %%rdx ;"
                "movq  %%rdx, %%rdi ;"
                "addq $0x803, %%rdx ;"
                "xor  %%ecx, %%ecx ;"
                "dec  %%ecx ;"
                "movb $0xC3, %%al ;"
                "callq *%%rdx ;"
                "movl %%ecx, %0 ;"
                : "=r"(nb_written)
                : "r"(page)
                : "%eax", "%edi", "%ecx", "%edx");
        nb_written = 0xFFFFFFFF - nb_written;
        printf("Nb bytes written : %08X\n", nb_written);
        if (nb_written <= 0x804)
        {
                printf("Probably monitored by valgrind\n");
                return 1;
        }
        printf("No valgrind detected\n");
        return 0;
}

However, I couldn't get it to work: depending on the machine, it either always reports valgrind, or always segfaults. There are several reasons why it should work but doesn't. @elvanderb, I'm all ears!

For those who can't stand gcc inline asm, what it does is mmap some memory, put a rep stosb (store ecx times al in memory, starting at rdi) in it, and jump to the rep stosb, after having set rdi to the start address of the mmaped memory, ecx to 0xFFFFFFFF, and al to C3, rets opcode.

That way, the rep movsb should have filled the mmaped memory with rets. What the program does is check that the rep movsb stops as soon as it overwrites itself. If it did, then it prints that the program was probably ran under valgrind. But why should it continue if there is no valgrind? When does the rep stop?

A naively correct answer would be "when the executed instruction is overwritten". And indeed, that's what emulators and translators like valgrind do. With the latter, there is no such thing as a rep prefix, so this is done using conditional jumps, that GET ecx, compare it with 0, etc.. When the instruction is overwritten, valgrind has to retranslate the basic block, and it immediately stops executing the rep movsb.

Now what exactly is done by CPUs?
If you take an Intel reference, here is the pseudocode for rep:

WHILE CountReg != 0
DO
        Service pending interrupts (if any);
        Execute associated string instruction;
        CountReg <-(CountReg - 1);
        IF CountReg = 0
                THEN exit WHILE loop; FI;
        IF (Repeat prefix is REPZ or REPE) and (ZF = 0)
        or (Repeat prefix is REPNZ or REPNE) and (ZF = 1)
                THEN exit WHILE loop; FI;
OD;

Note that when an interrupt is raised, it is handled at the next iteration. This works because the string operations update the registers they need (rcx,rdi,rsi,..), so this:

mov rdi, 0x12344321
mov al, 0x90

mov rcx, 50
rep stosb

should be equivalent to this:

mov rdi, 0x12344321
mov al, 0x90

mov rcx, 25
rep stosb
mov rcx, 25
rep stosb

The rest of the time, the instruction is executed, and all is well.

How does rep work? Looking some more around the Intel reference (the low-level hacker's best friend!), there are paragraphs that are quite interesting.

First, Vol.3 11.6, gives advice on how to handle Self-modifying code.

On pipelined CPUs, as an optimization, the prefetch queue (the PIQ, for Prefetch Input Queue; warning: old article) is used to store instructions that have already been fetched. The PIQ is flushed at jumps or interrupts.
Back in the olden days (Intel486), this was not transparent: it was quite easy to find out what the PIQ's size is. If it was 0, then no PIQ was being used, so it seemed likely that the CPU was emulated or traced: the trick was about overwriting the next few instructions, and seing where it stops executing these. An anti-debugging technique was to use this kind of things to detect single-stepping, as at each step, the PIQ is flushed (because of the generated interrupt).

With the Pentium, Intel fixed this, and made overwriting prefetched instructions invalidate the PIQ.
Some claim that rep movs/stos were still problematic because they were kept in the PIQ for the duration of the operation, though I couldn't find any source.

But there is another problem with these instructions.
Vol.1 7.3.9.3, Fast-String Operation, explains that the eponymous feature is an automatic optimization of the string operation. When all conditions (unspecified; regarding alignment, counter value (ecx), etc..) are met, the movs/stos can be optimized, in an again unspecified manner, by the processor. Basically, it amounts to SIMD-izing 9 the stores, for instance storing in groups of 16 bytes at a time. There are two major consequences:

The interrupts are only handled between group boundaries (in our example the interrupts are no longer serviced after 1 byte is stored, but after 16).
The memory model is such that stores may be reordered as seen fit by the CPU, as explained by Vol.3 8.2.4, Fast-String Operation and Out-of-Order Stores: the string operation's stores are reordered, but only within the string operation. Any other store will be seen as normally ordered relative to the string operations.

What's more, in the optimization manual, 3.7.7 Enhanced REP MOVSB and STOSB operation (ERMSB) says that starting with Ivy Bridge CPUs, string operations are even faster, especially for lengths that are multiples of 64. The given example is that copying 65-128 bytes take 40 cycles, while copying 128 bytes needs only 35 cycles.
This makes optimizing by hand almost vain:

larger-sized string instructions (for example stosd instead of stosb) are slower.
SIMD (SSE, AVX, ..) is costlier in terms of code size and complexity: take a look at Apple's memset, that once supposedly ran much faster than glibc.

What's funny is that not so long ago, optimizing string operations in hardware was seen as a waste because far too costly, and far easier to do using carefully hand-written assembly.

But let us get back to detecting valgrind. With these SIMD-like optimizations, even with the memory model guaranteeing that an overwritten instruction will do the needed invalidation, the rep stosb will be overwritten at the same time that the whole group of 64 of 128 bytes have been written.

I think that could also be used to detect emulators, but that's just a wild guess, that may even be proven wrong by the fact that my CPU (Arrandale corei7) stops exactly when the rep stosb is overwritten, making the program detect valgrind when it should not.

Bonus fact: even though the rep prefix is supposedly only defined with string instructions, several instructions actually have a mandatory rep/repne prefix, mostly SSE stuff, to distinguish between variants. However, pause (used as a hint that we are busy-waiting, for instance in spinlocks) is encoded as F3 90 and is actually equivalent to rep nop and will be handled as a special case. If I recall correctly, objdump doesn't know about pause, and simply outputs rep nop.

Single Instruction Multiple Data, stands for instructions that do the same operation on several inputs at once. For instance, using SSE, one can multiply 4 floats using a single instruction