A trek through zpoline (part 1)
So, the HN front page led me to an interesting project: bpftime. It's a userspace eBPF runtime for hooking syscalls and uprobes! Think about that for a second. As complicated as it is, it builds on two underlying technologies, one to track userspace functions and another for syscalls.
The second one is something called zpoline: a paper from USENIX ATC '23. The aim of this work is to find an efficient, fast, complete solution to syscall hooking - which it does entirely in userspace using binary rewriting. I was simply blown away by the simplicity and excellence of this work - this is the most exciting paper I have read in a while. In fact, this paper won the best paper award - which is totally deserved.
In this blog, I am going to take a trek through this paper. The aim is not to explain the paper itself - you can read the original paper - it may be one of the simplest papers in existence, if you have the prerequisite knowledge. This is not a teardown or walkthrough either - since I don't have the time to pursue a full implementation, unfortunately. Instead, I will simply call out interesting aspects of the work, point out parts of the code and do some minimal hands-on work in the same space as this paper. I aim to over-explain in this blog, so that beginners to these concepts may be able to follow.
1. Examining the syscall instruction
One of the core pillars of this paper is that syscalls are instructions in the binary which take up 2 bytes. The challenge addressed in this paper is being able to efficiently use these 2 bytes - naive approaches to replace the syscall call take up more than 8 bytes.
I actually didn't know/realize that syscall calls were a part of the instruction set. Sure enough, looking it up on https://www.felixcloutier.com/x86/ does show it. But, I want to see it for myself.
Let us start with a simple C program. Since the paper mentions "read" being the first syscall, I wanted to find this out.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char c;
    c = getc(stdin);
    printf("Value: %c\n", c);
    return 0;
}
Great, you can compile it using gcc:
$ gcc simple.c
and run it as usual ./a.out
at which point you have a simple program that works as expected.
If you examine the binary using:
$ objdump -d a.out | less
you will run into a problem: the "syscall" instruction is nowhere to be found, since we are simply calling a standard C library function, "getc".
In the disassembled binary, searching for main will give you a section like the following:
0000000000400616 <main>:
  400616: 55 push %rbp
  400617: 48 89 e5 mov %rsp,%rbp
  40061a: 48 83 ec 10 sub $0x10,%rsp
  40061e: 48 8b 05 0b 0a 20 00 mov 0x200a0b(%rip),%rax # 601030 <stdin@@GLIBC_2.2.5>
  400625: 48 89 c7 mov %rax,%rdi
  400628: e8 f3 fe ff ff callq 400520 <getc@plt>
  40062d: 88 45 ff mov %al,-0x1(%rbp)
  400630: 0f be 45 ff movsbl -0x1(%rbp),%eax
  400634: 89 c6 mov %eax,%esi
  400636: bf e8 06 40 00 mov $0x4006e8,%edi
  40063b: b8 00 00 00 00 mov $0x0,%eax
  400640: e8 cb fe ff ff callq 400510 <printf@plt>
The calls to getc (and printf) - the "callq" lines - refer to other addresses in the same binary. Let us look at them, in a separate section called .plt:
Disassembly of section .plt:

0000000000400500 <.plt>:
  400500: ff 35 02 0b 20 00 pushq 0x200b02(%rip) # 601008 <_GLOBAL_OFFSET_TABLE_+0x8>
  400506: ff 25 04 0b 20 00 jmpq *0x200b04(%rip) # 601010 <_GLOBAL_OFFSET_TABLE_+0x10>
  40050c: 0f 1f 40 00 nopl 0x0(%rax)

0000000000400510 <printf@plt>:
  400510: ff 25 02 0b 20 00 jmpq *0x200b02(%rip) # 601018 <printf@GLIBC_2.2.5>
  400516: 68 00 00 00 00 pushq $0x0
  40051b: e9 e0 ff ff ff jmpq 400500 <.plt>

0000000000400520 <getc@plt>:
  400520: ff 25 fa 0a 20 00 jmpq *0x200afa(%rip) # 601020 <getc@GLIBC_2.2.5>
  400526: 68 01 00 00 00 pushq $0x1
  40052b: e9 d0 ff ff ff jmpq 400500 <.plt>
PLT stands for Procedure Linkage Table. Along with the GOT (Global Offset Table), it allows external functions (in this case from the standard library) to be linked dynamically. Understandably, we will never get to our syscall this way.
Let us do 2 things:
- Switch to the read stdlib call, instead of getc. Why? Just to remove one layer of indirection.
- Statically compile our program. This is not normally done with C, but this is the only way to get the binary of our dependencies into a place where we can actually inspect them.
For the first, we simply replace the getc line with the following:
read(0, &c, 1);
Note: this read is not the read system call itself. It is a C standard library function that wraps the system call. In the man pages, section 2 is devoted to system calls and section 3 to library functions. So simply running man 2 read or man 3 read will get you the corresponding help pages.
Second, we need to statically compile the C program. For this, you first need to get the static version of the standard library. On RHEL, that is the following:
$ sudo dnf install glibc-static
Then, you can compile the program as follows:
$ gcc -static simple.c
If you look at the created binary, it will be much larger than before. Disassemble as usual; the output will also be quite large, with all sorts of unneeded functions being included. Not neat, but works for us.
Now, let's follow main.
0000000000400ac5 <main>:
  400ac5: 55 push %rbp
  400ac6: 48 89 e5 mov %rsp,%rbp
  400ac9: 48 83 ec 10 sub $0x10,%rsp
  400acd: 48 8d 45 ff lea -0x1(%rbp),%rax
  400ad1: ba 01 00 00 00 mov $0x1,%edx
  400ad6: 48 89 c6 mov %rax,%rsi
  400ad9: bf 00 00 00 00 mov $0x0,%edi
  400ade: e8 ad cb 03 00 callq 43d690 <__libc_read>
  400ae3: 0f b6 45 ff movzbl -0x1(%rbp),%eax
  400ae7: 0f be c0 movsbl %al,%eax
  400aea: 89 c6 mov %eax,%esi
  400aec: bf 50 e9 47 00 mov $0x47e950,%edi
  400af1: b8 00 00 00 00 mov $0x0,%eax
  400af6: e8 a5 81 00 00 callq 408ca0 <_IO_printf>
  400afb: 90 nop
  400afc: c9 leaveq
  400afd: c3 retq
  400afe: 66 90 xchg %ax,%ax
Pretty similar to last time, except there is no reference to the PLT now. This time, we are going straight to the symbol __libc_read, cutting out the middleman getc.
000000000043d690 <__libc_read>:
  43d690: f3 0f 1e fa endbr64
  43d694: 8b 05 f6 d1 26 00 mov 0x26d1f6(%rip),%eax # 6aa890 <__libc_multiple_threads>
  43d69a: 85 c0 test %eax,%eax
  43d69c: 75 12 jne 43d6b0 <__libc_read+0x20>
  43d69e: 31 c0 xor %eax,%eax
  43d6a0: 0f 05 syscall
  43d6a2: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
  43d6a8: 77 56 ja 43d700 <__libc_read+0x70>
  43d6aa: c3 retq
  43d6ab: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
  43d6b0: 41 54 push %r12
  43d6b2: 49 89 d4 mov %rdx,%r12
  43d6b5: 55 push %rbp
  43d6b6: 48 89 f5 mov %rsi,%rbp
  43d6b9: 53 push %rbx
  43d6ba: 89 fb mov %edi,%ebx
  43d6bc: 48 83 ec 10 sub $0x10,%rsp
  43d6c0: e8 4b 3a 02 00 callq 461110 <__libc_enable_asynccancel>
  43d6c5: 4c 89 e2 mov %r12,%rdx
  43d6c8: 48 89 ee mov %rbp,%rsi
  43d6cb: 89 df mov %ebx,%edi
  43d6cd: 41 89 c0 mov %eax,%r8d
  43d6d0: 31 c0 xor %eax,%eax
  43d6d2: 0f 05 syscall
  43d6d4: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
  43d6da: 77 38 ja 43d714 <__libc_read+0x84>
  43d6dc: 44 89 c7 mov %r8d,%edi
  43d6df: 48 89 44 24 08 mov %rax,0x8(%rsp)
  ... lines truncated
A number of things are happening here, which I don't fully understand, but the first few lines directly show what we came for.
...
  43d69e: 31 c0 xor %eax,%eax
  43d6a0: 0f 05 syscall
...
XOR'ing eax with itself sets it to 0. Then syscall, the instruction in binary being 0f 05, is executed. Exactly like the paper says.
Now, the paper said that the read syscall is number 0, which is as seen here. But we can check out all syscall numbers by looking at the "sys/syscall.h" header file. That file points to another file and, a few redirects later, we settle on /usr/include/asm/unistd_64.h on my system. This looks like the following:
#ifndef _ASM_X86_UNISTD_64_H
#define _ASM_X86_UNISTD_64_H 1

#define __NR_read 0
#define __NR_write 1
#define __NR_open 2
#define __NR_close 3
#define __NR_stat 4
#define __NR_fstat 5
#define __NR_lstat 6
...
There you go. Syscall numbers as promised, running to 439 in my case.
Note: You may have heard of interrupt 0x80 (the int 0x80 instruction) as a way of doing syscalls. Apparently that is part of the 32-bit ABI and is pretty much deprecated with the x86-64 ABI. The newer alternative is what we see here, the syscall instruction.
2. LD_PRELOAD Library Overriding
I am going to be brief here and simply point to an existing article on this topic: https://www.baeldung.com/linux/ld_preload-trick-what-is.
The point is, you can easily, dynamically, load libraries before your binary is run and you can also override internal functions. Let us see a simple example, lifted from this Stack Overflow QA: https://stackoverflow.com/questions/6083337/overriding-malloc-using-the-ld-preload-mechanism
Create a "malloc.c" that looks like this:
#define _GNU_SOURCE // needed for the RTLD_NEXT constant, see man dlsym
#include <stdio.h>
#include <dlfcn.h> // gets you dlsym

static void *(*real_malloc)(size_t) = NULL;

static void malloc_init(void)
{
    real_malloc = dlsym(RTLD_NEXT, "malloc"); // look up the real malloc
    if (NULL == real_malloc) {
        fprintf(stderr, "Error in `dlsym`: %s\n", dlerror());
    }
}

void *malloc(size_t size)
{
    if (real_malloc == NULL) {
        malloc_init();
    }
    void *p = real_malloc(size); // call the real malloc
    fprintf(stderr, "malloc(%zu) = %p\n", size, p); // %zu is the format for size_t
    return p;
}
Let us compile this into a shared library:
$ gcc -shared -fPIC malloc.c -o libmalloc.so
Now, running the following:
$ LD_PRELOAD=`pwd`/libmalloc.so ls
will run ls while also printing all the malloc calls.
See also: man 8 ld.so
3. Using the constructor attribute
In the previous example, we did something complicated - overriding an existing function. We (i.e. zpoline) don't need something that complicated.
Instead, when we load libzpoline.so using LD_PRELOAD, we simply need to run a function that does certain things - rewrite the loaded binary. This is done using the constructor attribute.
Take a look at the following.
#include <stdio.h>

__attribute__((constructor)) static void myinit(void)
{
    fprintf(stderr, "Haha");
}
We create a function myinit (marked static, since we don't need anyone else to call it) and set the constructor attribute. Now, when this library is loaded using LD_PRELOAD as before, this function will be run before the main program starts.
$ gcc -shared -fPIC -o libconstr.so constr.c
$ LD_PRELOAD=`pwd`/libconstr.so ls
Haha <output of ls>
You can specify a priority for the constructor function from 101 to 65535 (0 to 100 are reserved), using the syntax __attribute__((constructor(105))) for example. Constructors with smaller priority numbers run first.
Full documentation for attributes may be found here: https://gcc.gnu.org/onlinedocs/gcc-8.4.0/gcc/Common-Function-Attributes.html. (Search for constructor on that page.)
See how zpoline uses this here: https://github.com/yasukata/zpoline/blob/0a349e65c102f8f9bdbbf6da0a52c4006589178b/main.c#L545
zpoline needs to run at the very end, so they use the largest priority number possible, which is the lowest priority (0xffff is 65535).
So, for now, using the LD_PRELOAD mechanism, zpoline is injected into the application and whatever it needs to do is run at the very beginning. Then, the application will run as normal (excepting any changes we made, of course).
This injected function does 3 things:
- Set up the trampoline at the very beginning.
- Rewrite all of the syscall instructions to redirect to the trampoline at zero (hence the zpoline).
- Load the user defined hook function - the actual business logic to replace the original syscall.
4. Memory maps in the /proc file system
You probably know that the /proc filesystem is a virtual, file-like view into the state of the system and its processes. You can look up all sorts of things for a process of pid p by simply going to the directory /proc/p/.
zpoline uses /proc/self/maps to begin to manage its own memory. The function is here: https://github.com/yasukata/zpoline/blob/0a349e65c102f8f9bdbbf6da0a52c4006589178b/main.c#L336
Broadly, the steps are:
- Use "self" to refer to our own process. Remember, zpoline is now inside the target process.
- Look up the memory maps.
- Filter out some that we don't want to touch, like the stack.
- Filter in the ones that have the "executable" bit set. Syscalls would not be called from other places.
- Run those portions through the disassembler to get hints (instead of blindly re-writing). (Kind of like how we manually ran objdump and looked at the instructions)
- Go and replace the instruction at those locations.
Let us look at the memory map for cat.
$ cat /proc/self/maps
55d6b9602000-55d6b960a000 r-xp 00000000 fd:02 805886940 /usr/bin/cat
55d6b9809000-55d6b980a000 r--p 00007000 fd:02 805886940 /usr/bin/cat
55d6b980a000-55d6b980b000 rw-p 00008000 fd:02 805886940 /usr/bin/cat
55d6bae87000-55d6baea8000 rw-p 00000000 00:00 0 [heap]
7f6ea0537000-7f6ead4ed000 r--p 00000000 fd:02 14146528 /usr/lib/locale/locale-archive
7f6ead4ed000-7f6ead6a8000 r-xp 00000000 fd:02 1673184 /usr/lib64/libc-2.28.so
7f6ead6a8000-7f6ead8a8000 ---p 001bb000 fd:02 1673184 /usr/lib64/libc-2.28.so
7f6ead8a8000-7f6ead8ac000 r--p 001bb000 fd:02 1673184 /usr/lib64/libc-2.28.so
7f6ead8ac000-7f6ead8ae000 rw-p 001bf000 fd:02 1673184 /usr/lib64/libc-2.28.so
7f6ead8ae000-7f6ead8b2000 rw-p 00000000 00:00 0
7f6ead8b2000-7f6ead8e0000 r-xp 00000000 fd:02 5042440 /usr/lib64/ld-2.28.so
7f6eadaa4000-7f6eadac9000 rw-p 00000000 00:00 0
7f6eadade000-7f6eadae0000 rw-p 00000000 00:00 0
7f6eadae0000-7f6eadae1000 r--p 0002e000 fd:02 5042440 /usr/lib64/ld-2.28.so
7f6eadae1000-7f6eadae3000 rw-p 0002f000 fd:02 5042440 /usr/lib64/ld-2.28.so
7ffdb0c8a000-7ffdb0cad000 rw-p 00000000 00:00 0 [stack]
7ffdb0d0b000-7ffdb0d0f000 r--p 00000000 00:00 0 [vvar]
7ffdb0d0f000-7ffdb0d11000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
The first column is the virtual memory address range. The second is the read/write/execute permission bits. For example, you can see that the "stack" and "heap" are not executable. You can see more info about these columns in man proc.
5. Rewriting memory using the map
Let us write a dumb program to demonstrate the core functionality of the rewrite. In zpoline, they did an intelligent move of disassembling the loaded code and searching for the things to replace. Let us do something much simpler in this section.
Here is the program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>

void modify_heap(void)
{
    // keep fopen outside the assert so it still runs when built with NDEBUG
    FILE *fp = fopen("/proc/self/maps", "r");
    assert(fp != NULL);
    char buf[4096];
    // gets line by line
    while (fgets(buf, sizeof(buf), fp) != NULL) {
        // if this line has heap somewhere, we should be on the right entry
        if (strstr(buf, "heap") != NULL) {
            char addr[65] = { 0 };
            // extract out the first column from the proc map entry
            char *c = strtok(buf, " ");
            strncpy(addr, c, sizeof(addr) - 1);
            // replace the "-" in the x-y string with a null byte
            size_t k;
            for (k = 0; k < strlen(addr); k++) {
                if (addr[k] == '-') {
                    addr[k] = '\0';
                    break;
                }
            }
            long from, to;
            from = strtol(&addr[0], NULL, 16);
            to = strtol(&addr[k + 1], NULL, 16);
            printf("From: %ld, To: %ld\n", from, to);
            // overwrite the first 1000 bytes so that every 4-byte int reads as 42
            for (char *loc = (char *)from; loc < (char *)from + 1000; loc += 4) {
                *loc = 0x2a;
                *(loc + 1) = 0x00;
                *(loc + 2) = 0x00;
                *(loc + 3) = 0x00;
            }
            // we have done what we came to do - no need to look at other entries
            break;
        }
    }
    // fp is deliberately left open: its FILE structure lives on the heap
    // and may just have been overwritten
}

int main(void)
{
    void *beg = sbrk(0);
    int *a = malloc(sizeof(int));
    *a = 10;
    printf("a = %d\n", *a);
    // pointers are cast to long so the addresses print in decimal
    printf("beg = %ld\n", (long)beg);
    printf("addr a = %ld\n", (long)a);
    printf("delta = %ld\n", (long)((char *)a - (char *)beg));
    modify_heap();
    printf("a = %d\n", *a);
    getc(stdin);
    free(a);
}
At a high level, this is what we are doing:
- Using sbrk(0) to get the beginning of the heap. sbrk(0) returns the current program break; since we call it before any allocation, that is where the heap starts (look at man sbrk for more info. Point is, sbrk simply extends the data segment). Comparing this with the malloc'ed address and checking the delta. I had naively assumed that it would be 0, or close to it. That wasn't the case. It would be nice to understand how exactly malloc works and why this memory is being reserved. But we don't have the time for that now.
- Getting the /proc/self/maps view into memory and getting to the heap entry. This code is taken as-is from the zpoline implementation and simplified for our use case. Now, in this particular case, to modify the heap, we don't need to look at the proc maps - we directly have the sbrk(0) value. But this is for illustration purposes.
- Finally, overriding the memory with our own values. We are doing a brute-force override of the first 1000 bytes at the beginning of the heap section with a repeating sequence of bytes: one 0x2a followed by three 0x00. What this is will become clear when you run the program.
If you run this program, you get:
$ ./a.out
a = 10
beg = 37306368
addr a = 37307040
delta = 672
From: 37306368, To: 37441536
a = 42
<waiting for input>
Let us see the output line by line:
- We create a variable a on the heap. Its value is initially 10.
- The beginning of the heap is at address 37306368 in this instance, using sbrk.
- The int pointer a is allocated the address 37307040.
- The delta between these two is 672. (This is the part I can't explain for now).
- The modify_heap function opens the proc map and finds the heap entry. It extracts out the from and to addresses. You can confirm that the from address is the same as what sbrk returned.
- The modify_heap function overrides the first 1000 bytes of the heap with our custom values. (If you have a larger delta than 1000 on your system, the next step won't work. Retry by changing the range in the modify_heap function.)
- Finally, we print the value of a, which is now 42. How?
This is because 0x2a is 42 and my system is Little Endian.
$ lscpu | grep Endian
Byte Order: Little Endian
See this diagram on Wikipedia, if it helps understand the byte filling.
Now, while the program is running, you can manually go and check /proc/<pid>/maps (get the pid using ps or pgrep).
There is one final thing here. If you Ctrl-C the program now, everything is fine. If you actually enter a character like "a" and hit Enter (since we used getc in the program), you will see the following:
munmap_chunk(): invalid pointer
Aborted (core dumped)
This is to be expected, since we went and messed up the entire start of the heap, including parts which malloc must be using for its bookkeeping. It will be interesting to understand exactly how malloc uses the heap and work around this, but this is a topic for another day.
6. Creating the trampoline: mmaping address 0
We saw in the previous section how to go about rewriting arbitrary addresses. But the trampoline is not at an arbitrary address; it is at the bottom of the memory region, right at 0.
The difference between these two cases is that in one case, the address is already mapped. The address at 0 is definitely not mapped.
Since there are other problems with using address 0, as detailed in the paper, let us work with some other address. In this section, we will aim to:
- Use mmap to map an arbitrary address.
- Programmatically load in some executable in direct binary form.
- Run the loaded function.
As you can see, this is a super simplified version of the trampoline. In the trampoline, the rewritten syscall will auto trigger the loaded function. In our case, let us do this manually.
6.1. Preparing our function
Let us start simple, with an add function.
int add(int a, int b)
{
    return a + b;
}

int main()
{
}
You might wonder, what is the point of this program? The point is, we shall compile it and disassemble it to get the binary instructions. In my case, it looks like the following:
0000000000400536 <add>:
  400536: 55 push %rbp
  400537: 48 89 e5 mov %rsp,%rbp
  40053a: 89 7d fc mov %edi,-0x4(%rbp)
  40053d: 89 75 f8 mov %esi,-0x8(%rbp)
  400540: 8b 55 fc mov -0x4(%rbp),%edx
  400543: 8b 45 f8 mov -0x8(%rbp),%eax
  400546: 01 d0 add %edx,%eax
  400548: 5d pop %rbp
  400549: c3 retq
Keep this aside, it will be used later - to be loaded into our runner.
6.2. The main program
Here, we will write a simple version of the program which maps an arbitrary address, writes some binary data into it and then executes it.
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// helper to wait for user input
void uwait(void)
{
    printf("Press y to continue...\n");
    getc(stdin);
    int c;
    while ((c = getchar()) != '\n' && c != EOF) { }
}

int (*myfun)(int a, int b);

int main(void)
{
    char *mem;
    printf("Check maps now using: cat /proc/%d/maps\n", getpid());
    uwait();

    /* allocate one page at the fixed virtual address 0x32000 (zpoline itself uses 0) */
    mem = mmap((void *)0x32000, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
               MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
    if (mem == MAP_FAILED) {
        fprintf(stderr, "map failed\n");
        fprintf(stderr, "NOTE: /proc/sys/vm/mmap_min_addr should be set 0\n");
        exit(1);
    }
    printf("Mmap done!\n");
    printf("Check maps now using: cat /proc/%d/maps\n", getpid());
    uwait();

    // hold pointer to starting point, since we will keep moving the mem pointer
    void *base = mem;

    // macro to help load data: write one byte, then advance the pointer
    #define W(d) do { *mem = (char)(d); mem++; } while (0)

    // load binary in (the bytes of the add function from section 6.1)
    W(0x55);
    W(0x48); W(0x89); W(0xe5);
    W(0x89); W(0x7d); W(0xfc);
    W(0x89); W(0x75); W(0xf8);
    W(0x8b); W(0x55); W(0xfc);
    W(0x8b); W(0x45); W(0xf8);
    W(0x01); W(0xd0);
    W(0x5d);
    W(0xc3);
    printf("Loaded binary.\n");

    // execute the loaded function now
    myfun = (int (*)(int, int))base;
    printf("Sum is: %d\n", myfun(12, 30));
}
This is a bit long, so let us understand this step by step.
- The function uwait is just a helper to pause the program. This is useful to examine proc maps in between the processing.
- We call mmap to map 1 page - 4 KB, i.e. 4096 bytes (0x1000 in hexadecimal) - at the location 0x32000. Note that it is important to choose a start address which aligns with the page size, or an error is returned. On my system, 0 didn't work even with mmap_min_addr set to 0, but I will ignore that for the moment. The value 0x32000 is chosen to not clash with any existing entry in the maps.
- Once we have the memory, we manually fill it byte-by-byte with the binary data we already have from examining the earlier compiled add function. To do this, we use a helper macro which simply fills the current location "mem" and increments the pointer. Since we are hardcoding the binary, this will obviously only work on the same instruction set and hardware combination. You can compile the add function on your system, look at the produced binary and change this if needed.
- We create a function pointer (with a matching signature) and point it to the base location of this newly mapped memory. Finally we call the function.
When you run it, you will see the following:
(base) chandergovind:zpoline$ ./a.out
Check maps now using: cat /proc/1238745/maps
Press y to continue...
y
Mmap done!
Check maps now using: cat /proc/1238745/maps
Press y to continue...
y
Loaded binary.
Sum is: 42
Look at that! We calculated the sum of 12 and 30 by running them through our loaded binary. I keep wanting to write code, but as you see, it is not really code. Maybe compiled code, or binary instructions.
If you examine the proc maps file at the 2 points where the program prompts you to, you will see 1 line of difference: the first one, pointing to address 0x32000.
00032000-00033000 rwxp 00000000 00:00 0
00400000-00401000 r-xp 00000000 fd:02 268576133 /home/chandergovind/Documents/Dabblings/zpoline/a.out
00600000-00601000 r--p 00000000 fd:02 268576133 /home/chandergovind/Documents/Dabblings/zpoline/a.out
00601000-00602000 rw-p 00001000 fd:02 268576133 /home/chandergovind/Documents/Dabblings/zpoline/a.out
009ad000-009ce000 rw-p 00000000 00:00 0 [heap]
7f86c864e000-7f86c8809000 r-xp 00000000 fd:02 1673184 /usr/lib64/libc-2.28.so
...
All other lines are the same, obviously. Notice how the "read", "write" and "execute" protection bits are all set. If we didn't set the "x" bit, we wouldn't be able to execute the code like we just did.
7. The disasm library
Since this is a new topic to me, let us do this in a separate part 2.
8. dlmopen
The paper uses dlmopen to load the actual hook functions, which should NOT be re-written. Let us do this in part 2.
9. Creating a zpoline launcher
The paper mentions an alternative approach to LD_PRELOAD needed for static binaries, though this is not there in the repo.
Check it for yourself. Create the simple constr library, and try to run it against a simple static binary like we did earlier. You will find that the library is never loaded.
Instead, we would need a different way to rewrite the given input program. Let us do this in Part 3.
10. Conclusion
This was a long post. Hopefully, this gives you some useful information - I certainly learnt a number of things as I worked through zpoline.