Tiny x64 Hello World
A step by step adventure to find out how small a x64 binary can be which prints "hello, world".
- OS: CentOS 7, Linux 3.10.0-862.el7.x86_64
- GCC: gcc (Homebrew GCC 5.5.0_7) 5.5.0
Overview
make build
: build all stepsmake list
: list binary size of all stepsfinal/hello.out
is our final result, and it's 170 bytes 🎉
$ make build && make list
-rwxr-xr-x 1 root root 16712 Dec 4 00:08 step0/hello.out
-rwxr-xr-x 1 root root 14512 Dec 4 00:08 step1/hello.out
-rwxr-xr-x 1 root root 14512 Dec 4 00:08 step2/hello.out
-rwxr-xr-x 1 root root 13664 Dec 4 00:08 step3/hello.out
-rwxr-xr-x 1 root root 12912 Dec 4 00:08 step4/hello.out
-rwxr-xr-x 1 root root 584 Dec 4 00:08 step5/hello.out
-rwxr-xr-x 1 root root 440 Dec 4 00:08 step6/hello.out
-rwxr-xr-x 1 root root 170 Dec 4 00:08 step7/hello.out
-rwxr-xr-x 1 root root 170 Dec 4 00:08 final/hello.out
Step0
This is our first try, the good old program to print "hello, world".
#include <stdio.h>
int
main()
{
printf("hello, world\n");
return 0;
}
cd step0 && make build
and we get our first hello.out
binary.
Unfortunately, it's too big, 16712 bytes!
Step1: Strip Symbols
Let's take an easy move to strip all the symbols.
Let gcc -c
do the work and now we get out new binary, it's 14512 bytes.
Still big, but hey, we certainly make a progress, do hurry 😉.
Step2: Optimization
Modern compilers can do a lot of "magic" to optimize our program, let's give it a try.
gcc -O3
enable the maximum optimization level and we will find oud that our binary size keeps the same 😢, 14512 bytes.
It actually makes sense though. Our program is too simple, there isn't any room left to optimize.
Step3: Remove Startup Files
Our C program always starts with main
, but Do you ever wonder who calls main
?
It turns out the main
function is being called by something called crt
, the C runtime library which is implemented by the compiler.
If we remove it, our binary must be smaller, right?
We need to change our program a little bit.
- Let's change our entry function name to
nomain
to make the fact more obviously that we don't use crt - Since we don't use crt, we need to explicitly use system call to exit
#include <stdio.h>
#include <unistd.h>
int
nomain()
{
printf("hello, world\n");
_exit(0);
}
You must wonder why we are using _exit
rather than exit
? Good old StackOverflow always helps, check What is the difference between using _exit() & exit() in a conventional Linux fork-exec?
.
Use gcc -e nomain -nostartfiles
to compiler our program and now our binary is 13664 bytes.
We are making a progress again!
Step4: Remove Standard Library
We can go more wilder. We don't need the crt to do the startup, why do we need to use printf
to print? We can certainly do it on our own!
To print something to the terminal, we need to use the write
system call. Here is the full x64 system call table.
To directly invoke system call in C, we need to use inline assembly.
char *str = "hello, world\n";
void
myprint()
{
asm("movq $1, %%rax \n"
"movq $1, %%rdi \n"
"movq %0, %%rsi \n"
"movq $13, %%rdx \n"
"syscall \n"
: // no output
: "r"(str)
: "rax", "rdi", "rsi", "rdx");
}
void
myexit()
{
asm("movq $60, %rax \n"
"xor %rdi, %rdi \n"
"syscall \n");
}
int
nomain()
{
myprint();
myexit();
}
Looks kind of messy, but it's actually very simple code.
gcc -nostdlib
to tell GCC we don't want the standard library since we are cool enough to do all the tings by ourselves.
And we get 12912 bytes.
Our program doesn't depend on anyting now, but it's still very big, why????
Step5: Custom Linker Script
Let's examine sctions of our binary.
$ readelf -S -W step4/hello.out
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000401000 001000 00006e 00 AX 0 0 16
[ 2] .rodata PROGBITS 0000000000402000 002000 00000e 01 AMS 0 0 1
[ 3] .eh_frame_hdr PROGBITS 0000000000402010 002010 000024 00 A 0 0 4
[ 4] .eh_frame PROGBITS 0000000000402038 002038 000054 00 A 0 0 8
[ 5] .data PROGBITS 0000000000404000 003000 000008 00 WA 0 0 8
[ 6] .comment PROGBITS 0000000000000000 003008 000022 01 MS 0 0 1
[ 7] .shstrtab STRTAB 0000000000000000 00302a 000040 00 0 0 1
Off
is kind of strange, some sections start with very big offset.
Maybe linker does some alignment? Check ld --verbose
and yes it does!
So our binary is so big because of alignment, if we use xxd
to see the binary content we can see that there are a lot of zeroes.
Time to write our own linker script link.lds
.
ENTRY(nomain)
SECTIONS
{
. = 0x8048000 + SIZEOF_HEADERS;
tiny : { *(.text) *(.data) *(.rodata*) }
/DISCARD/ : { *(*) }
}
gcc -T link.lds
and we get 584 bytes, a huge step 🔥.
Step6: Assembly
Can we do better? There is nothing we can do inside the C world, it's time to move to the lower level.
Let's write some assembly code! It sounds terrfying, but just give it a try, you will find it's actually very interesting.
We are the God of computer, we can control everyting!
section .data
message: db "hello, world", 0xa
section .text
global nomain
nomain:
mov rax, 1
mov rdi, 1
mov rsi, message
mov rdx, 13
syscall
mov rax, 60
xor rdi, rdi
syscall
Use nasm -f elf64
to assemble our code and we get 440 bytes.
Step7: Handmade Binary
Is there anyting we can do now? We are at the lowest level, there is no "lower-level" for us to go.
There is no room for our code, but the binary that runs on OS is not just the code. It is a file format called ELF and it contains some extra info.
So maybe we can do something to shrink that extra info?
Or maybe we can write the ELF from scratch? This way, we can control every bit of our binary.
BITS 64
org 0x400000
ehdr: ; Elf64_Ehdr
db 0x7f, "ELF", 2, 1, 1, 0 ; e_ident
times 8 db 0
dw 2 ; e_type
dw 0x3e ; e_machine
dd 1 ; e_version
dq _start ; e_entry
dq phdr - $$ ; e_phoff
dq 0 ; e_shoff
dd 0 ; e_flags
dw ehdrsize ; e_ehsize
dw phdrsize ; e_phentsize
dw 1 ; e_phnum
dw 0 ; e_shentsize
dw 0 ; e_shnum
dw 0 ; e_shstrndx
ehdrsize equ $ - ehdr
phdr: ; Elf64_Phdr
dd 1 ; p_type
dd 5 ; p_flags
dq 0 ; p_offset
dq $$ ; p_vaddr
dq $$ ; p_paddr
dq filesize ; p_filesz
dq filesize ; p_memsz
dq 0x1000 ; p_align
phdrsize equ $ - phdr
_start:
mov rax, 1
mov rdi, 1
mov rsi, message
mov rdx, 13
syscall
mov rax, 60
xor rdi, rdi
syscall
message: db "hello, world", 0xa
filesize equ $ - $$
nasm -f bin
to bake our binary and our final result is 170 bytes.
Final Binary Anatomy
And now, we reach the final limit, 170 bytes, there is no way to reduce that any more.
PS: Actually, there is, check the post A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux. I am not gonna use techniques in this post, because they are so "hack".
Now let's see what exactly every byte does in our 170-bytes final binary.
# ELF Header
00: 7f 45 4c 46 02 01 01 00 # e_ident
08: 00 00 00 00 00 00 00 00 # reserved
10: 02 00 # e_type
12: 3e 00 # e_machine
14: 01 00 00 00 # e_version
18: 78 00 40 00 00 00 00 00 # e_entry
20: 40 00 00 00 00 00 00 00 # e_phoff
28: 00 00 00 00 00 00 00 00 # e_shoff
30: 00 00 00 00 # e_flags
34: 40 00 # e_ehsize
36: 38 00 # e_phentsize
38: 01 00 # e_phnum
3a: 00 00 # e_shentsize
3c: 00 00 # e_shnum
3e: 00 00 # e_shstrndx
# Program Header
40: 01 00 00 00 # p_type
44: 05 00 00 00 # p_flags
48: 00 00 00 00 00 00 00 00 # p_offset
50: 00 00 40 00 00 00 00 00 # p_vaddr
58: 00 00 40 00 00 00 00 00 # p_paddr
60: aa 00 00 00 00 00 00 00 # p_filesz
68: aa 00 00 00 00 00 00 00 # p_memsz
70: 00 10 00 00 00 00 00 00 # p_align
# Code
78: b8 01 00 00 00 # mov $0x1,%eax
7d: bf 01 00 00 00 # mov $0x1,%edi
82: 48 be 9d 00 40 00 00 00 00 00 # movabs $0x40009d,%rsi
8c: ba 0d 00 00 00 # mov $0xd,%edx
91: 0f 05 # syscall
93: b8 3c 00 00 00 # mov $0x3c,%eax
98: 48 31 ff # xor %rdi,%rdi
9b: 0f 05 # syscall
9d: 68 65 6c 6c 6f 2c 20 77 6f 72 6c 64 0a # "hello, world\n"