x86 assembler from scratch.

Primary language: Rust. License: GNU General Public License v3.0 (GPL-3.0).

jvo-asm

This is a toy x86 assembler written from scratch. It was written to gain a better understanding of how machine code and executable files work. Its syntax uses a lot of emojis because why not?

Usage

Using the print example:

$ cargo run -- examples/print.jas
hi!

Features

Constants

🖊LINUX_SYSCALL $128
# ...
❗ LINUX_SYSCALL

Comments

# I'm a comment
🦘= ✉exit

Addressing

Immediate addressing

⚫ ⬅ $8

Loads 8 into ⚫.

Register addressing

🔴 ⬅ 🔵

Copies data from 🔵 into 🔴.

Direct addressing

📗my_number 3
# ...
🔴 ⬅ my_number

This loads 3 into 🔴.

Indirect addressing

🔴 ⬅ $0~🔵

This loads the value at the address contained in 🔵 into 🔴.

Base pointer addressing

🔴 ⬅ $4~🔵

Or alternatively with a constant:

🖊ST_ARG $8
# ...
🔴 ⬅ ST_ARG~🔵

This is similar to indirect addressing except that it adds a constant offset to the address in 🔵.
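
To make this concrete, here is a sketch of how a load with a base register and a small offset can be encoded by hand under the standard x86 ModRM rules. The `Reg` enum and `mov_load_disp8` are hypothetical names and are not taken from the jvo-asm source. The sketch only handles 8-bit offsets and rejects %esp as a base, since that form needs an extra SIB byte.

```rust
/// x86 register numbers as they appear in a ModRM byte.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Reg {
    Eax = 0,
    Ecx = 1,
    Edx = 2,
    Ebx = 3,
    Esp = 4,
    Ebp = 5,
}

/// Encode `mov disp8(%base), %dst`: opcode 0x8B followed by a ModRM byte
/// with mod = 01 (8-bit displacement), reg = destination, rm = base.
fn mov_load_disp8(dst: Reg, base: Reg, disp: i8) -> Option<[u8; 3]> {
    if base == Reg::Esp {
        // rm = 100 selects a SIB byte, which this sketch does not emit.
        return None;
    }
    let modrm = (0b01u8 << 6) | ((dst as u8) << 3) | (base as u8);
    Some([0x8B, modrm, disp as u8])
}
```

For the first example above (destination %ebx, base %ecx, offset 4), `mov_load_disp8(Reg::Ebx, Reg::Ecx, 4)` yields the bytes `8B 59 04`.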

Labels

🦘 ✉exit
# ...
📪exit:
⚪ ⬅ $1
❗ LINUX_SYSCALL

Labels are defined by prefixing them with 📪 and ending them with a :. To refer to a label, prefix it with ✉ instead.

Data sections

📗numbers 3, 67, 34, 222, 45
# ...
🔵 ⬅ numbers

Data sections start with 📗 and can be referred to later by just their name.

Implementation notes

The main high-level function that processes a file is process. First, the code is split into separate lines, and each line is tokenized into a vector of TokenType. ConstantReferences are then replaced by their constants, and the vector is compiled into a vector of IntermediateCode. Intermediate code consists of bytes and displacements. This intermediate step is needed because a jump to an instruction further down the program cannot be encoded right away: when we encounter the jump, we don't yet know how far to jump. During this first step we keep track of the byte offset of each instruction in the program. Afterwards we iterate through the IntermediateCode and replace the displacements with bytes.
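
The second pass described above could look roughly like the following sketch. The variant names, the `resolve` function, and the choice of 32-bit relative displacements are assumptions made for illustration; the project's actual `IntermediateCode` type may differ.

```rust
use std::collections::HashMap;

/// Hypothetical intermediate representation: either a finished byte or a
/// placeholder for a 32-bit relative jump to a named label.
#[derive(Debug, Clone, PartialEq)]
enum IntermediateCode {
    Byte(u8),
    Displacement(String),
}

/// Second pass: replace each displacement with four little-endian bytes.
/// `labels` maps a label name to its byte offset, as recorded in the first pass.
/// Assumes the displacement is the last field of its instruction.
fn resolve(
    code: &[IntermediateCode],
    labels: &HashMap<String, u32>,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for item in code {
        match item {
            IntermediateCode::Byte(byte) => out.push(*byte),
            IntermediateCode::Displacement(name) => {
                let target = *labels
                    .get(name)
                    .ok_or_else(|| format!("unknown label: {}", name))?;
                // Relative jumps are measured from the end of the displacement
                // field, which is where the next instruction starts.
                let next = out.len() as u32 + 4;
                let relative = target.wrapping_sub(next) as i32;
                out.extend_from_slice(&relative.to_le_bytes());
            }
        }
    }
    Ok(out)
}
```

For instance, an unconditional jump (opcode `E9`) at offset 0 followed by two one-byte instructions, with the target label at offset 7, resolves to a displacement of 2. A jump back to offset 0 whose displacement field ends at offset 6 resolves to -6, encoded as `FA FF FF FF`.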

After this an ELF binary is built. Its layout is as follows (the multiple data sections example was used here):

$ readelf -a a.out
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x804b000
  Start of program headers:          52 (bytes into file)
  Start of section headers:          148 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         3
  Size of section headers:           40 (bytes)
  Number of section headers:         5
  Section header string table index: 4

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] pi                PROGBITS        08049000 001000 000014 00  WA  0   0  1
  [ 2] euler             PROGBITS        0804a000 002000 000014 00  WA  0   0  1
  [ 3] .code             PROGBITS        0804b000 003000 000019 00  AX  0   0  1
  [ 4] .shstrtab         STRTAB          00000000 000400 00001a 00      0   0  1

...

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x003000 0x0804b000 0x0804b000 0x00019 0x00019 R E 0x1000
  LOAD           0x001000 0x08049000 0x08049000 0x00014 0x00014 RW  0x1000
  LOAD           0x002000 0x0804a000 0x0804a000 0x00014 0x00014 RW  0x1000

 Section to Segment mapping:
  Segment Sections...
   00     .code
   01     pi
   02     euler

...

There's a program header entry for each data section (📗) and one for the executable code. Everything is padded to 4 KB (the virtual page size). To allow for linking, correct section headers are also generated.
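
The offsets and addresses in the listing are consistent with giving every segment its own 4 KB page. The sketch below reproduces them under two assumptions inferred from the readelf output rather than from the source code: the first page of the file is reserved for headers, and the load base is 0x08048000 (virtual address minus file offset). `Segment`, `align_up`, and `layout` are hypothetical names.

```rust
const PAGE_SIZE: u32 = 0x1000;
/// Assumed load base, inferred from the listing (VirtAddr - Offset).
const BASE_VADDR: u32 = 0x0804_8000;

#[derive(Debug, PartialEq)]
struct Segment {
    file_offset: u32,
    vaddr: u32,
    size: u32,
}

/// Round `n` up to the next multiple of PAGE_SIZE.
fn align_up(n: u32) -> u32 {
    (n + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE
}

/// Place each segment at the start of its own page, in the given order.
fn layout(sizes: &[u32]) -> Vec<Segment> {
    // Assumption: the first page holds the ELF, program, and section headers.
    let mut offset = PAGE_SIZE;
    let mut segments = Vec::new();
    for &size in sizes {
        segments.push(Segment {
            file_offset: offset,
            vaddr: BASE_VADDR + offset,
            size,
        });
        // Even an empty segment advances by one page in this sketch.
        offset += align_up(size.max(1));
    }
    segments
}
```

Calling `layout(&[0x14, 0x14, 0x19])` gives the file offsets 0x1000, 0x2000, 0x3000 and the virtual addresses 0x08049000, 0x0804a000, 0x0804b000, matching pi, euler, and .code in the listing.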

Instruction reference

Registers

| Symbol | Name |
| --- | --- |
| ⚪ | %eax |
| 🔴 | %ebx |
| 🔵 | %ecx |
| ⚫ | %edx |
| ◀ | %esp |
| ⬇ | %ebp |
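
As an illustration of how these symbols relate to machine code, the sketch below maps each register symbol to its standard x86 register number and encodes an immediate load. Both function names are hypothetical and the code is not taken from the jvo-asm source; only the register numbers and the `B8+r` opcode form come from the x86 instruction set.

```rust
/// Map a register symbol to its x86 register number.
fn register_number(symbol: &str) -> Option<u8> {
    match symbol {
        "⚪" => Some(0), // %eax
        "🔵" => Some(1), // %ecx
        "⚫" => Some(2), // %edx
        "🔴" => Some(3), // %ebx
        "◀" => Some(4), // %esp
        "⬇" => Some(5), // %ebp
        _ => None,
    }
}

/// Encode `mov $imm, %reg`: opcode 0xB8 plus the register number,
/// followed by the immediate as four little-endian bytes.
fn mov_imm32(reg: u8, imm: u32) -> [u8; 5] {
    debug_assert!(reg < 8, "register number must fit in three bits");
    let mut bytes = [0xB8 + reg, 0, 0, 0, 0];
    bytes[1..].copy_from_slice(&imm.to_le_bytes());
    bytes
}
```

Loading 8 into ⚫ (%edx, register number 2) would then encode as `BA 08 00 00 00`.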

Instructions

| Symbol | Example | Description |
| --- | --- | --- |
| ↩ | ↩ | Return from a function |
| 📞 | 📞 fn | Call function |
| ➕ | ⚪ ➕ ⚫ | ⚪ += ⚫ |
| ➖ | ⚪ ➖ ⚫ | ⚪ -= ⚫ |
| ✖ | ⚪ ✖ ⚫ | ⚪ *= ⚫ |
| ⬅ | 🔴 ⬅ $1 | Move into register |
| ❗ | ❗ $128 | Interrupt |
| ⚖ | ⚖ ⚫, ⚪ | Compare ⚫ to ⚪ |
| 🦘= | 🦘= ✉exit | Jump if equal |
| 🦘≠ | 🦘≠ ✉exit | Jump if not equal |
| 🦘< | 🦘< ✉exit | Jump if less than |
| 🦘≤ | 🦘≤ ✉exit | Jump if less or equal |
| 🦘> | 🦘> ✉exit | Jump if greater than |
| 🦘≥ | 🦘≥ ✉exit | Jump if greater or equal |
| 🦘 | 🦘 ✉exit | Unconditional jump |
| 📥 | 📥 $8 | Push onto stack |
| 📤 | 📤 🔵 | Pop from stack |
| 🖊 | 🖊c $4 | Define constant c to be 4 |
| 📪 (ends with :) | 📪exit: | Define a label with name exit |
| 📗 | 📗pi 3, 1, 4 | Define a data section pi containing 3 integers |
| ✉ | ✉exit | Refer to a previously defined (📪) exit label |
| $ | $1 | 1 is a number |
| # | # hi! | hi! is a comment |
| [0-9]+ | 1 | 1 is a memory address |
| [aA-zZ]+ | constant | constant is a previously defined (🖊, 📗) constant |