/mri-iseq

A library and CLI tool to interact with MRI instruction sequences.

Primary LanguageRustGNU Affero General Public License v3.0AGPL-3.0

The MRI Instruction Sequences Tool

The MRI Instruction Sequence Format

This section is a reference to the Ruby instruction sequence format. It is also know as IBF, which I believe stands for internal binary format. You can find this acronym both in the MRI source code as well as in this program. An IBF file is made of multiple sections, namely the header, the data, and the lists.

The Header

The header is made of 12 fields. Before I list them, it must be understood that when I write unsigned int, I really mean unsigned int, whatever it means on your platform, and not something like a 4 byte, little endian number. This format was designed to cache instruction sequence and be as fast as possible to load. The tradeoff that it is not portable between machines.

Magic number
Always the four byte string YARB.
Major version number
An unsigned int that matches the major version of MRI.
Minor version number
An unsigned int that matches the minor version of MRI.
Size
The total size of the file in bytes, including this header, as an unsigned int.
Extra size
???
List size
The number of entries in the instruction sequence, id, or object list, as an unsigned int.
List offset
The offset in bytes as an unsigned int of the instruction sequence, id, or object list.
Platform
A null terminated, C-style string that matches RUBY_PLATFORM.
Magic numberMajor version number
Minor version numberSize
Extra sizeInstruction sequence list size
Id list sizeObject list size
Instruction sequence list offsetId list offset
Object list offsetPlatform

The Lists

Instruction sequence and object lists are sequences of integer entries. Each of them is an offset in bytes in the IBF file, encoded as an unsigned int

Id lists are lists of long entries. They are offsets, in number of entries and not bytes, into the object list. The object they are pointing to is the string representation of the id.

The Data

Instruction Sequences

Optional Arguments Labels

Every instruction sequence has a list of labels for optional arguments. In the common case where there are no optional arguments, there is a single label pointing to the beginning of the sequence. More generally, if there are $n$ optional arguments, then this list contains $n + 1$ labels. The instructions are meant to be layed out so the first optional argument is set to its default value first, so that the interpreter jumps to the label at position $m$, where $m$ is the number of optional arguments given.

Take for example the program

def a(x = 1, y = 2, z = 3)
  x
end

Which yields the following instructions when compiled.

local table (size: 3, argc: 0 [opts: 3, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 3] x<Opt=0>   [ 2] y<Opt=3>   [ 1] z<Opt=7>
0000 putobject_OP_INT2FIX_O_1_C_                                      (   1)
0001 setlocal_OP__WC__0 x
0003 putobject        2
0005 setlocal_OP__WC__0 y
0007 putobject        3
0009 setlocal_OP__WC__0 z
0011 getlocal_OP__WC__0 x[LiCa]
0013 leave            [Re]

The interpreter is meant to jump to offset 0 if 3 optional arguments are missing, offset 3 if 2 are missing, offset 7 if 1 is missing, and offset 11 (not shown) if none are missing.

Objects

The following objects can be stored in an IBF file. An up to date list can be found in load_object_functions in compile.c.

Any “special const”
This include values such as true, false, fixnums, etc.
T_CLASS
Ruby classes, by name.
T_FLOAT
Double precision, architecture specific floating point numbers.
T_STRING
Strings, content and encoding.
T_REGEXP
Regular expressions.
T_ARRAY
Arrays.
T_HASH
Hashes.
T_STRUCT
Structures.
T_BIGNUM
Arbitrary precision integers.
T_DATA
???
T_COMPLEX
Complex number.
T_RATIONAL
Rational number.
T_SYMBOL
Symbol.

Of course, composite types (arrays, hashes, and structures) must only contain values that are also storable.

Glossary

Id

It is the untagged representation of interned strings in MRI, usually called symbols.