CensoredUsername/dynasm-rs

Use smaller operand size PC-relative jumps where possible

eira-fransham opened this issue · 1 comments

I'm not deeply familiar with assembly, assemblers or compilers, but I believe that one optimisation that might be useful to do in dynasm (and that isn't really possible to do as a user of the library) is branch relaxation. Essentially, this means using the branch instructions that take a PC-relative offset of 8 bits or 16 bits when possible, instead of always using pointer-sized absolute immediate jumps. The reasoning is that these are faster (and smaller?) for the common case of a same-function jump over only a few instructions. AFAICT this isn't done by dynasm right now: it always emits a jump with an immediate of a fixed size, and then patches only the value of the immediate.

Of course, as I said, I'm not too familiar with assemblers, so it's possible that these non-PC-relative jumps are always faster for non-PIE code and that the 8-bit PC-relative jumps are only generated by LLVM etc. because they need to generate PIEs, whereas dynasm does not.

Dynasm actually does support short jumps. In x86 mode you can use 8-bit, 16-bit and 32-bit PC-rel jumps. In x64 mode you can use 8-bit PC-rel and 32-bit PC-rel jumps (no 16-bit rel, as Intel doesn't support those in 64-bit mode for some reason; AMD does support them).

However, the assembler architecture requires the user to indicate the wanted size, e.g. jmp BYTE >label. This is because dynasm is a single-pass assembler: the size of each instruction must be known when it is first evaluated, and relocations emitted later cannot change the size of instructions that were already emitted. If this weren't the case, the offsets of instructions would shift around as the actual label definitions happen.
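To illustrate why the size has to be fixed up front, here is a minimal sketch of a single-pass emitter with forward-label patching. It is not dynasm's implementation: `ToyAssembler`, `JumpSize`, `Reloc` and all method names are hypothetical, and only the two standard `jmp` encodings (`EB rel8` and `E9 rel32`) are modelled.

```rust
/// Width of a jump's displacement field, chosen by the caller up front
/// (hypothetical type, loosely mirroring `jmp BYTE >label` vs `jmp >label`).
#[derive(Clone, Copy)]
enum JumpSize {
    Byte,
    Dword,
}

/// A forward jump whose displacement field still has to be patched.
struct Reloc {
    field_offset: usize,
    size: JumpSize,
    label: &'static str,
}

/// Toy single-pass assembler: bytes are appended once and never moved.
#[derive(Default)]
struct ToyAssembler {
    code: Vec<u8>,
    pending: Vec<Reloc>,
}

impl ToyAssembler {
    /// Emit `jmp` to a label defined later. The encoding length is decided
    /// here; only the displacement bytes are filled in afterwards.
    fn jmp_forward(&mut self, size: JumpSize, label: &'static str) {
        let (opcode, width) = match size {
            JumpSize::Byte => (0xEB, 1),  // jmp rel8
            JumpSize::Dword => (0xE9, 4), // jmp rel32
        };
        self.code.push(opcode);
        self.pending.push(Reloc { field_offset: self.code.len(), size, label });
        self.code.extend(std::iter::repeat(0u8).take(width));
    }

    /// Emit `n` one-byte `nop` instructions as filler.
    fn nops(&mut self, n: usize) {
        self.code.extend(std::iter::repeat(0x90u8).take(n));
    }

    /// Define `label` at the current offset and patch every pending jump to
    /// it. Fails if a displacement does not fit the size chosen earlier,
    /// because the already emitted instruction cannot grow.
    fn define_label(&mut self, label: &'static str) -> Result<(), String> {
        let target = self.code.len() as i64;
        let (matching, rest): (Vec<Reloc>, Vec<Reloc>) =
            std::mem::take(&mut self.pending)
                .into_iter()
                .partition(|r| r.label == label);
        self.pending = rest;
        for r in matching {
            let at = r.field_offset;
            match r.size {
                JumpSize::Byte => {
                    // Displacement is relative to the end of the instruction.
                    let disp = i8::try_from(target - (at as i64 + 1))
                        .map_err(|_| format!("jump to {label} does not fit in 8 bits"))?;
                    self.code[at] = disp as u8;
                }
                JumpSize::Dword => {
                    let disp = i32::try_from(target - (at as i64 + 4))
                        .map_err(|_| format!("jump to {label} does not fit in 32 bits"))?;
                    self.code[at..at + 4].copy_from_slice(&disp.to_le_bytes());
                }
            }
        }
        Ok(())
    }
}
```

Growing an 8-bit jump to 32 bits after the fact would move every byte behind it, which would invalidate offsets that were already recorded. That is why this sketch reports an error instead.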

So branch relaxation is possible, but the user of the library is responsible for implementing it. Besides that, the benefits aren't that big (especially in x64 mode, where all immediate jumps are relative). An argument could be made about instruction cache usage, but if you're optimizing to that extent, a single-pass assembler is probably not what you want.
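As a rough sketch of what user-side relaxation could look like in the easy case: for a backward jump the target offset is already known, so the caller can check whether the displacement fits in 8 bits before picking an encoding. The helper below is hypothetical and works on a plain byte buffer rather than through dynasm. With dynasm, the same decision would presumably select between the BYTE-sized and the default form of the jump.

```rust
/// Append the smallest `jmp` encoding that reaches `target`, an offset into
/// `code` that is already known (e.g. a backward jump). Returns the number
/// of bytes emitted. Hypothetical helper, not part of dynasm.
fn emit_jmp_to_known_target(code: &mut Vec<u8>, target: usize) -> usize {
    let here = code.len() as i64;
    let target = target as i64;

    // Short form: EB rel8, 2 bytes, displacement relative to the next instruction.
    if let Ok(disp) = i8::try_from(target - (here + 2)) {
        code.push(0xEB);
        code.push(disp as u8);
        return 2;
    }

    // Near form: E9 rel32, 5 bytes.
    let disp = i32::try_from(target - (here + 5))
        .expect("jump target out of rel32 range");
    code.push(0xE9);
    code.extend_from_slice(&disp.to_le_bytes());
    5
}
```

Forward jumps are harder, since the distance is not known when the jump is emitted. The caller would need an upper bound on the size of the code in between, or would have to assemble the block twice.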