BioJulia/Automa.jl

Add support for multi-byte transitions

Opened this issue · 3 comments

Automa produces bad code to match string literals.
The following machine: compile(re"abracadabra")
produces this code:

Machine code
        .text
        subq    $24, %rsp
        movq    (%rdi), %rdx
        leaq    8(%rdi), %rcx
        movq    %rcx, 8(%rsp)
        movq    %rdx, 16(%rsp)
        testq   %rdx, %rdx
        je      L267
        movb    (%rcx), %al
        cmpb    $97, %al
        jne     L277
        cmpq    $2, %rdx
        jb      L287
        movb    1(%rcx), %al
        cmpb    $98, %al
        jne     L297
        cmpq    $3, %rdx
        jb      L307
        movb    2(%rcx), %al
        cmpb    $114, %al
        jne     L314
        cmpq    $4, %rdx
        jb      L324
        movb    3(%rcx), %al
        cmpb    $97, %al
        jne     L331
        cmpq    $5, %rdx
        jb      L341
        movb    4(%rcx), %al
        cmpb    $99, %al
        jne     L348
        cmpq    $6, %rdx
        jb      L355
        movb    5(%rcx), %al
        cmpb    $97, %al
        jne     L362
        cmpq    $7, %rdx
        jb      L369
        movb    6(%rcx), %al
        cmpb    $100, %al
        jne     L376
        cmpq    $8, %rdx
        jb      L383
        movb    7(%rcx), %al
        cmpb    $97, %al
        jne     L390
        cmpq    $9, %rdx
        jb      L397
        movb    16(%rdi), %al
        cmpb    $98, %al
        jne     L404
        cmpq    $10, %rdx
        jb      L411
        movb    9(%rcx), %al
        cmpb    $114, %al
        jne     L418
        cmpq    $11, %rdx
        jb      L425
        movb    10(%rcx), %al
        cmpb    $97, %al
        jne     L462
        movb    $1, %al
        addq    $24, %rsp
        retq
L267:
        movl    $1, %ecx
        jmp     L430
L277:
        movl    $1, %esi
        jmp     L477
L287:
        movl    $2, %ecx
        jmp     L430
L297:
        movl    $2, %esi
        jmp     L477
L307:
        movl    $3, %ecx
        jmp     L430
L314:
        movl    $3, %esi
        jmp     L477
L324:
        movl    $4, %ecx
        jmp     L430
L331:
        movl    $4, %esi
        jmp     L477
L341:
        movl    $5, %ecx
        jmp     L430
L348:
        movl    $5, %esi
        jmp     L477
L355:
        movl    $6, %ecx
        jmp     L430
L362:
        movl    $6, %esi
        jmp     L477
L369:
        movl    $7, %ecx
        jmp     L430
L376:
        movl    $7, %esi
        jmp     L477
L383:
        movl    $8, %ecx
        jmp     L430
L390:
        movl    $8, %esi
        jmp     L477
L397:
        movl    $9, %ecx
        jmp     L430
L404:
        movl    $9, %esi
        jmp     L477
L425:
        movl    $11, %ecx
L430:
        movabsq $throw_input_error, %rax
        movabsq $139695384118736, %rdi          # imm = 0x7F0D5DBF45D0
        leaq    8(%rsp), %rdx
        movq    %rcx, %rsi
        callq   *%rax
        ud2
L462:
        movl    $11, %esi
        jmp     L477
L469:
        movb    11(%rcx), %al
        movl    $12, %esi
L477:
        movzbl  %al, %edx
        movabsq $throw_input_error, %rax
        movabsq $139695384118736, %rdi          # imm = 0x7F0D5DBF45D0
        leaq    8(%rsp), %rcx
        movq    %rsi, %r8
        callq   *%rax
        ud2

Which is almost funny in how bad it is. It's not a big deal for FASTA/SAM inputs, because it has no long strings to match.
But we can do better, literal string matching is not a difficult algorithm.

The hard part is: What about when you read from an IO? Then the buffer could run out in the middle of the string, in which case the machine should be able to match e.g. "abra", then reload the buffer, then match "cadabra". I'm not sure how to handle that.

Putting that issue aside, I propose implementing this deeply, perhaps at every level of Automa, from the RE objects where catting string/char literals should result in just a literal string, to NFAs/DFAs, where literal strings should be a single edge, to Machines.

It's a larger undertaking but in a sense straightforward, and it could be fun.

and it could be fun.

Famous last words 😆

I don't have any idea what most of this means, but it sounds awesome :-P. My main question is: would this substantially improve speed, correctness, usability, or something else that justifies the effort? I mean, if it's fun, by all means do it even if it's just an intellectual exercise to improve "elegance" without other concrete benefit.

It would improve compilation speed and speed of generated code (for string literals), and also make compound regex, nfas dfas and machines have fewer states and so be easier to plot and debug

All of that sounds great! Go forth 😛