Add support for multi-byte transitions
Opened this issue · 3 comments
Automa produces bad code to match string literals.
The following machine: compile(re"abracadabra")
produces this code:
Machine code
.text
subq $24, %rsp
movq (%rdi), %rdx
leaq 8(%rdi), %rcx
movq %rcx, 8(%rsp)
movq %rdx, 16(%rsp)
testq %rdx, %rdx
je L267
movb (%rcx), %al
cmpb $97, %al
jne L277
cmpq $2, %rdx
jb L287
movb 1(%rcx), %al
cmpb $98, %al
jne L297
cmpq $3, %rdx
jb L307
movb 2(%rcx), %al
cmpb $114, %al
jne L314
cmpq $4, %rdx
jb L324
movb 3(%rcx), %al
cmpb $97, %al
jne L331
cmpq $5, %rdx
jb L341
movb 4(%rcx), %al
cmpb $99, %al
jne L348
cmpq $6, %rdx
jb L355
movb 5(%rcx), %al
cmpb $97, %al
jne L362
cmpq $7, %rdx
jb L369
movb 6(%rcx), %al
cmpb $100, %al
jne L376
cmpq $8, %rdx
jb L383
movb 7(%rcx), %al
cmpb $97, %al
jne L390
cmpq $9, %rdx
jb L397
movb 16(%rdi), %al
cmpb $98, %al
jne L404
cmpq $10, %rdx
jb L411
movb 9(%rcx), %al
cmpb $114, %al
jne L418
cmpq $11, %rdx
jb L425
movb 10(%rcx), %al
cmpb $97, %al
jne L462
movb $1, %al
addq $24, %rsp
retq
L267:
movl $1, %ecx
jmp L430
L277:
movl $1, %esi
jmp L477
L287:
movl $2, %ecx
jmp L430
L297:
movl $2, %esi
jmp L477
L307:
movl $3, %ecx
jmp L430
L314:
movl $3, %esi
jmp L477
L324:
movl $4, %ecx
jmp L430
L331:
movl $4, %esi
jmp L477
L341:
movl $5, %ecx
jmp L430
L348:
movl $5, %esi
jmp L477
L355:
movl $6, %ecx
jmp L430
L362:
movl $6, %esi
jmp L477
L369:
movl $7, %ecx
jmp L430
L376:
movl $7, %esi
jmp L477
L383:
movl $8, %ecx
jmp L430
L390:
movl $8, %esi
jmp L477
L397:
movl $9, %ecx
jmp L430
L404:
movl $9, %esi
jmp L477
L425:
movl $11, %ecx
L430:
movabsq $throw_input_error, %rax
movabsq $139695384118736, %rdi # imm = 0x7F0D5DBF45D0
leaq 8(%rsp), %rdx
movq %rcx, %rsi
callq *%rax
ud2
L462:
movl $11, %esi
jmp L477
L469:
movb 11(%rcx), %al
movl $12, %esi
L477:
movzbl %al, %edx
movabsq $throw_input_error, %rax
movabsq $139695384118736, %rdi # imm = 0x7F0D5DBF45D0
leaq 8(%rsp), %rcx
movq %rsi, %r8
callq *%rax
ud2
Which is almost funny in how bad it is. It's not a big deal for FASTA/SAM inputs, because it has no long strings to match.
But we can do better, literal string matching is not a difficult algorithm.
The hard part is: What about when you read from an IO? Then the buffer could run out in the middle of the string, in which case the machine should be able to match e.g. "abra", then reload the buffer, then match "cadabra". I'm not sure how to handle that.
Putting that issue aside, I propose implementing this deeply, perhaps at every level of Automa, from the RE objects where catting string/char literals should result in just a literal string, to NFAs/DFAs, where literal strings should be a single edge, to Machines.
It's a larger undertaking but in a sense straightforward, and it could be fun.
and it could be fun.
Famous last words 😆
I don't have any idea what most of this means, but it sounds awesome :-P. My main question is: would this substantially improve speed, correctness, usability, or something else that justifies the effort? I mean, if it's fun, by all means do it even if it's just an intellectual exercise to improve "elegance" without other concrete benefit.
It would improve compilation speed and speed of generated code (for string literals), and also make compound regex, nfas dfas and machines have fewer states and so be easier to plot and debug
All of that sounds great! Go forth 😛