Regex generation does not work
aw632 opened this issue · 1 comments
Describe the issue as clearly as possible:
Regex generation with certain regex strings will produce strings that don't match the regex.
I have this regex string:
^(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)$
Since interegular has implicit anchoring, I use this instead:
(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)
However, with this, I am getting outputs that don't match the original regex (with anchoring). Instead, the outputs are consistent with the regex without anchoring. See this test online, and try to remove/add the ^ and $: https://regex101.com/r/EZIPmV/2. For instance, I'm getting outputs like
[TOOL_CALLS] {}, {"name": "
The interegular maintainer confirmed that the anchoring is implicit in that dependency, so I've narrowed it down to just outlines itself having this issue.
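To make the anchored/unanchored difference concrete, here is a minimal sketch using only Python's standard `re` module (not outlines or interegular). It assumes `re.fullmatch` is an acceptable stand-in for wrapping the pattern in `^...$`; the variable names are illustrative only.

```python
import re

# Pattern from this issue, without explicit anchors.
PATTERN = r'(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)'

# One of the unexpected outputs reported above.
output = '[TOOL_CALLS] {}, {"name": "'

# re.match only anchors at the start, so the first alternative
# is allowed to match a prefix of the output.
prefix_match = re.match(PATTERN, output)

# re.fullmatch requires the whole string to match, which is the
# behaviour expected from implicit anchoring.
full_match = re.fullmatch(PATTERN, output)

print(prefix_match.span())  # (0, 15)
print(full_match)           # None
```

Under full-string matching the reported output is rejected, while `[TOOL_CALLS] {}` on its own is accepted.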
Note: generate will strip out the special tokens like [TOOL_CALLS], but you can see them if you modify generate or if you use a different inference engine like vLLM (I was able to produce the same issues there as well).
Steps/code to reproduce the bug:
from outlines import models, generate
model = models.transformers("mistralai/Mistral-Nemo-Instruct-2407")
generator = generate.regex(
    model,
    r'(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)',
)
prompt = """[AVAILABLE_TOOLS][{"type": "function", "function": {"name": "add", "description": "add two numbers", "parameters": {"type": "object", "properties": {"a": {"description": "First number", "type": "integer"}, "b": {"description": "Second number", "type": "integer"}}, "required": ["a", "b"]}}}, {"type": "function", "function": {"name": "multiply", "description": "multiply two numbers", "parameters": {"type": "object", "properties": {"a": {"description": "First number", "type": "integer"}, "b": {"description": "Second number", "type": "integer"}}, "required": ["a", "b"]}}}][/AVAILABLE_TOOLS][INST]What is 5 + 9?[/INST]"""
answer = generator(prompt, max_tokens=300)
print(f"{answer=}")
Expected result:
[TOOL_CALLS] {}
Error message:
No response
Outlines/Python version information:
Version information
Context for the issue:
No response
The FSM produced by interegular cannot produce any complete strings. This is likely caused by interegular's incomplete negative lookaround implementation:
>>> import interegular
>>> pattern = r"(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)"
>>> fsm = interegular.parse_pattern(pattern).to_fsm()
>>> ["".join(s) for s in fsm.strings(100)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py", line 684, in strings
    raise ValueError(f"Couldn't find an example within {max_iterations} iterations")
ValueError: Couldn't find an example within 100 iterations
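The pattern matches with re, but not with interegular. Note that re.match anchors only at the start of the string, so the second re.match result below covers just the first 15 characters, whereas fsm.accepts tests the whole string: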
>>> import re
>>> re.match(pattern, "[TOOL_CALLS] {}")
<re.Match object; span=(0, 15), match='[TOOL_CALLS] {}'>
>>> fsm.accepts("[TOOL_CALLS] {}")
False
>>> re.match(pattern, '[TOOL_CALLS] {}, {"name": "toolname"}')
<re.Match object; span=(0, 15), match='[TOOL_CALLS] {}'>
>>> fsm.accepts('[TOOL_CALLS] {}, {"name": "toolname"}')
False
You might consider a simpler pattern. Please let me know if you have any other questions.
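One possible direction for a simpler pattern, assuming the negative lookahead is indeed what interegular mishandles, is to spell out "does not start with `[TOOL_CALLS]`" without any lookaround. The sketch below builds such a pattern and checks it against the original using Python's `re.fullmatch` only. It has not been run against interegular or outlines, and `PREFIX` and `not_starting_with` are hypothetical names introduced for illustration.

```python
import re

PREFIX = "[TOOL_CALLS]"  # hypothetical constant for this sketch


def not_starting_with(prefix: str) -> str:
    """Build a lookahead-free regex for single-line strings that do not start with `prefix`."""
    alternatives = []
    for i, ch in enumerate(prefix):
        head = re.escape(prefix[:i])
        # The string ends after a proper prefix of `prefix` (including the empty string) ...
        alternatives.append(head)
        # ... or it deviates from `prefix` at position i and continues freely.
        # "\n" is excluded so the class behaves like "." in the original pattern.
        alternatives.append(f"{head}[^{re.escape(ch)}\\n].*")
    return "(?:" + "|".join(alternatives) + ")"


original = r"(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)"
rewritten = r"(\[TOOL_CALLS\] \{\}|" + not_starting_with(PREFIX) + ")"

samples = [
    "[TOOL_CALLS] {}",
    '[TOOL_CALLS] {}, {"name": "',
    "The answer is 14.",
    "",
    "[TOOL",
    "[TOOL_CALLS]",
]
for s in samples:
    # Both patterns should agree under full-string matching.
    agree = bool(re.fullmatch(original, s)) == bool(re.fullmatch(rewritten, s))
    print(repr(s), bool(re.fullmatch(rewritten, s)), agree)
```

The rewritten pattern is longer but uses only literals, character classes, alternation, and `.*`, which are constructs a regex-to-FSM conversion typically supports. Whether it resolves the generation problem in outlines would still need to be verified.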