Crash when trying to collate pretokenized witnesses with space or punctuation
If you have a witness that looks like this:
{
    "id": "B",
    "tokens": [
        {"t": "A"},
        {"t": "white", "adj": True},
        {"t": "mousedog bird", "adj": False}
    ]
}
that is, with a token whose text contains a space or a punctuation mark, then collate_pretokenized_json crashes like this:
Error
Traceback (most recent call last):
File "/Users/tla/Projects/collatex/collatex-pythonport/tests/test_witness_tokens.py", line 42, in testPretokenizedWitness
result = collate_pretokenized_json(pretokenized_witness)
File "/Users/tla/Projects/collatex/collatex-pythonport/collatex/core_functions.py", line 69, in collate_pretokenized_json
new_row.cells.append(tokenized_witness[token_counter])
IndexError: list index out of range
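For completeness, a minimal call along the lines of the failing test would look something like the sketch below; the {"witnesses": [...]} wrapper and the witness "A" are assumptions based on the CollateX JSON input format, not copied from the actual test.

# Sketch of a reproduction; the input structure is assumed, not taken
# from test_witness_tokens.py.
from collatex.core_functions import collate_pretokenized_json

pretokenized_witnesses = {
    "witnesses": [
        {"id": "A", "tokens": [{"t": "A"}, {"t": "white"}, {"t": "cat"}]},
        {"id": "B", "tokens": [
            {"t": "A"},
            {"t": "white", "adj": True},
            {"t": "mousedog bird", "adj": False}
        ]}
    ]
}

# Because the token "mousedog bird" contains a space, this call raises the
# IndexError shown in the traceback above.
result = collate_pretokenized_json(pretokenized_witnesses)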
This is because the pretokenized witness gets concatenated together and then re-tokenized on whitespace and punctuation. It shouldn't be doing that in the first place.
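To make that concrete, here is a small standalone illustration (not the library's code) of how joining the supplied token texts and re-splitting them on whitespace and punctuation yields more tokens than the witness declared:

import re

witness = {
    "id": "B",
    "tokens": [
        {"t": "A"},
        {"t": "white", "adj": True},
        {"t": "mousedog bird", "adj": False}
    ]
}

declared = [token["t"] for token in witness["tokens"]]
# Split the joined text back into word and punctuation tokens, roughly
# what a plain-text tokenizer does.
retokenized = re.findall(r"\w+|[^\w\s]+", " ".join(declared))

print(len(declared))     # 3: the tokens the witness actually supplied
print(len(retokenized))  # 4: "mousedog bird" has been split in two

The collation then walks over four token positions while the supplied token list only has three entries, which is presumably where the IndexError in core_functions.py comes from.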
I put this fix on a branch because it does change the API a little. Collation.add_witness now takes a dict structure that looks (coincidentally enough) like the JSON specification of a witness (whether pretokenized or not). I have added a method, Collation.add_plain_witness, that preserves the old behaviour, and I've changed the tests to use that instead. If you (@rhdekker) think the fix is fine, then you can merge it to master and close this.
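For reference, a quick sketch of how the changed API reads, based on the description above (the exact call forms are assumptions about the branch, not verified against it):

from collatex import Collation, collate

collation = Collation()

# New behaviour: add_witness accepts the dict structure directly, whether
# pretokenized or not, so the supplied tokens are used as-is.
collation.add_witness({
    "id": "B",
    "tokens": [
        {"t": "A"},
        {"t": "white", "adj": True},
        {"t": "mousedog bird", "adj": False}
    ]
})

# Old behaviour is preserved under the new name: a plain string witness is
# still tokenized on whitespace and punctuation.
collation.add_plain_witness("A", "A black cat")

result = collate(collation)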
The fix is fine; the responsibilities are better distributed over the classes in the patch. Merged into master.