interedition/collatex

Crash when trying to collate pretokenized witnesses with space or punctuation

Closed this issue · 2 comments

tla commented

If you have a witness that looks like this:

    {
        "id": "B",
        "tokens": [
            {"t": "A"},
            {"t": "white", "adj": True},
            {"t": "mousedog bird", "adj": False}
        ]
    }

with a token that has either a space or a punctuation mark, then collate_pretokenized_json crashes like this:

Error
Traceback (most recent call last):
  File "/Users/tla/Projects/collatex/collatex-pythonport/tests/test_witness_tokens.py", line 42, in testPretokenizedWitness
    result = collate_pretokenized_json(pretokenized_witness)
  File "/Users/tla/Projects/collatex/collatex-pythonport/collatex/core_functions.py", line 69, in collate_pretokenized_json
    new_row.cells.append(tokenized_witness[token_counter])
IndexError: list index out of range

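This is because the tokens of the pretokenized witness get concatenated into a single string, which is then re-tokenized on whitespace and punctuation. A token that itself contains a space or punctuation mark is split into several tokens, so the re-tokenized list ends up longer than the original token list and the index runs off the end. It shouldn't be re-tokenizing in the first place.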
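To illustrate the mismatch, here is a minimal sketch that reproduces the same kind of IndexError outside CollateX. The `retokenize` and `map_back` helpers and the regex are assumptions made up for this example; they are not the actual code in core_functions.py, only a stand-in for "join, then split on whitespace and punctuation".

```python
import re


def retokenize(text):
    # Hypothetical stand-in tokenizer: runs of word characters,
    # or single punctuation marks; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)


def map_back(pretokenized, retokenized):
    # Naive mapping that assumes both lists have the same length.
    return [pretokenized[i] for i in range(len(retokenized))]


pretokenized = [
    {"t": "A"},
    {"t": "white", "adj": True},
    {"t": "mousedog bird", "adj": False},
]

# Concatenate the token strings, then split them again.
joined = " ".join(token["t"] for token in pretokenized)
retokenized = retokenize(joined)

print(len(pretokenized))  # 3
print(retokenized)        # ['A', 'white', 'mousedog', 'bird']

try:
    map_back(pretokenized, retokenized)
except IndexError as err:
    print("IndexError:", err)
```

Three tokens go in and four come out, so any loop that walks the re-tokenized list while indexing into the original one fails on the last step.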

tla commented

I put this fix on a branch because it does change the API a little. Collation.add_witness now takes a dict structure that looks (coincidentally enough) like the JSON specification of a witness (whether pretokenized or not). I have added a method, Collation.add_plain_witness, that preserves the old behaviour, and I've changed the tests to use that instead. If you (@rhdekker) think the fix is fine, then you can merge it to master and close this.
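Roughly, the shape of the change could be sketched like this. This is not the code on the branch: `SketchCollation` is a toy class invented for illustration, the `"content"` key for plain-text witnesses and the `(sigil, content)` arguments of the plain method are assumptions, and the regex tokenizer is only a placeholder.

```python
import re


class SketchCollation:
    """Toy stand-in for a collation object (hypothetical, not CollateX's class)."""

    def __init__(self):
        self.witnesses = []

    def add_witness(self, witness):
        # Takes a dict shaped like a JSON witness.
        if "tokens" in witness:
            # Pretokenized: keep the tokens exactly as supplied, no re-splitting.
            tokens = [dict(token) for token in witness["tokens"]]
        else:
            # Plain text (assumed "content" key): tokenize here, once.
            words = re.findall(r"\w+|[^\w\s]", witness["content"])
            tokens = [{"t": word} for word in words]
        self.witnesses.append({"id": witness["id"], "tokens": tokens})

    def add_plain_witness(self, sigil, content):
        # Old-style call with an identifier and a string, kept for convenience.
        self.add_witness({"id": sigil, "content": content})


collation = SketchCollation()
collation.add_plain_witness("A", "A white cat")
collation.add_witness({
    "id": "B",
    "tokens": [
        {"t": "A"},
        {"t": "white", "adj": True},
        {"t": "mousedog bird", "adj": False},
    ],
})

for witness in collation.witnesses:
    print(witness["id"], [token["t"] for token in witness["tokens"]])
```

The point of the sketch is that the pretokenized path never touches a tokenizer, so "mousedog bird" stays a single token along with its extra properties.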

Fix is fine, the responsibilities are better distributed over the classes in the patch. Merged into master.