Crash when trying to collate pretokenized witnesses with space or punctuation
If you have a witness that looks like this:
{
    "id": "B",
    "tokens": [
        {"t": "A"},
        {"t": "white", "adj": True},
        {"t": "mousedog bird", "adj": False}
    ]
}
that is, with a token whose text contains a space or a punctuation mark, then collate_pretokenized_json crashes like this:
Error
Traceback (most recent call last):
File "/Users/tla/Projects/collatex/collatex-pythonport/tests/test_witness_tokens.py", line 42, in testPretokenizedWitness
result = collate_pretokenized_json(pretokenized_witness)
File "/Users/tla/Projects/collatex/collatex-pythonport/collatex/core_functions.py", line 69, in collate_pretokenized_json
new_row.cells.append(tokenized_witness[token_counter])
IndexError: list index out of range
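For completeness, a minimal call along the lines of the failing test would look something like the sketch below; the {"witnesses": [...]} wrapper and the witness "A" are assumptions based on the CollateX JSON input format, not copied from the actual test.

# Sketch of a reproduction; the input structure is assumed, not taken
# from test_witness_tokens.py.
from collatex.core_functions import collate_pretokenized_json

pretokenized_witnesses = {
    "witnesses": [
        {"id": "A", "tokens": [{"t": "A"}, {"t": "white"}, {"t": "cat"}]},
        {"id": "B", "tokens": [
            {"t": "A"},
            {"t": "white", "adj": True},
            {"t": "mousedog bird", "adj": False}
        ]}
    ]
}

# Because the token "mousedog bird" contains a space, this call raises the
# IndexError shown in the traceback above.
result = collate_pretokenized_json(pretokenized_witnesses)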
This is because the pretokenized witness gets concatenated together and then re-tokenized on whitespace and punctuation. It shouldn't be doing that in the first place.
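To make that concrete, here is a small standalone illustration (not the library's code) of how joining the supplied token texts and re-splitting them on whitespace and punctuation yields more tokens than the witness declared:

import re

witness = {
    "id": "B",
    "tokens": [
        {"t": "A"},
        {"t": "white", "adj": True},
        {"t": "mousedog bird", "adj": False}
    ]
}

declared = [token["t"] for token in witness["tokens"]]
# Split the joined text back into word and punctuation tokens, roughly
# what a plain-text tokenizer does.
retokenized = re.findall(r"\w+|[^\w\s]+", " ".join(declared))

print(len(declared))     # 3: the tokens the witness actually supplied
print(len(retokenized))  # 4: "mousedog bird" has been split in two

The collation then walks over four token positions while the supplied token list only has three entries, which is presumably where the IndexError in core_functions.py comes from.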
I put this fix on a branch because it does change the API a little. Collation.add_witness now takes a dict structure that looks (coincidentally enough) like the JSON specification of a witness (whether pretokenized or not). I have added a method, Collation.add_plain_witness, that preserves the old behaviour, and I've changed the tests to use that instead. If you (@rhdekker) think the fix is fine, then you can merge it to master and close this.
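For reference, a quick sketch of how the changed API reads, based on the description above (the exact call forms are assumptions about the branch, not verified against it):

from collatex import Collation, collate

collation = Collation()

# New behaviour: add_witness accepts the dict structure directly, whether
# pretokenized or not, so the supplied tokens are used as-is.
collation.add_witness({
    "id": "B",
    "tokens": [
        {"t": "A"},
        {"t": "white", "adj": True},
        {"t": "mousedog bird", "adj": False}
    ]
})

# Old behaviour is preserved under the new name: a plain string witness is
# still tokenized on whitespace and punctuation.
collation.add_plain_witness("A", "A black cat")

result = collate(collation)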
The fix is fine; the responsibilities are better distributed over the classes in the patch. Merged into master.