jakpra/treeconstructive-supertagging

Unknown Error in reformat.py


When running reformat.py I get the error below. It occurs on the sentence "This program of <<News Night Banquet>> on Jiangsu TV 's city channel , caused a great uproar in the community ." I'm guessing it is a formatting problem, but I don't know what the clean format should look like.

Traceback (most recent call last):
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 194, in read_node
ValueError: not enough values to unpack (expected 5, got 4)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "scripts/reformat.py", line 49, in <module>
    main(args)
  File "scripts/reformat.py", line 28, in main
    for d in ds:
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 428, in __next__
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 444, in next
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 416, in next
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 222, in read
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 211, in rec_read
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 207, in rec_read
  File "/home/ajb341/anaconda3/lib/python3.7/site-packages/TreeST-1.0-py3.7.egg/tree_st/util/reader.py", line 196, in read_node
ValueError: (('not enough values to unpack (expected 5, got 4)',), 1, 16, ' N X X Banquet')

I'm pretty sure the problem is caused by the angle brackets. If you remove them or replace them with a placeholder, that should resolve it. Thanks for finding this - I either have to mention in the README that certain characters should not occur in the text, or add preprocessing code that deals with them.

Is any use of '<' or '>' forbidden? Are there any other special characters? Will html tags like this also cause a problem? <a_href="http://mediamatters.org/research/200809020015" > ?

Is any use of '<' or '>' forbidden?
Will html tags like this also cause a problem?

In the current version of the code, yes, because '<' and '>' are part of the CCGbank-specific format that the initial output file is written in. That is not a very compelling reason, of course: I could either allow alternate output formats or escape and unescape these characters in the pre- and postprocessing code, so that users don't have to deal with it.
However, there is another consideration: do we expect the model to be well-equipped to assign the proper CCG tags to tokens such as "<<News" and HTML tags? I think the answer is no (at least not for the model trained on CCGbank). "<<News" should really be two tokens, where the first one, "<<", is equivalent to (French) double quotes. This could be normalized in dataset-specific preprocessing, as could HTML tags (e.g., replaced with a dummy token "-HTML-", or (temporarily) removed entirely; I'd expect these are metadata that shouldn't really be part of the parse anyway).
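The dataset-specific normalization described above could look roughly like the sketch below. It is not part of the repository: the function name `normalize_sentence`, the choice of PTB-style quote tokens (`` and '') for "<<" and ">>", and the regular expression used to recognize HTML tags are all assumptions made for illustration. Only the "-HTML-" dummy token comes from the discussion.

```python
import re

# Dummy token suggested in the thread for HTML tags.
HTML_TOKEN = "-HTML-"

# Assumption: an HTML tag starts with '<' followed by a letter, '/' or '!',
# and runs to the next '>' without any nested angle brackets.
HTML_TAG = re.compile(r"<[A-Za-z/!][^<>]*>")


def normalize_sentence(sentence: str) -> str:
    """Hypothetical preprocessing step: remove '<' and '>' from a sentence.

    Doubled angle brackets are treated as quotation marks and split off as
    separate tokens; HTML tags are replaced by a single dummy token.
    """
    # Handle "<<" and ">>" first, so "<News Night Banquet>" inside the
    # doubled brackets is not mistaken for an HTML tag.
    # Assumption: map them to PTB-style opening and closing quote tokens.
    sentence = sentence.replace("<<", " `` ").replace(">>", " '' ")
    sentence = HTML_TAG.sub(f" {HTML_TOKEN} ", sentence)
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(sentence.split())


if __name__ == "__main__":
    print(normalize_sentence(
        "This program of <<News Night Banquet>> on Jiangsu TV 's city channel ,"
    ))
    print(normalize_sentence(
        'see <a_href="http://mediamatters.org/research/200809020015" > ?'
    ))
```

Any stray single '<' or '>' that is neither part of a doubled bracket nor part of a tag would survive this step and would still need to be removed or escaped separately.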

Are there any other special characters?

Not that I'm aware of right now, but it's not impossible. Everything that occurs in CCGbank (parentheses etc.) should be properly dealt with, and as mentioned above, any non-standard characters that don't occur in CCGbank are going to be challenging for the model anyways.

If this continues to be a major issue, though, let me know, and I'm sure we can find a workaround.

Thanks. It looks like '<' and '>' are the only characters that cause a problem. Any preprocessing code can just replace these with placeholders.
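For anyone who runs into the same error, here is a minimal sketch of such a placeholder replacement. It is a workaround applied to the input text before running reformat.py, not code from the repository. The placeholder tokens "-LAB-" and "-RAB-" are hypothetical names modeled on the PTB-style "-LRB-"/"-RRB-" convention for parentheses, and the function names are made up for this example.

```python
# Hypothetical placeholder tokens for the two problematic characters.
ESCAPES = {"<": "-LAB-", ">": "-RAB-"}
UNESCAPES = {placeholder: char for char, placeholder in ESCAPES.items()}


def escape_angle_brackets(text: str) -> str:
    """Replace every '<' and '>' with its placeholder token."""
    for char, placeholder in ESCAPES.items():
        text = text.replace(char, placeholder)
    return text


def unescape_angle_brackets(text: str) -> str:
    """Restore the original characters after processing."""
    for placeholder, char in UNESCAPES.items():
        text = text.replace(placeholder, char)
    return text


if __name__ == "__main__":
    original = "This program of <<News Night Banquet>> on Jiangsu TV 's city channel ,"
    escaped = escape_angle_brackets(original)
    print(escaped)
    # The round trip is lossless as long as the input does not already
    # contain the placeholder strings.
    print(unescape_angle_brackets(escaped) == original)
```

This only keeps the reader from failing; as noted earlier in the thread, the model may still tag such tokens poorly, since they do not occur in CCGbank.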