Special characters and punctuation in ACE chunking
Aatlantise opened this issue · 1 comment
Aatlantise commented
Hello,
I've noticed some special characters and words attached to them get omitted in ACE chunking, as seen below:
"text": "This is an experiment: how do special chars & punctuations--like ~ (tilde) or * (star)--behave in ACE? #science",
"chunk_str": "<This> <is> <an experiment> <how> <special chars & punctuations--like> <~> <*> <in> <ACE> .",
"text": "This is an experiment how do special chars and punctuations like tilde or star behave in ACE? science",
"chunk_str": "<This> <is> <an experiment> <how> <special chars and punctuations> <like> <tilde or star> <behave> <in> <ACE> <science> ."
Here, :, (, ), and # seem to be the culprits. For some reason, <do> also disappears in both examples.
Would you have a complete list of such characters? I'm trying to create some kind of preprocessing module that would strip input sentences of them.
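A minimal sketch of such a preprocessing step might look like the following. The character set is only the guess from this issue (:, (, ), #), not a confirmed list, and `strip_suspect_chars` is a hypothetical helper name, not part of ACE.

```python
# Characters suspected in this issue; an assumption, not a confirmed list.
SUSPECT_CHARS = ":()#"


def strip_suspect_chars(sentence: str, chars: str = SUSPECT_CHARS) -> str:
    """Replace each suspect character with a space, then collapse whitespace."""
    table = str.maketrans({c: " " for c in chars})
    return " ".join(sentence.translate(table).split())


if __name__ == "__main__":
    print(strip_suspect_chars("This is an experiment: ~ (tilde) or * (star) #science"))
```

Replacing with a space rather than deleting keeps words such as "(star)--behave" from being glued to their neighbours; the character set can be widened once a confirmed list is known.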
Many thanks!
wangxinyu0922 commented
Hi,
Does "chunk_str" mean the chunking output of the ACE model? Can you provide the exact input and the output file or a screenshot for the problem you encountered? I don't think the code omits these characters.
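To make such a report easier to check, here is a rough sketch that lists which whitespace-separated tokens of "text" do not reappear inside the <...> spans of "chunk_str". It assumes the angle-bracket format shown in this issue and plain whitespace tokenization, which may not match how the input was actually tokenized; `chunk_tokens` and `missing_tokens` are hypothetical helper names, not part of ACE.

```python
import re
from collections import Counter


def chunk_tokens(chunk_str: str) -> list[str]:
    """Collect the whitespace-separated tokens found inside <...> spans."""
    tokens = []
    for span in re.findall(r"<([^<>]*)>", chunk_str):
        tokens.extend(span.split())
    return tokens


def missing_tokens(text: str, chunk_str: str) -> list[str]:
    """Return tokens of `text` with no exact match left in `chunk_str`."""
    remaining = Counter(chunk_tokens(chunk_str))
    missing = []
    for tok in text.split():
        if remaining[tok] > 0:
            remaining[tok] -= 1  # consume one matching occurrence
        else:
            missing.append(tok)
    return missing
```

Because matching is exact, a token such as "ACE?" is reported as missing when the chunk contains only "ACE", so the result is a starting point for inspection rather than proof that a word was dropped.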