UCDenver-ccp/CRAFT

Trouble rebuilding source text from CoNLL-U files

Closed this issue · 4 comments

We have built some code that, in principle, rebuilds the source text of the CRAFT articles from the CoNLL-U files. We have found that for some articles the character offsets toward the end of a document come out wrong.

For example, in "11897010.conllu" the word "hexamers" has begin position "14765", but when we rebuild from the CoNLL-U file we get "14763" instead. One possible cause is that the CoNLL-U file does not represent multiple spaces between tokens but assumes there is always exactly one. That assumption does not hold for the source text; see the example below.

Strictly speaking, this is not a CRAFT problem but a limitation of the CoNLL-U format. Is it something you have run into, though?

\Rune

# newpar id = 11897010-p40
# sent_id = 568
# text = We conducted database searches using BLAST  and Draft Human Genome Browser .
1	We	we	PRON	PRP	Case=Nom|Number=Plur|Person=1|PronType=Prs	2	nsubj	_	_
2	conducted	conduct	VERB	VBD	Tense=Past|VerbForm=Fin	0	root	_	_
3	database	database	NOUN	NN	Number=Sing	4	compound	_	_
4	searches	search	NOUN	NNS	Number=Plur	2	dobj	_	_
5	using	use	VERB	VBG	Aspect=Prog|Tense=Pres|VerbForm=Part	2	xcomp	_	_
6	BLAST	blast	PROPN	NNP	Number=Sing	5	dobj	_	_
7	and	and	CCONJ	CC	_	6	cc	_	_
8	Draft	draft	PROPN	NNP	Number=Sing	11	compound	_	_
9	Human	human	PROPN	NNP	Number=Sing	10	compound	_	_
10	Genome	genome	PROPN	NNP	Number=Sing	11	compound	_	_
11	Browser	browser	PROPN	NNP	Number=Sing	6	conj	_	_
12	.	.	PUNCT	.	PunctType=Peri	2	punct	_	_
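
The drift can be reproduced on this one sentence. The following is only a minimal sketch, assuming a naive rebuild that joins the FORM column with exactly one space and ignores `SpaceAfter=No`; `naive_rebuild` is a hypothetical helper name, and the token lines are abbreviated to ID and FORM with the other columns left as `_`.

```python
def naive_rebuild(conllu_lines):
    """Join the FORM column (second field) of each token line with one space."""
    forms = []
    for line in conllu_lines:
        # Skip blank lines and comment lines such as "# text = ..."
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        if not cols[0].isdigit():
            continue
        forms.append(cols[1])
    return " ".join(forms)


# The "# text" value from the example above, including the double space after BLAST
text_comment = "We conducted database searches using BLAST  and Draft Human Genome Browser ."

# Abbreviated token lines: only ID and FORM are filled in (assumption for brevity)
token_lines = [
    f"{i}\t{form}\t_\t_\t_\t_\t_\t_\t_\t_"
    for i, form in enumerate(text_comment.split(), start=1)
]

rebuilt = naive_rebuild(token_lines)
drift = len(text_comment) - len(rebuilt)
print(drift)                                          # characters lost in this sentence
print(text_comment.index("and"), rebuilt.index("and"))  # offset of "and" in each version
```

Every token after the double space is shifted by one character in the rebuilt text, and such shifts accumulate over the document.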

I have not tried rebuilding the text documents from the CoNLL-U files specifically, but I have mapped tokens from the CoNLL-U files back to the text documents in order to get the token spans. You are correct (as your example demonstrates) that there are some extra spaces in the text documents. These are artifacts of extracting the text from the original PubMed Central XML files. My guess for the example above is that there was a reference for BLAST that was removed, leading to two consecutive spaces.
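
Mapping tokens back to the text can be sketched roughly as below. This is an illustration under assumptions, not the code actually used for CRAFT: `align_tokens` is a hypothetical name, and it assumes every token form occurs verbatim in the text document, in order. If a form was normalized (for example, different quote characters), `str.find` could fail or match too far ahead, so a real implementation would need extra checks.

```python
def align_tokens(source_text, forms, start=0):
    """Return (form, begin, end) spans by searching for each form left to right."""
    spans = []
    cursor = start
    for form in forms:
        begin = source_text.find(form, cursor)
        if begin == -1:
            raise ValueError(f"token {form!r} not found after offset {cursor}")
        end = begin + len(form)
        spans.append((form, begin, end))
        # Continue searching after this token, so extra whitespace is simply skipped
        cursor = end
    return spans


# Fragment of the example sentence, including the double space after BLAST
source = "using BLAST  and Draft"
spans = align_tokens(source, ["using", "BLAST", "and", "Draft"])
for form, begin, end in spans:
    print(form, begin, end)
```

Because the offsets come from the text document itself, any number of spaces between tokens is handled without special cases.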

I found yet another issue. In at least one place in this document, the CoNLL-U file marks a new paragraph where none exists in the source:

CoNLL:
# newpar id = 11897010-p19
# sent_id = 510
# text = TMPred analysis  predicts a protein structure that is nearly identical to MCOLN1, containing 6 transmembrane domains with the N- and C-termini residing in the cytoplasm (Fig. 2) [9].

Source:
txt - “Interestingly, two MCOLN1 amino acid substitutions that result in MLIV occur at conserved amino acids. TMPred analysis  predicts …”
xml - ” result in MLIV occur at conserved amino acids. TMPred analysis <ext-link ext-link-type=“uri” xlink:href=“”

Thank you for pointing this out. I agree that there should not be a new paragraph there.
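
A check for such spurious `# newpar` markers could look roughly like this. It is a sketch that assumes paragraphs in the text documents are separated by at least one newline (an assumption, not something confirmed in this thread) and that the sentence offsets have already been aligned to the text; `is_real_paragraph_break` is a hypothetical helper.

```python
def is_real_paragraph_break(source_text, prev_end, next_begin):
    """True if the gap between two sentences in the source contains a newline."""
    return "\n" in source_text[prev_end:next_begin]


# Fragment modelled on the example above: both sentences sit in the same paragraph
source = "occur at conserved amino acids. TMPred analysis  predicts"
prev_end = source.index("acids.") + len("acids.")
next_begin = source.index("TMPred")
print(is_real_paragraph_break(source, prev_end, next_begin))

# The same two sentences separated by a line break
source_with_break = "occur at conserved amino acids.\nTMPred analysis  predicts"
print(is_real_paragraph_break(source_with_break, prev_end, next_begin))
```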

We have now created a script that converts the CoNLL-U files to UIMA CAS XMI. The script also corrects the various offset and paragraph errors, so that the begin/end values in the CAS XMI files correspond to the source text.

All relevant converted files can be downloaded from here:

https://github.com/unsiloai/CRAFT-UNSILO