termsuite/termsuite-core

TeiCollectionReader does not accept .tei files

Closed this issue · 4 comments

As stated on line 77 of termsuite-core/src/main/java/eu/project/ttc/readers/TeiCollectionReader.java, the FilenameFilter only accepts files with a .xml extension, while the documentation clearly shows TEI files with a .tei extension.

dcram commented

You are right, thank you! Would you suggest having no file extension constraint on TeiCollectionReader? Or maybe *.(tei|xml)? Or setting the constraint to *.tei to match the documentation?

I'd suggest accepting both .tei and .xml, since naming TEI files .xml makes sense. Maybe, when no files are matched, output a warning such as "No file found with extension {extension} in input directory."
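
Something along these lines is what I have in mind. It is only a rough sketch, not the actual TeiCollectionReader code: the class name, the field names and the listTeiFiles helper are made up for illustration, and matching the extension case-insensitively is my own assumption.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of termsuite-core.
class TeiFilenameFilterSketch {

    // Extensions proposed in this thread (assumption: compared case-insensitively).
    static final List<String> EXTENSIONS = Arrays.asList(".tei", ".xml");

    // Accepts any file name ending with one of the extensions above.
    static final FilenameFilter TEI_OR_XML = new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (String ext : EXTENSIONS) {
                if (lower.endsWith(ext)) {
                    return true;
                }
            }
            return false;
        }
    };

    // Lists matching files and warns when nothing matches.
    // File.listFiles returns null if the path is not a readable directory,
    // so that case is treated the same as "no match" here.
    static File[] listTeiFiles(File inputDir) {
        File[] files = inputDir.listFiles(TEI_OR_XML);
        if (files == null || files.length == 0) {
            System.err.println("No file found with extension " + EXTENSIONS
                    + " in input directory " + inputDir);
            return new File[0];
        }
        return files;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listTeiFiles(dir)) {
            System.out.println(f.getName());
        }
    }
}
```

Keeping the extensions in one list means the warning message and the filter cannot drift apart if the constraint changes later.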

dcram commented

Another comment: TEI support is currently broken in TermSuite (see issue #24), since I can't find any event-based XML parser (SAX does not fit) that gives the right begin/end offsets.

Fair enough, I'll stick to TXT for now then, I suppose.