not all input files processed by korp
jbothma commented
$ ls export/| wc -l
1381
(jdb)-(09:02 AM Tue Feb 05)-(~/ol_experiment_1/preprocess/korp):
$ ls src/| wc -l
1701
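The counts alone do not say which inputs were dropped. A minimal sketch for listing them, assuming each file in `export/` keeps the same base name as its source in `src/` (that naming is an assumption, and `missing_outputs` is a hypothetical helper, not part of korp):

```python
import os


def missing_outputs(src_dir, export_dir):
    """Return sorted base names present in src_dir but absent from export_dir.

    Assumes an export keeps the base name of its source file, so only
    the final extension is stripped before comparing.
    """
    def stems(directory):
        # Strip only the last extension: "x.cleantext_0057A.xml" -> "x.cleantext_0057A"
        return {os.path.splitext(name)[0] for name in os.listdir(directory)}

    return sorted(stems(src_dir) - stems(export_dir))


if __name__ == "__main__" and os.path.isdir("src") and os.path.isdir("export"):
    for stem in missing_outputs("src", "export"):
        print(stem)
```

Run from the directory that contains `src/` and `export/`, this prints one base name per unprocessed input.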
jbothma commented
INFO korp: 011998 | 01.03.14: RUN: annotate_children(text='annotations/pda23658.cleantext_0057A.TEXT', child='annotations/pda23658.cleantext_0057A.token', parent='annotations/pda23658.cleantext_0057A.sentence', out='annotations/pda23658.cleantext_0057A.children.sentence.token') uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: 011998 | 01.03.14: Read 1 chars, 2 anchors: annotations/pda23658.cleantext_0057A.TEXT uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: 011998 | 01.03.14: Read 0 items: annotations/pda23658.cleantext_0057A.sentence uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: 011998 | 01.03.14: Read 0 items: annotations/pda23658.cleantext_0057A.token uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: Traceback (most recent call last): uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: "__main__", fname, loader, pkg_name) uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: exec code in run_globals uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: File "/Users/jdb/bin/korp/annotate/python/sb/parent.py", line 90, in <module> uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: children=annotate_children) uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: File "/Users/jdb/bin/korp/annotate/python/sb/util/run.py", line 67, in main uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: fun(**options) uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: File "/Users/jdb/bin/korp/annotate/python/sb/parent.py", line 36, in annotate_children uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: parent_span, parent_id = parent_chunks.next() uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: StopIteration uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO korp: make: *** [annotations/pda23658.cleantext_0057A.children.sentence.token] Error 255 uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
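The last frame shows `parent_chunks.next()` raising `StopIteration`, and the log reports 0 items read from the sentence annotation. One possible reading, not a confirmed diagnosis, is that the parent iterator was empty. The sketch below only illustrates that Python behaviour; it is not korp's code, and `first_parent` and the `(span, id)` shape are hypothetical:

```python
def first_parent(parent_chunks):
    """Return the first (span, id) pair, or None when there are no parents.

    next() with a default value returns the default instead of raising
    StopIteration when the iterator is already exhausted.
    """
    return next(iter(parent_chunks), None)


# An empty parent annotation, comparable to "Read 0 items" in the log.
empty_parents = iter([])
try:
    next(empty_parents)  # same effect as parent_chunks.next() in Python 2
    raised = False
except StopIteration:
    raised = True

print(raised)                             # whether the bare next() raised
print(first_parent([]))                   # None for an empty annotation
print(first_parent([((0, 1), "s1")]))     # first pair for a non-empty one
```

Whether an empty parent annotation should be skipped or reported as an error is a design choice for korp; the sketch only shows how the exception in the traceback can arise.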
jbothma commented
Running korp on that file alone does not raise the same exception:
$ make TEXT
mkdir -p src/
cp -r -p original/pda23658.cleantext_0057A.xml src/pda23658.cleantext_0057A.xml
mkdir -p annotations/
python -m sb.fileid --out annotations/fileids --files "pda23658.cleantext_0057A"
002632 ________________________________________________________________________________
002632 | 18.14.47: RUN: fileid(files='pda23658.cleantext_0057A', out='annotations/fileids')
002632 | 18.14.47: Wrote 1 items: annotations/fileids
002632 | 18.14.47: Total time: 0.08 s
002632 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mkdir -p annotations/
python -m sb.xmlparser --source src/pda23658.cleantext_0057A.xml --text annotations/pda23658.cleantext_0057A.TEXT --skip "" --elements "body p" --annotations "annotations/pda23658.cleantext_0057A.body annotations/pda23658.cleantext_0057A.paragraph" --overlap "" --fileid "pda23658.cleantext_0057A" --fileids annotations/fileids --header "" --headers "" --header_annotations "" --skip_if_empty "" --skip_entities "" --autoclose ""
002640 ________________________________________________________________________________
002640 | 18.14.48: RUN: parse(elements='body p', skip_if_empty='', header='', skip='', skip_entities='', overlap='', headers='', source='src/pda23658.cleantext_0057A.xml', autoclose='', text='annotations/pda23658.cleantext_0057A.TEXT', fileids='annotations/fileids', annotations='annotations/pda23658.cleantext_0057A.body annotations/pda23658.cleantext_0057A.paragraph', header_annotations='', fileid='pda23658.cleantext_0057A')
002640 | 18.14.48: Read 1 items: annotations/fileids
002640 | 18.14.48: Wrote 1 chars, 2 anchors: annotations/pda23658.cleantext_0057A.TEXT
002640 | 18.14.48: Wrote 0 items: annotations/pda23658.cleantext_0057A.body
002640 | 18.14.48: Wrote 0 items: annotations/pda23658.cleantext_0057A.paragraph
002640 | 18.14.48: Total time: 0.00 s
002640 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The file has 1 byte (0A, line feed).
It's not clear why an exception was thrown.
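To check whether other inputs are similarly empty, a rough sketch that lists files containing nothing but whitespace bytes (`whitespace_only_files` is a hypothetical helper, and which directory to scan is an assumption):

```python
import os


def whitespace_only_files(directory):
    """Return sorted names of regular files that are empty or whitespace-only."""
    names = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            # bytes.strip() removes ASCII whitespace, including 0x0A (line feed).
            if not handle.read().strip():
                names.append(name)
    return names


if __name__ == "__main__" and os.path.isdir("annotations"):
    for name in whitespace_only_files("annotations"):
        print(name)
```

A file holding only a single line feed, like the one described above, would appear in this list.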