jbothma/ontology-learning-protege

not all input files processed by korp

$ ls export/| wc -l
1381

(jdb)-(09:02 AM Tue Feb 05)-(~/ol_experiment_1/preprocess/korp):
$ ls src/| wc -l
1701
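
To see which inputs never reached export/, the two directory listings can be compared by name. A minimal sketch, assuming each exported file keeps the base name of its source file (not confirmed here); `unprocessed` is a hypothetical helper, not part of korp:

```python
import os


def unprocessed(src_dir, export_dir):
    """Return sorted source file stems that have no counterpart in export_dir.

    Assumption: an exported file shares its stem (name without extension)
    with the source file it was produced from.
    """
    def stems(directory):
        return {os.path.splitext(name)[0] for name in os.listdir(directory)}

    return sorted(stems(src_dir) - stems(export_dir))
```

Calling `unprocessed("src", "export")` from the korp working directory would then list the missing stems. If the names do map one to one, that should be 1701 - 1381 = 320 entries.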

INFO  korp: 011998 | 01.03.14: RUN: annotate_children(text='annotations/pda23658.cleantext_0057A.TEXT', child='annotations/pda23658.cleantext_0057A.token', parent='annotations/pda23658.cleantext_0057A.sentence', out='annotations/pda23658.cleantext_0057A.children.sentence.token')  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp: 011998 | 01.03.14: Read 1 chars, 2 anchors: annotations/pda23658.cleantext_0057A.TEXT  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp: 011998 | 01.03.14: Read 0 items: annotations/pda23658.cleantext_0057A.sentence  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp: 011998 | 01.03.14: Read 0 items: annotations/pda23658.cleantext_0057A.token  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp: Traceback (most recent call last):  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:     "__main__", fname, loader, pkg_name)  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:     exec code in run_globals  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:   File "/Users/jdb/bin/korp/annotate/python/sb/parent.py", line 90, in <module>  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:     children=annotate_children)  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:   File "/Users/jdb/bin/korp/annotate/python/sb/util/run.py", line 67, in main  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:     fun(**options)  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:   File "/Users/jdb/bin/korp/annotate/python/sb/parent.py", line 36, in annotate_children  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp:     parent_span, parent_id = parent_chunks.next()  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp: StopIteration  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]
INFO  korp: make: *** [annotations/pda23658.cleantext_0057A.children.sentence.token] Error 255  uk.co.jbothma.protege.protplug.preprocess.KorpPipeline[SwingWorker-pool-1-thread-4]

Running korp on that file alone does not raise the same exception:

$ make TEXT
mkdir -p src/
cp -r -p original/pda23658.cleantext_0057A.xml src/pda23658.cleantext_0057A.xml
mkdir -p annotations/
python -m sb.fileid --out annotations/fileids --files "pda23658.cleantext_0057A"

002632 ________________________________________________________________________________
002632 | 18.14.47: RUN: fileid(files='pda23658.cleantext_0057A', out='annotations/fileids')
002632 | 18.14.47: Wrote 1 items: annotations/fileids
002632 | 18.14.47: Total time: 0.08 s
002632 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

mkdir -p annotations/
python -m sb.xmlparser --source src/pda23658.cleantext_0057A.xml --text annotations/pda23658.cleantext_0057A.TEXT --skip "" --elements "body p" --annotations "annotations/pda23658.cleantext_0057A.body annotations/pda23658.cleantext_0057A.paragraph" --overlap "" --fileid "pda23658.cleantext_0057A" --fileids annotations/fileids --header "" --headers "" --header_annotations "" --skip_if_empty "" --skip_entities "" --autoclose ""

002640 ________________________________________________________________________________
002640 | 18.14.48: RUN: parse(elements='body p', skip_if_empty='', header='', skip='', skip_entities='', overlap='', headers='', source='src/pda23658.cleantext_0057A.xml', autoclose='', text='annotations/pda23658.cleantext_0057A.TEXT', fileids='annotations/fileids', annotations='annotations/pda23658.cleantext_0057A.body annotations/pda23658.cleantext_0057A.paragraph', header_annotations='', fileid='pda23658.cleantext_0057A')
002640 | 18.14.48: Read 1 items: annotations/fileids
002640 | 18.14.48: Wrote 1 chars, 2 anchors: annotations/pda23658.cleantext_0057A.TEXT
002640 | 18.14.48: Wrote 0 items: annotations/pda23658.cleantext_0057A.body
002640 | 18.14.48: Wrote 0 items: annotations/pda23658.cleantext_0057A.paragraph
002640 | 18.14.48: Total time: 0.00 s
002640 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The file is 1 byte long (0x0A, a line feed).

It's not clear why an exception was thrown.
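
One possible reading of the traceback, offered as an assumption rather than a diagnosis: the log reports "Read 0 items" for the sentence annotation, and the failing line calls `parent_chunks.next()`. If `parent_chunks` iterates over those sentences, asking an empty iterator for its first item raises StopIteration. The `make TEXT` output above also shows only sb.fileid and sb.xmlparser running, so that run may simply not have reached annotate_children. The sketch below reproduces the behaviour with plain Python and shows one way to guard against it; `first_parent` is a hypothetical helper, not code from sb/parent.py:

```python
def first_parent(parent_chunks):
    """Return the first (span, id) pair, or None when there are no parents."""
    # next() with a default avoids StopIteration on an empty iterator.
    return next(iter(parent_chunks), None)


# Mirrors "Read 0 items: ...sentence" in the log: no parent annotations.
empty_sentences = []

try:
    next(iter(empty_sentences))  # same effect as parent_chunks.next() in Python 2
    reproduced = False
except StopIteration:
    reproduced = True  # an empty parent list fails on the very first fetch
```

With a guard like `first_parent`, a caller could skip files that have no sentences instead of letting make stop with Error 255.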