Tests broken since last update
Since the last update, the tests are broken:
------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------
11:00:07.844 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
--------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------
INFO processor.CalamariRecognize:recognize.py:81 INPUT FILE 0 / phys_0001
================================================================== short test summary info ==================================================================
FAILED test/test_recognize.py::test_recognize - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you...
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - requests.exceptions.MissingSchema: Invalid URL 'OC...
FAILED test/test_recognize.py::test_word_segmentation - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Per...
FAILED test/test_recognize.py::test_glyphs - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you me...
==================================================================== 4 failed in 16.04s =====================================================================
make: *** [Makefile:77: test] Error 1
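The MissingSchema errors in the summary above suggest that a relative path ended up being handed to an HTTP client. A minimal sketch of that distinction, using only the standard library; the helper name looks_like_remote_url is hypothetical and not part of OCR-D core or requests:

```python
from urllib.parse import urlsplit


def looks_like_remote_url(ref: str) -> bool:
    """Return True if ref carries an http(s) scheme (hypothetical helper)."""
    return urlsplit(ref).scheme in ("http", "https")


# The href from the failing tests has no scheme at all, so it only makes
# sense as a path relative to the workspace, not as something to fetch.
print(looks_like_remote_url("OCR-D-IMG/INPUT_0017.tif"))
print(looks_like_remote_url("https://example.org/INPUT_0017.tif"))
```

A reference without a scheme only resolves if the file exists at that relative path inside the workspace directory.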
Observations:
The new code from @bertsky's change in 1f0252d should download OCR-D-IMG/INPUT_0017.tif but doesn't:
% ls /tmp/test-ocrd-calamari/OCR-D-IMG
OCR-D-IMG_0001.tif OCR-D-IMG_0002.tif
The "downloaded" images' filenames are derived from the mets:file's ID:
<mets:fileGrp USE="OCR-D-IMG">
<mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
<mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
</mets:file>
<mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
<mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
</mets:file>
</mets:fileGrp>
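To make the mismatch concrete, here is a small sketch that compares the basename from each xlink:href with a basename built from the mets:file ID. It is only an illustration: the namespace declarations are added so the excerpt parses on its own, and the MIME type to extension table (EXTENSIONS) and the helper compare_names are assumptions, not how core computes names.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Excerpt from the METS, with namespace declarations added for standalone parsing.
FILEGRP = """
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink" USE="OCR-D-IMG">
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
  </mets:file>
</mets:fileGrp>
"""

# Assumption: a fixed lookup table, just enough for this example.
EXTENSIONS = {"image/tiff": ".tif"}


def compare_names(filegrp_xml: str):
    """Return (href basename, ID-derived basename) for every mets:file."""
    root = ET.fromstring(filegrp_xml)
    pairs = []
    for f in root.findall("mets:file", NS):
        href = f.find("mets:FLocat", NS).get("{http://www.w3.org/1999/xlink}href")
        from_href = PurePosixPath(href).name
        from_id = f.get("ID") + EXTENSIONS[f.get("MIMETYPE")]
        pairs.append((from_href, from_id))
    return pairs


for from_href, from_id in compare_names(FILEGRP):
    print(from_href, from_id, "MISMATCH" if from_href != from_id else "ok")
```

Both files come out as a mismatch, which matches the directory listing: the copies are named after the IDs while the METS still points at INPUT_0017.tif and INPUT_0020.tif.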
With an old(!) checkout of test/assets I did not get these failures with this new code, so this may be worth investigating.
See also #72.
I think this is caused by a change in assets: OCR-D/assets@b12e5eb, which was supposed to fix OCR-D/assets#87, but does not work.
Here is a debug log of what actually happens when copying the workspace to a temporary location:
DEBUG ocrd.resolver.workspace_from_url:resolver.py:164 workspace_from_url
mets_basename='mets.xml'
mets_url='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml'
src_baseurl='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data'
dst_dir='/tmp/test-ocrd-calamari'
DEBUG ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml' to '/tmp/test-ocrd-calamari/mets.xml'
DEBUG ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/> [_recursion_count=0]
DEBUG ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG ocrd.workspace.download_file:workspace.py:158 First run of resolver.download_to_directory(OCR-D-IMG/INPUT_0017.tif) failed, try prepending baseurl '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data': File path passed as 'url' to download_to_directory does not exist: OCR-D-IMG/INPUT_0017.tif
DEBUG ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/> [_recursion_count=1]
DEBUG ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif' to '/tmp/test-ocrd-calamari/OCR-D-IMG/OCR-D-IMG_0001.tif'
So, essentially, Resolver.workspace_from_url undoes the non-standard path names when downloading, and subsequently the @imageFilename reference does not work (again).
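The sequence in the debug log can be imitated with a few lines of plain Python. This is a sketch of the observed behaviour only, not the code in core: copy_with_id_basename and its parameters are hypothetical, and the empty file merely stands in for the real image.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_id_basename(src_base: Path, dst_dir: Path, href: str,
                          file_grp: str, file_id: str) -> Path:
    """Mimic the logged steps (hypothetical helper, not OCR-D core code)."""
    src = Path(href)
    if not src.exists():
        # Second attempt, as in the log: prepend the source workspace directory.
        src = src_base / href
    # The destination basename is built from the file ID, not from the href.
    dst = dst_dir / file_grp / (file_id + src.suffix)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dst)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    src_base, dst_dir = Path(tmp) / "src", Path(tmp) / "dst"
    (src_base / "OCR-D-IMG").mkdir(parents=True)
    (src_base / "OCR-D-IMG" / "INPUT_0017.tif").write_bytes(b"")

    copied = copy_with_id_basename(
        src_base, dst_dir, "OCR-D-IMG/INPUT_0017.tif", "OCR-D-IMG", "OCR-D-IMG_0001"
    )
    copied_rel = copied.relative_to(dst_dir).as_posix()
    # The path that the METS href (and @imageFilename) still refers to:
    href_exists = (dst_dir / "OCR-D-IMG" / "INPUT_0017.tif").exists()

print(copied_rel)
print(href_exists)
```

The copy lands under the ID-based name, so the path that the original href refers to does not exist in the destination directory.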
@kba I suppose we could fix this in assets by using standard basenames, but it looks more like a bug in core to me.
Relevant parts of test_recognize.py:
METS_KANT = assets.url_of('kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml')
WORKSPACE_DIR = '/tmp/test-ocrd-calamari'
resolver = Resolver()
workspace = resolver.workspace_from_url(METS_KANT, dst_dir=WORKSPACE_DIR)
for imgf in workspace.mets.find_files(fileGrp="OCR-D-IMG"):
    imgf = workspace.download_file(imgf)
    print(imgf)
This clones the workspace from test/assets and doesn't give the correct local filenames:
<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0001.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0001.tif]/>
<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0002, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0002.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0002.tif]/>