Got exception using ocrd_detectron2 with ocrd_all Release v2022-12-01
I got the following exception using ocrd-detectron2-segment. Please clarify (I can provide the workspace if needed):
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json
13:14:28.481 INFO processor.Detectron2Segment - Using compute device cpu
13:14:28.482 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml'
Traceback (most recent call last):
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
sys.exit(ocrd_detectron2_segment())
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
parameter=parameter
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 91, in __init__
self.setup()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 116, in setup
cfg.merge_from_file(temp_config)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 61, in load_yaml_with_base
cfg = yaml.safe_load(f)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 125, in safe_load
return load(stream, SafeLoader)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 79, in load
loader = Loader(stream)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/loader.py", line 34, in __init__
Reader.__init__(self, stream)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 85, in __init__
self.determine_encoding()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 124, in determine_encoding
self.update_raw()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 178, in update_raw
data = self.stream.read(size)
File "/usr/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 10: invalid start byte
Can you please show the contents of your model file /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml, and describe how you got (downloaded) it?
It is a VERY BIG file:
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ll /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
-rw-rw-r-- 1 ocrdadmin ocrdadmin 783884362 Dec 2 10:32 /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
And I did not download it in advance (I thought this is done automatically when I use -p).
So maybe I should run the following?
ocrd resmgr download ocrd-detectron2-segment DocBank_X101.yaml
ocrd resmgr download ocrd-detectron2-segment DocBank_X101.pth
Also, the output of ocrd-detectron2-segment -L now looks a bit strange (I would expect only JSON files, but I also see pth/yaml files?!):
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -L
/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.pth
/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_Math_R50.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R101.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R101_JPLeoRX.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50_JPLeoRX.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_X101.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_TableBank_X152.json
And, I have not downloaded in advance (I have thought, that this is done automatically, when I use -p).
No, we have a PR for that OCR-D/core#799 but it got delayed because it is difficult to test.
To me it looks like the data got corrupted during download.
Try
ocrd resmgr download --overwrite ocrd-detectron2-segment DocBank_X101.yaml
Also, the output of ocrd-detectron2-segment -L now looks a bit strange
No, that one seems correct.
I also believe some earlier download attempt must have been corrupted.
That did not help: it is still a VERY BIG file.
I assume this depends on the ZIP source file.
See here:
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials$ ocrd resmgr download --overwrite ocrd-detectron2-segment DocBank_X101.yaml
14:30:19.029 INFO ocrd.cli.resmgr - Downloading registered resource 'DocBank_X101.yaml' (https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip)
[------------------------------------] 0%14:30:22.528 INFO ocrd.resource_manager._download_impl - Downloading https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip to /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
[####################################] 100%
14:36:41.449 INFO ocrd.cli.resmgr - Installed resource https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip under /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
14:36:41.449 INFO ocrd.cli.resmgr - Use in parameters as 'DocBank_X101.yaml'
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials$ ll /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
-rw-rw-r-- 1 ocrdadmin ocrdadmin 783884362 Dec 5 14:36 /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
Has not helped - still VERY BIG file.
That is to be expected, it's a ZIP file containing a huge neural network (model itself is 797 MiB).
But does it work now with the processor, i.e. has the bitflip been corrected by redownloading?
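A quick way to tell whether the installed DocBank_X101.yaml is really a YAML text file or still the undecompressed ZIP download is to sniff its content. The following is a minimal sketch using only the Python standard library; the helper name sniff_resource and its probe_size parameter are made up for this illustration and are not part of ocrd or resmgr.

```python
import codecs
import zipfile


def sniff_resource(path, probe_size=1024):
    """Guess whether a downloaded resource is a ZIP archive, UTF-8 text or other binary data.

    Hypothetical helper for illustration only (not an ocrd/resmgr function).
    """
    # zipfile.is_zipfile() checks for the ZIP end-of-central-directory record
    if zipfile.is_zipfile(path):
        return "zip"
    with open(path, "rb") as f:
        head = f.read(probe_size)
    try:
        # incremental decoding tolerates a multi-byte character cut off at the probe boundary
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "binary"
    return "text"


# Usage (path taken from the log above):
# sniff_resource("/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml")
```

If this reports "zip" for a file with a .yaml suffix, the download was stored under the resource name without being extracted, which would explain both the file size and the UnicodeDecodeError from the YAML loader.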
Sorry, forgot to mention: I still get the same Exception:
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json
14:51:05.690 INFO processor.Detectron2Segment - Using compute device cpu
14:51:05.690 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml'
Traceback (most recent call last):
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
sys.exit(ocrd_detectron2_segment())
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
parameter=parameter
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 91, in __init__
self.setup()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 116, in setup
cfg.merge_from_file(temp_config)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 61, in load_yaml_with_base
cfg = yaml.safe_load(f)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 125, in safe_load
return load(stream, SafeLoader)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 79, in load
loader = Loader(stream)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/loader.py", line 34, in __init__
Reader.__init__(self, stream)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 85, in __init__
self.determine_encoding()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 124, in determine_encoding
self.update_raw()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 178, in update_raw
data = self.stream.read(size)
File "/usr/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 10: invalid start byte
But does it work now with the processor, i.e. has the bitflip been corrected by redownloading?
This still seems strange to me, as I would expect to get the unzipped YAML file (which should be a very small text file).
Now I have used ocrd-detectron2-segment with resources which are NOT in a ZIP file, and I got a different exception:
ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.js
Traceback (most recent call last):
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
sys.exit(ocrd_detectron2_segment())
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1054, in main
with self.make_context(prog_name, args, **extra) as ctx:
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 920, in make_context
self.parse_args(ctx, args)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1378, in parse_args
value, args = param.handle_parse_result(ctx, opts, args)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 2360, in handle_parse_result
value = self.process_value(ctx, value)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 2322, in process_value
value = self.callback(ctx, self, value)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/parameter_option.py", line 8, in _handle_param_option
return parse_json_string_or_file(*list(value))
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_utils/str.py", line 179, in parse_json_string_or_file
raise err # pylint: disable=raising-bad-type
ValueError: Error parsing '/home/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.js': Expecting value: line 1 column 1 (char 0)
@kba, @bertsky: If you like, we can do a video call where I can show this directly ...
ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.js
You misspelled it: json, not js.
oops :-(
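The slightly puzzling message "Expecting value: line 1 column 1 (char 0)" for a mistyped path can be reproduced with the standard json module. The sketch below assumes that the -p handler first tries to read its argument as a file and, if that fails, parses the argument itself as a JSON string (the function name parse_json_string_or_file in the traceback suggests this, but the code here is a simplified stand-in, not the ocrd_utils implementation).

```python
import json


def parse_json_string_or_file_sketch(value):
    """Simplified stand-in: read `value` as a file if possible, else treat it as JSON text.

    Assumption for illustration; the real ocrd_utils function may differ in detail.
    """
    try:
        with open(value, "r") as f:
            text = f.read()
    except OSError:
        # a non-existing path (e.g. '.js' instead of '.json') ends up here
        text = value
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        raise ValueError("Error parsing '%s': %s" % (value, e))


# Usage:
# parse_json_string_or_file_sketch('{"device": "cpu"}')      # parsed as a JSON string
# parse_json_string_or_file_sketch("/no/such/preset.js")     # raises ValueError as in the log
```

Under that assumption, a wrong path is never reported as "file not found": the path string itself is fed to the JSON parser, which fails on its very first character.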
But does it work now with the processor, i.e. has the bitflip been corrected by redownloading?
This still seems strange to me, as I would expect to get the unzipped YAML file (which should be a very small text file).
Indeed, it should. Trying to reproduce with the most recent version of ocrd_detectron2 (or did you say the most recent version of ocrd_all?)...
Most recent version of ocrd_all (NOT your new one of ocrd-detectron2-segment).
Sorry, next try - but still not working:
ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.json
15:12:37.763 INFO processor.Detectron2Segment - Using compute device cpu
15:12:37.763 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/NewspaperNavigator_R_50_PFPN_3x.yaml'
Traceback (most recent call last):
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
sys.exit(ocrd_detectron2_segment())
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
parameter=parameter
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 91, in __init__
self.setup()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 116, in setup
cfg.merge_from_file(temp_config)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 61, in load_yaml_with_base
cfg = yaml.safe_load(f)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 125, in safe_load
return load(stream, SafeLoader)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 81, in load
return loader.get_single_data()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/constructor.py", line 49, in get_single_data
node = self.get_single_node()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/composer.py", line 58, in compose_document
self.get_event()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/parser.py", line 118, in get_event
self.current_event = self.state()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/parser.py", line 193, in parse_document_end
token = self.peek_token()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/scanner.py", line 129, in peek_token
self.fetch_more_tokens()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/scanner.py", line 223, in fetch_more_tokens
return self.fetch_value()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/scanner.py", line 579, in fetch_value
self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
in "/tmp/tmp_jjbvt9h/configs/NewspaperNavigator_R_50_PFPN_3x.yaml", line 19, column 28
I can confirm the extraction of the zip-file does not work with resmgr in core v2.43. There's the correct path_in_archive setting, but nothing gets extracted, the file just gets renamed. @kba perhaps my assumption that non-empty path_in_archive would imply type=archive does not hold?
in "/tmp/tmp_jjbvt9h/configs/NewspaperNavigator_R_50_PFPN_3x.yaml", line 19, column 28
sry, the URL does not work with wget. It seems Dropbox forces you to interact with the download button, which yields a temporary download link. Too bad. What should we do?
I will try out /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50.json instead ...
My bad. The problem was that I misspelled the URL in the tool JSON (& instead of ? for the URL args).
Fixed on master.
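To illustrate why a single "&" in place of "?" breaks such a download link: without a "?", everything after the host is treated as the path, so the server never sees the intended query arguments. This is a minimal sketch with the standard library; the example.org URLs and the dl=1 argument are invented for the illustration and are not the actual URL from the tool JSON.

```python
from urllib.parse import parse_qs, urlsplit


def query_args(url):
    """Return the query arguments of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


# Hypothetical URLs, only differing in '?' vs '&' before the argument
good = "https://example.org/s/abc123/config.yaml?dl=1"
bad = "https://example.org/s/abc123/config.yaml&dl=1"

print(query_args(good))     # {'dl': ['1']}
print(query_args(bad))      # {}
print(urlsplit(bad).path)   # /s/abc123/config.yaml&dl=1
```

With the misspelled variant the argument becomes part of the file name, so a file-sharing service may answer with an HTML page instead of the raw file, and that page then fails to parse as YAML.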
@stefanCCS can you please try again (both examples) after updating (in the usual way, i.e. git pull of the submodule, then remake ocrd-detectron2-segment in the main module)?
With ocrd-detectron2-segment I get:
...
15:42:19.227 INFO ocrd.cli.resmgr - Use in parameters as 'PubLayNet_R_50_FPN_3x_JPLeoRX.pth'
15:42:22.635 INFO processor.Detectron2Segment - Using compute device cpu
15:42:22.636 ERROR ocrd.ocrd-detectron2-segment.resolve_resource - Could not find resource 'PubLayNet_R_50_FPN_3x.yaml' for ...
This might be related to the following. From ocrd resmgr list-available -e ocrd-detectron2-segment I get:
- PubLayNet_R_50_FPN_3x_JPLeoRX.yaml (https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml)
PubLayNet via JPLeoRX R50-FPN config
- PubLayNet_R_50_FPN_3x_JPLeoRX.pth (https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth)
PubLayNet via JPLeoRX R50-FPN weights
...
These might not be exactly the same names as in the JSON preset presets_PubLayNet_R50.json?
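One way to check for such a name mismatch is to compare the file names a preset refers to with what is actually installed in the resource directory. The sketch below is an assumption-laden illustration: it supposes that the preset is a flat JSON object whose model_config and model_weights entries hold resource names, and the helper missing_resources is hypothetical, not part of ocrd.

```python
import json
import os


def missing_resources(preset_path, resource_dir, keys=("model_config", "model_weights")):
    """List resource names referenced by a preset JSON that are not present in resource_dir.

    Hypothetical helper; the key names are an assumption about the preset layout.
    """
    with open(preset_path, "r") as f:
        preset = json.load(f)
    names = [preset[key] for key in keys if key in preset]
    return [name for name in names
            if not os.path.exists(os.path.join(resource_dir, name))]


# Usage (paths taken from the logs above):
# missing_resources(
#     "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50.json",
#     "/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment")
```

Any name returned here would have to be downloaded with ocrd resmgr download under exactly that name before the preset can work.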
@kba perhaps my assumption that non-empty path_in_archive would imply type=archive does not hold?
No, it does not; the type defaults to file. This assumption had not occurred to me. It might be possible to hack that in, but I think being explicit about the type is better in any case.
Yes, and it was ill-conceived to begin with. Just because the URL is a zip-file does not mean the resource itself must be. On the contrary, path_in_archive is simply one level deeper, so theoretically it could be an archive within an archive.
But that changes the question to: why did resmgr download not extract the file in the first place??
Or am I getting even more confused now? What should resmgr care if a file is an archive or not, except for the purpose of extracting it at install-time?
But that changes the question to: why did resmgr download not extract the file in the first place??
Because it did not know that it was an archive, so downloaded it and was done for the day.
Or am I getting even more confused now? What should resmgr care if a file is an archive or not, except for the purpose of extracting it at install-time?
The type attribute is semantically imprecise. The file vs. directory distinction is relevant for listing the resources on the disk; the file/directory vs. archive distinction is relevant for installation. It would have been better to distinguish "source type" (what is it we're downloading/copying) and "target type" (how should it be stored and listed).
ok, thanks for clarification! So it is correct now (on master).
Alas:
[------------------------------------] 0%16:57:25.801 INFO ocrd.resource_manager._download_impl - Downloading https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip to download.tar.xx
[####################################] 100%17:01:19.093 INFO ocrd.resource_manager.download - Extracting archive to /tmp/tmpnm6kkfv9/out
...
tarfile.ReadError: file could not be opened successfully
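The tarfile.ReadError above is what Python's tarfile.open raises when the file is not a tar archive, as is the case for a ZIP download. As a rough illustration of an extraction step that copes with both formats, here is a minimal sketch using only the standard library. It is not resmgr's implementation; the function install_from_archive and the member name in the usage comment are hypothetical.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path


def install_from_archive(archive, path_in_archive, target):
    """Copy a single member of a tar or ZIP archive to `target`.

    Hypothetical helper for illustration; not the resmgr code.
    """
    archive, target = str(archive), Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    if tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            src = tf.extractfile(path_in_archive)  # KeyError if the member is missing
            if src is None:
                raise ValueError("not a regular file: %s" % path_in_archive)
            with src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
    elif zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            with zf.open(path_in_archive) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
    else:
        raise ValueError("neither a tar nor a ZIP archive: %s" % archive)
    return target


# Usage (member name is made up):
# install_from_archive("X101.zip", "X101/X101.yaml", "DocBank_X101.yaml")
```

Detecting the format from the file content rather than from the URL suffix or a fixed temporary name avoids handing a ZIP file to the tar reader.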
Also ambiguous: the size parameter. It is not clear whether this applies to the (extracted) file or the (zipped) download. The implemented progress bar seems to indicate the latter, but the documentation just says "size of the resource in bytes".
Downloading https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip to download.tar.xx
@kba it seems that zip files have never been supported in resmgr to date. Should I open an issue?
Yeah, that's why the type was originally tarball. I have opened OCR-D/core#963 for this.
Also ambiguous: the size parameter. It is not clear whether this applies to the (extracted) file or the (zipped) download. The implemented progress bar seems to indicate the latter, but the documentation just says "size of the resource in bytes".
It's only used for the download bar. OCR-D/spec#233
After updating the detectron2 module, I could run ocrd-detectron2-segment without any errors using the model "TableBank_X152".
I still have trouble using the model "PubLayNet_R_50_FPN_3x_JPLeoRX", where I get the following error (a name mismatch):
15:15:44.203 INFO ocrd.cli.resmgr - Downloading registered resource 'PubLayNet_R_50_FPN_3x_JPLeoRX.yaml' (https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml)
15:15:47.575 INFO ocrd.resource_manager.download - https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml to be downloaded to /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.yaml which already exists and overwrite is False
15:15:47.616 INFO ocrd.cli.resmgr - Installed resource https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml under /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.yaml
15:15:47.617 INFO ocrd.cli.resmgr - Use in parameters as 'PubLayNet_R_50_FPN_3x_JPLeoRX.yaml'
15:15:52.324 INFO ocrd.cli.resmgr - Downloading registered resource 'PubLayNet_R_50_FPN_3x_JPLeoRX.pth' (https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth)
15:15:55.692 INFO ocrd.resource_manager.download - https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth to be downloaded to /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.pth which already exists and overwrite is False
15:15:55.732 INFO ocrd.cli.resmgr - Installed resource https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth under /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.pth
15:15:55.732 INFO ocrd.cli.resmgr - Use in parameters as 'PubLayNet_R_50_FPN_3x_JPLeoRX.pth'
15:15:58.839 INFO processor.Detectron2Segment - Using compute device cpu
15:15:58.840 ERROR ocrd.ocrd-detectron2-segment.resolve_resource - Could not find resource 'PubLayNet_R_50_FPN_3x.yaml' for executable 'ocrd-detectron2-segment'. Try 'ocrd resmgr download ocrd-detectron2-segment PubLayNet_R_50_FPN_3x.yaml' to download this resource.
ERROR from called application: ExitCode=1
Maybe the JSON preset file is not correct?
Another try. I ran:
ocrd resmgr download ocrd-detectron2-segment PubLayNet_R_50_FPN_3x_JPLeoRX.yaml
ocrd resmgr download ocrd-detectron2-segment PubLayNet_R_50_FPN_3x_JPLeoRX.pth
ocrd-detectron2-segment -I OCR-D-BIN -O OCR-D-DETECTRON2-PubLayNet_R50_JPLeoRX -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50_JPLeoRX.json
And got this:
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -I OCR-D-BIN -O OCR-D-DETECTRON2-PubLayNet_R50_JPLeoRX -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50_JPLeoRX.json
15:42:19.806 INFO processor.Detectron2Segment - Using compute device cpu
15:42:19.806 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.yaml'
Traceback (most recent call last):
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
sys.exit(ocrd_detectron2_segment())
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
parameter=parameter
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 92, in __init__
self.setup()
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 117, in setup
cfg.merge_from_file(temp_config)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 103, in load_yaml_with_base
base_cfg = _load_with_base(base_cfg_file)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 93, in _load_with_base
return cls.load_yaml_with_base(base_cfg_file, allow_unsafe=allow_unsafe)
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 59, in load_yaml_with_base
with cls._open_cfg(filename) as f:
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 34, in _open_cfg
return PathManager.open(filename, "r")
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/iopath/common/file_io.py", line 1012, in open
bret = handler._open(path, mode, buffering=buffering, **kwargs) # type: ignore
File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/iopath/common/file_io.py", line 612, in _open
opener=opener,
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'
--> Please clarify.
I get the same error if I manually unzip "DocBank_X101" and copy the files to
/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.pth
and DocBank_X101.yaml respectively, while using
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'
This seems to be an issue with PubLayNet_R_50_FPN_3x_JPLeoRX.yaml, which has this
_BASE_: "../Base-RCNN-FPN.yaml"
which should be
_BASE_: "../configs/Base-RCNN-FPN.yaml"
I think. That solves the FileNotFoundError for me.
Unfortunately, this still gives me AssertionError: The chosen model's number of classes 80 does not match the given list of categories.
So I think this is an issue with the third-party models themselves, not ocrd_detectron2.
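To see where a given _BASE_ entry ends up pointing, a small check like the following may help. It is only a sketch: resolve_base and check_base are hypothetical helpers, not part of fvcore, Detectron2 or ocrd_detectron2, and it assumes (consistent with the paths in the tracebacks above) that a relative _BASE_ is joined onto the directory of the config file that contains it.

```python
import os
import re

# Hypothetical helpers for illustration only; the real lookup happens inside
# fvcore's config loader. A regex is used instead of a YAML parser to stay
# within the standard library.
_BASE_RE = re.compile(r'^_BASE_:\s*["\']?([^"\'\s]+)["\']?\s*$', re.MULTILINE)


def resolve_base(config_path):
    """Return the normalized path a _BASE_ entry points to, or None if absent."""
    with open(config_path, encoding="utf-8") as f:
        match = _BASE_RE.search(f.read())
    if match is None:
        return None
    base = os.path.expanduser(match.group(1))
    if not os.path.isabs(base):
        # Assumption: relative entries are resolved against the directory
        # of the config file, not against the current working directory.
        base = os.path.join(os.path.dirname(config_path), base)
    return os.path.normpath(base)


def check_base(config_path):
    """Return (resolved_base_path, exists) for a config file."""
    base = resolve_base(config_path)
    return base, base is not None and os.path.isfile(base)
```

Calling check_base on a config copied to a temporary directory shows directly whether "../Base-RCNN-FPN.yaml" or "../configs/Base-RCNN-FPN.yaml" is the entry that resolves to an existing file in that layout.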
@kba: Concerning Base-RCNN-FPN.yaml: if I search for it, I find it 5 times.
In which path is it searched for? (I just want to put a symlink there...)
(ocrd-3.7) ocrdadmin@ocrd-03:~$ find . -name Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-24/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
Well, I have put a soft link in all five places - as you can see here:
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ find ~ -name Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
--> Unfortunately, this has not worked :-(
I have created a symlink for the whole configs folder in /tmp like this:
(ocrd-3.7) ocrdadmin@ocrd-03:/tmp$ ln -s /home/ocrdadmin/ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/ configs
--> I did this because the error I always got now was this:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/configs/Base-RCNN-FPN.yaml'
--> This has worked, but I do not understand whether this is always the case or, in general, what the logic behind it is,
--> especially if I look further up in this issue, where the error was like this (with a random path!):
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'
--> So, what is the general solution?
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'
This seems to be an issue with PubLayNet_R_50_FPN_3x_JPLeoRX.yaml, which has this
_BASE_: "../Base-RCNN-FPN.yaml"
which should be
_BASE_: "../configs/Base-RCNN-FPN.yaml"
I think. That solves the FileNotFoundError for me.
Yes, some model providers make crazy assumptions about where in the original Detectron2 repo your CWD is. That's why I already have to do a temporary shutil.copytree into the Detectron2 distribution. I'll add a workaround for this case to the loader code, so we won't have to fix the config files manually (which I did for myself in the past).
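One way such a workaround could look is sketched below. This is not the ocrd_detectron2 loader code: stage_config and the "model" subdirectory are hypothetical, and instead of copying into the Detectron2 distribution it copies the other way round, placing the distribution's base configs next to the model config in a scratch directory, so that both "../Base-RCNN-FPN.yaml" and "../configs/Base-RCNN-FPN.yaml" resolve.

```python
import os
import shutil


def stage_config(model_config, base_configs_dir, workdir):
    """Copy base configs and one model config into workdir.

    Resulting layout (hypothetical, for illustration):
        workdir/configs/<base files>   <- "../configs/<name>.yaml" resolves here
        workdir/<base files>           <- "../<name>.yaml" resolves here
        workdir/model/<model config>   <- the file to hand to the config loader
    Returns the path of the staged model config.
    """
    configs = os.path.join(workdir, "configs")
    shutil.copytree(base_configs_dir, configs)
    # Some third-party configs reference "../<name>.yaml" directly,
    # so the top-level base files are duplicated one level up as well.
    for name in os.listdir(configs):
        src = os.path.join(configs, name)
        if os.path.isfile(src):
            shutil.copy(src, os.path.join(workdir, name))
    staged_dir = os.path.join(workdir, "model")
    os.makedirs(staged_dir)
    # shutil.copy returns the path of the newly created file.
    return shutil.copy(model_config, staged_dir)
```

In a real installation, base_configs_dir would presumably be the detectron2/model_zoo/configs directory that shows up in the find output earlier in this thread, and workdir a directory created with tempfile.TemporaryDirectory, which would also explain the random /tmp/tmp... path in the error messages.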
Unfortunately, this still gives me AssertionError: The chosen model's number of classes 80 does not match the given list of categories.
So I think this is an issue with the third-party models themselves, not ocrd_detectron2.
Indeed, this particular config is even worse. I took it from https://github.com/JPLeoRX/detectron2-publaynet. They get by with the vanilla COCO config (which is for photo scenery, not for PubLayNet document images), but override NUM_CLASSES at runtime.
Since this applies to all of JPLeoRX's models, and they are trained on PubLayNet just like hpanwar08's, I think it would suffice to just switch over to those configs. I'll make a fix.
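The mismatch behind that AssertionError can be illustrated with a minimal consistency check. This is a sketch, not the actual check in ocrd_detectron2: check_num_classes is a hypothetical name, and the category list is only an example (the five PubLayNet region types).

```python
def check_num_classes(num_classes, categories):
    """Raise if the model's class count and the category list disagree."""
    if num_classes != len(categories):
        raise ValueError(
            "model has %d classes, but %d categories were given: %s"
            % (num_classes, len(categories), ", ".join(categories))
        )


# Example category list for illustration (PubLayNet region types).
PUBLAYNET_CATEGORIES = ["text", "title", "list", "table", "figure"]
```

With a vanilla COCO config the class count stays at 80 unless it is overridden, so check_num_classes(80, PUBLAYNET_CATEGORIES) raises, while check_num_classes(5, PUBLAYNET_CATEGORIES) passes.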
The PubLayNet/JPLeoRX models should be fixed with 07fbdbf now. @stefanCCS could you please reinstall, redownload and try again?
@kba, @bertsky:
As usual, I would prefer just to have a new release of ocrd_all that includes the fixes for this issue #15, for #18, and for something related to OCR-D/core#970.
Will this be available in the near future?
Sure, the next ocrd_all will certainly update ocrd_detectron2 to 0.1.5. (You'll still need to run sudo make deps-ubuntu again though, since I doubt we will switch to Python 3.7 / Ubuntu 20.04 so quickly.)