elifesciences/sciencebeam-parser

RuntimeError: OSError: [Errno 2] No such file or directory [while running 'Map(<functools.partial object at 0x7f367367bdb8>)']

Closed this issue · 15 comments

What am I doing wrong? I installed ScienceBeam a month ago, but last week my VM got deleted. Now I have reinstalled ScienceBeam on it, but I always get this:

python2 -m sciencebeam.examples.grobid_service_pdf_to_xml --input "/home/lopo/"
INFO:__main__:default_values: {'runner': 'FnApiRunner'}
INFO:__main__:parsed_args: Namespace(autoscaling_algorithm='NONE', cloud=False, grobid_action='/processHeaderDocument', grobid_url='http://localhost:8080/api', input='/home/lopo/', max_num_workers=10, num_workers=10, output_path='/home/lopo', output_suffix='.tei-header.xml', project=None, runner='FnApiRunner', setup_file='./setup.py', start_grobid_service=True, xslt_path=None)
INFO:root:==================== <function annotate_downstream_side_inputs at 0x7f3672f040c8> ====================
INFO:root:==================== <function fix_side_input_pcoll_coders at 0x7f3672f04758> ====================
INFO:root:==================== <function lift_combiners at 0x7f3672f042a8> ====================
INFO:root:==================== <function expand_gbk at 0x7f3672f041b8> ====================
INFO:root:==================== <function sink_flattens at 0x7f3672f04140> ====================
INFO:root:==================== <function greedily_fuse at 0x7f3672f047d0> ====================
INFO:root:==================== <function sort_stages at 0x7f3672f04848> ====================
INFO:root:Running (ref_AppliedPTransform__ReadFullFile/Read_3)+((ref_AppliedPTransform_Map(<functools.partial object at 0x7f367367bdb8>)_4)+((ref_AppliedPTransform_MapKeys_5)+(ref_AppliedPTransform_WriteToFile/Map()_7)))
INFO:root:start <DoOperation WriteToFile/Map() output_tags=['out']>
INFO:root:start <DoOperation MapKeys output_tags=['out']>
INFO:root:start <DoOperation Map(<functools.partial object at 0x7f367367bdb8>) output_tags=['out']>
INFO:root:start <ReadOperation _ReadFullFile/Read source=SourceBundle(weight=1.0, source=<sciencebeam.beam_utils.fileio._ReadFullFileSource object at 0x7f3672f6d4d0>, start_position=None, stop_position=None)>
INFO:sciencebeam.transformers.grobid_service_wrapper:grobid_service_instance: None
INFO:sciencebeam.transformers.grobid_service_wrapper:command_line: java -cp "/home/lopo/stuff/sciencebeam/.temp/grobid-service/lib/" org.grobid.service.main.GrobidServiceApplication
INFO:sciencebeam.transformers.grobid_service_wrapper:args: ['java', '-cp', '/home/tlopo/stuff/sciencebeam/.temp/grobid-service/lib/
', 'org.grobid.service.main.GrobidServiceApplication']
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/lopo/stuff/sciencebeam/sciencebeam/examples/grobid_service_pdf_to_xml.py", line 190, in
run()
File "/home/lopo/stuff/sciencebeam/sciencebeam/examples/grobid_service_pdf_to_xml.py", line 182, in run
configure_pipeline(p, known_args)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 410, in exit
self.run().wait_until_finish()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 390, in run
self.to_runner_api(), self.runner, self._options).run(False)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 403, in run
return self.runner.run_pipeline(self)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 218, in run_pipeline
return self.run_via_runner_api(pipeline.to_runner_api())
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 221, in run_via_runner_api
return self.run_stages(*self.create_stages(pipeline_proto))
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 859, in run_stages
pcoll_buffers, safe_coders).process_bundle.metrics
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 970, in run_stage
self._progress_frequency).process_bundle(data_input, data_output)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 1174, in process_bundle
result_future = self._controller.control_handler.push(process_bundle)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 1054, in push
response = self.worker.do_instruction(request)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 208, in do_instruction
request.instruction_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 230, in process_bundle
processor.process_bundle(instruction_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.py", line 289, in process_bundle
op.start()
File "apache_beam/runners/worker/operations.py", line 243, in apache_beam.runners.worker.operations.ReadOperation.start
File "apache_beam/runners/worker/operations.py", line 244, in apache_beam.runners.worker.operations.ReadOperation.start
File "apache_beam/runners/worker/operations.py", line 253, in apache_beam.runners.worker.operations.ReadOperation.start
File "apache_beam/runners/worker/operations.py", line 175, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 85, in apache_beam.runners.worker.operations.ConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 403, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 404, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 569, in apache_beam.runners.common.DoFnRunner.receive
File "apache_beam/runners/common.py", line 577, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 618, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 575, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 353, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/core.py", line 973, in
wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
File "/home/lopo/stuff/sciencebeam/sciencebeam/transformers/grobid_service.py", line 52, in run_grobid_service
start_service_if_not_running()
File "/home/lopo/stuff/sciencebeam/sciencebeam/transformers/grobid_service.py", line 25, in start_service_if_not_running
service_wrapper.start_service_if_not_running()
File "/home/lopo/stuff/sciencebeam/sciencebeam/transformers/grobid_service_wrapper.py", line 98, in start_service_if_not_running
args, cwd=cwd, stdout=PIPE, stderr=subprocess.STDOUT
File "/usr/lib/python2.7/subprocess.py", line 394, in init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1047, in _execute_child
raise child_exception
RuntimeError: OSError: [Errno 2] No such file or directory [while running 'Map(<functools.partial object at 0x7f367367bdb8>)']

Hi, thank you for raising the issue and providing the log.

It is trying to start GROBID, and that seems to fail. Previously, the GROBID instance was downloaded on demand.
I have moved away from that approach in favour of Docker (the only downside is that it makes running on Google Dataflow more difficult).

Additionally, I deprecated sciencebeam.examples.grobid_service_pdf_to_xml in favour of sciencebeam.pipeline_runners.beam_pipeline_runner.

So in your case I would recommend doing the following:

Install Docker.

Start GROBID using Docker: docker run --rm -p 8070:8070 lfoppiano/grobid:0.5.1

Using the old pipeline: python2 -m sciencebeam.examples.grobid_service_pdf_to_xml --grobid-url=http://localhost:8070/api --input "/home/lopo/*.pdf"

Or using the new pipeline: python2 -m sciencebeam.pipeline_runners.beam_pipeline_runner --grobid-url=http://localhost:8070/api --source-path "/home/lopo/*.pdf"

Please let me know whether that works for you.

(I'd also be interested in your use-case)

feq70 commented

Thanks, but when I use the old pipeline I get:

raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8070/api/processFulltextDocument [while running 'Map(<functools.partial object at 0x7fe79a257100>)']

and when using the new pipeline I get:
/usr/bin/python2: No module named pipeline_runners

Installing Docker worked so far, but now I can't convert PDFs to XML anymore.

Thanks for helping me out.

If you are getting No module named pipeline_runners, then you are either not using the latest source, or your environment isn't finding the sciencebeam module. I just tried it myself and it works.

Although I realised the --data-path parameter is required in combination with the new pipeline, e.g.: python2 -m sciencebeam.pipeline_runners.beam_pipeline_runner --grobid-url=http://localhost:8070/api --data-path "/home/lopo" --source-path "/home/lopo/*.pdf"

I am not quite sure why you would get 400 Client Error. It could be that one of the PDF files is corrupt.

You could also try to run ScienceBeam itself as a Docker container:

cd /home/lopo
docker pull elifesciences/sciencebeam
docker run --rm --net host -v"$(pwd):/data" elifesciences/sciencebeam \
  python -m sciencebeam.pipeline_runners.beam_pipeline_runner \
  --grobid-url=http://localhost:8070/api \
  --data-path "/data" \
  --source-path "/data/*.pdf" \
  --output-path "/data/output"

This will mount the current directory (/home/lopo) to /data within the container, and it will write output to /data/output (i.e. /home/lopo/output on the host). It assumes that GROBID is already running.

(The same could be achieved using docker-compose which could manage starting GROBID as well)

feq70 commented

I entered this:

 cd /home/lopo
 docker pull elifesciences/sciencebeam
 docker run --rm --net host -v"$(pwd):/data" elifesciences/sciencebeam \

python -m sciencebeam.pipeline_runners.beam_pipeline_runner
--grobid-url=http://localhost:8070/api
--data-path "/data"
--source-path "/data/*.pdf"
--output-path "/data/output"

And got this:

python2 -m sciencebeam.pipeline_runners.beam_pipeline_runner --grobid-url=http://localhost:8070/api --data-path "/data" --source-path "/data/*.pdf" --output-path "/data/output"
INFO:sciencebeam_gym.beam_utils.main:default_values: {'runner': 'DirectRunner'}
INFO:__main__:args: Namespace(autoscaling_algorithm='NONE', base_data_path='/data', cloud=False, data_path='/data', debug=False, grobid_action=None, grobid_url='http://localhost:8070/api', grobid_xslt_path='xslt/grobid-jats.xsl', job_name=None, job_name_suffix=None, limit=None, max_num_workers=1, no_grobid_pretty_print=False, no_grobid_xslt=False, num_workers=1, output_path='/data/output', output_suffix='.xml', pipeline=None, project=None, resume=False, runner='DirectRunner', setup_file='./setup.py', source_file_column='url', source_file_list=None, source_path='/data/*.pdf')
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/ubuntu/tos/install/sciencebeam/sciencebeam/pipeline_runners/beam_pipeline_runner.py", line 296, in
run()
File "/home/ubuntu/tos/install/sciencebeam/sciencebeam/pipeline_runners/beam_pipeline_runner.py", line 288, in run
configure_pipeline(p, args, pipeline, config)
File "/home/ubuntu/tos/install/sciencebeam/sciencebeam/pipeline_runners/beam_pipeline_runner.py", line 154, in configure_pipeline
steps = pipeline.get_steps(config, opt)
File "/home/ubuntu/tos/install/sciencebeam/sciencebeam/pipelines/init.py", line 45, in get_steps
for step in pipeline.get_steps(config, args)
File "/home/ubuntu/tos/install/sciencebeam/sciencebeam/pipelines/grobid_pipeline.py", line 86, in get_steps
pretty_print=not args.no_grobid_pretty_print
File "/home/ubuntu/tos/install/sciencebeam/sciencebeam/transformers/xslt.py", line 15, in xslt_transformer_from_file
ET.tostring(ET.parse(xslt_filename)),
File "src/lxml/etree.pyx", line 3425, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 637, in lxml.etree._raiseParseError
IOError: Error reading file 'xslt/grobid-jats.xsl': failed to load external entity "xslt/grobid-jats.xsl"

Do you know why this fails?

Thanks

Are you positive that the following was entered as a single command? You do need the trailing backslashes to make sure it's treated as a single command.

docker run --rm --net host -v"$(pwd):/data" elifesciences/sciencebeam \
  python -m sciencebeam.pipeline_runners.beam_pipeline_runner \
  --grobid-url=http://localhost:8070/api \
  --data-path "/data" \
  --source-path "/data/*.pdf" \
  --output-path "/data/output"

(The python command is meant to run within the container, so there is no need to add 2 to the python command; I can see the 2 in your output, but it shouldn't appear if the command is copied as above.)

The docker pull elifesciences/sciencebeam was meant to make sure it's pulling the latest image.

feq70 commented

You were right. Now I entered the full command, and for all PDFs I get something similar to the earlier error:

 ConnectionError: HTTPConnectionPool(host='localhost', port=8070): Max retries exceeded with url: /api/processFulltextDocument (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd4349d8cd0>: Failed to establish a new connection: [Errno 111] Connection refused',))

That sounds like either the GROBID container is not running or is not reachable because the --net host parameter is missing. You could try to telnet localhost 8070 to ensure it's reachable locally.
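If telnet isn't installed, the same reachability check can be scripted; a small sketch using Python's socket module (not part of ScienceBeam):

```python
import socket


def port_open(host, port, timeout=3):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return s.connect_ex((host, port)) == 0


if __name__ == '__main__':
    print('GROBID port reachable:', port_open('localhost', 8070))
```

Note this only confirms something is listening on the port, not that it is GROBID.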

feq70 commented

You are right:
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused

So is this working after starting the GROBID container?

feq70 commented

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
So this works great now.
But the output folder stays empty for some reason.
In the gob folder there is data, and in data there is output.

 root@mol:~/gob$ sudo docker run --rm --net host -v"$(pwd):/data" elifesciences/sciencebeam \

python -m sciencebeam.pipeline_runners.beam_pipeline_runner
--grobid-url=http://localhost:8070/api
--data-path "/data"
--source-path "/data/.pdf"
--output-path "/data/output"
INFO:sciencebeam_gym.beam_utils.main:default_values: {'runner': 'DirectRunner'}
INFO:__main__:args: Namespace(autoscaling_algorithm='NONE', base_data_path='/data', cloud=False, data_path='/data', debug=False, grobid_action=None, grobid_url='http://localhost:8070/api', grobid_xslt_path='xslt/grobid-jats.xsl', job_name=None, job_name_suffix=None, limit=None, max_num_workers=1, no_grobid_pretty_print=False, no_grobid_xslt=False, num_workers=1, output_path='/data/output', output_suffix='.xml', pipeline=None, project=None, resume=False, runner='DirectRunner', setup_file='./setup.py', source_file_column='url', source_file_list=None, source_path='/data/*.pdf')
INFO:main:steps: [DocToPdfStep(DOC to PDF), FunctionPipelineStep(Convert to TEI), FunctionPipelineStep(TEI to JATS)]
INFO:root:==================== <function annotate_downstream_side_inputs at 0x7ff765027a28> ====================
INFO:root:==================== <function lift_combiners at 0x7ff765027b18> ====================
INFO:root:==================== <function expand_gbk at 0x7ff765027ed8> ====================
INFO:root:==================== <function sink_flattens at 0x7ff765027b90> ====================
INFO:root:==================== <function greedily_fuse at 0x7ff765027c08> ====================
INFO:root:==================== <function sort_stages at 0x7ff765027500> ====================
INFO:root:Running (ref_AppliedPTransform_Create/Read_3)+((ref_AppliedPTransform_PreventFusion/AddKey_5)+(PreventFusion/GroupByKey/Write))
INFO:root:start <DataOutputOperation PreventFusion/GroupByKey/Write >
INFO:root:start <DoOperation PreventFusion/AddKey output_tags=['out']>
INFO:root:start <ReadOperation Create/Read source=SourceBundle(weight=1.0, source=<apache_beam.transforms.core._CreateSource object at 0x7ff76495e6d0>, start_position=None, stop_position=None)>
INFO:root:finish <ReadOperation Create/Read source=SourceBundle(weight=1.0, source=<apache_beam.transforms.core._CreateSource object at 0x7ff76495e6d0>, start_position=None, stop_position=None), receivers=[ConsumerSet[Create/Read.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation PreventFusion/AddKey output_tags=['out'], receivers=[ConsumerSet[PreventFusion/AddKey.out0, coder=WindowedValueCoder[TupleCoder[LengthPrefixCoder[FastPrimitivesCoder], LengthPrefixCoder[FastPrimitivesCoder]]], len(consumers)=1]]>
INFO:root:finish <DataOutputOperation PreventFusion/GroupByKey/Write >
INFO:root:Running (((PreventFusion/GroupByKey/Read)+((ref_AppliedPTransform_PreventFusion/Ungroup_10)+((ref_AppliedPTransform_ReadFileContent/Map()_12)+(ref_AppliedPTransform_ReadFileContent/Count_13))))+((ref_AppliedPTransform_Determine Type_14)+(ref_AppliedPTransform_DOC to PDF_15)))+((ref_AppliedPTransform_Convert to TEI_16)+(((ref_AppliedPTransform_TEI to JATS_17)+(ref_AppliedPTransform_Map()_18))+((ref_AppliedPTransform_WriteOutput/Map()_20)+(ref_AppliedPTransform_WriteOutput/Log_21))))
INFO:root:start <DoOperation WriteOutput/Log output_tags=['out']>
INFO:root:start <DoOperation Map() output_tags=['out']>
INFO:root:start <DoOperation WriteOutput/Map() output_tags=['out']>
INFO:root:start <DoOperation TEI to JATS output_tags=['out']>
INFO:root:start <DoOperation Convert to TEI output_tags=['out']>
INFO:root:start <DoOperation DOC to PDF output_tags=['out']>
INFO:root:start <DoOperation Determine Type output_tags=['out']>
INFO:root:start <DoOperation ReadFileContent/Count output_tags=['out']>
INFO:root:start <DoOperation ReadFileContent/Map() output_tags=['out']>
INFO:root:start <DoOperation PreventFusion/Ungroup output_tags=['out']>
INFO:root:start <DataInputOperation PreventFusion/GroupByKey/Read receivers=[ConsumerSet[PreventFusion/GroupByKey/Read.out0, coder=WindowedValueCoder[TupleCoder[LengthPrefixCoder[FastPrimitivesCoder], IterableCoder[LengthPrefixCoder[FastPrimitivesCoder]]]], len(consumers)=1]]>
INFO:root:finish <DataInputOperation PreventFusion/GroupByKey/Read receivers=[ConsumerSet[PreventFusion/GroupByKey/Read.out0, coder=WindowedValueCoder[TupleCoder[LengthPrefixCoder[FastPrimitivesCoder], IterableCoder[LengthPrefixCoder[FastPrimitivesCoder]]]], len(consumers)=1]]>
INFO:root:finish <DoOperation PreventFusion/Ungroup output_tags=['out'], receivers=[ConsumerSet[PreventFusion/Ungroup.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation ReadFileContent/Map() output_tags=['out'], receivers=[ConsumerSet[ReadFileContent/Map().out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation ReadFileContent/Count output_tags=['out'], receivers=[ConsumerSet[ReadFileContent/Count.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation Determine Type output_tags=['out'], receivers=[ConsumerSet[Determine Type.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation DOC to PDF output_tags=['out'], receivers=[ConsumerSet[DOC to PDF.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation Convert to TEI output_tags=['out'], receivers=[ConsumerSet[Convert to TEI.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation TEI to JATS output_tags=['out'], receivers=[ConsumerSet[TEI to JATS.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=2]]>
INFO:root:finish <DoOperation WriteOutput/Map() output_tags=['out'], receivers=[ConsumerSet[WriteOutput/Map().out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
INFO:root:finish <DoOperation Map() output_tags=['out'], receivers=[ConsumerSet[Map().out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=0]]>
INFO:root:finish <DoOperation WriteOutput/Log output_tags=['out'], receivers=[ConsumerSet[WriteOutput/Log.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=0]]>

Are you sure you are in the correct directory? It doesn't seem to process any file. Instead of pwd you could also add in the path, e.g.:

docker run --rm --net host -v "/home/lopo:/data" elifesciences/sciencebeam \
  python -m sciencebeam.pipeline_runners.beam_pipeline_runner \
  --grobid-url=http://localhost:8070/api \
  --data-path "/data" \
  --source-path "/data/*.pdf" \
  --output-path "/data/output"

You can also check the mounted directory contains the files, e.g.:

docker run --rm -v"$(pwd):/data" elifesciences/sciencebeam ls -l /data
feq70 commented

It's happening again, but this time I get a 400 Client Error:

raise HTTPError(http_error_msg, response=self)

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8070/api/processHeaderDocument [while running 'Map(<functools.partial object at 0x7efcf0c6e100>)']

roor@tvm:~/toq$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
87085sdsd16 lfoppiano/grobid:0.5.1 "/tini -- ./grobid-s…" 11 minutes ago Up 11 minutes 0.0.0.0:8070->8070/tcp serene_ardinghelli

ntu@tos-vm:~/tos/feq$ telnet localhost 8070
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

What happens if you submit the same file to GROBID directly? e.g. via the browser? Have you tried a different file?

Can this issue be closed?

Closing due to inactivity, happy to re-open