google/seqio

seqio_cache_tasks fails on DataflowRunner

bzz opened this issue · 3 comments

bzz commented

When trying to cache a dataset that is too large for DirectRunner (e.g. google-research/text-to-text-transfer-transformer#323 (comment)) on Cloud Dataflow without any requirements.txt, like

python -m seqio.scripts.cache_tasks_main \
 --module_import="..." \
 --tasks="${TASK_NAME}" \
 --output_cache_dir="${BUCKET}/cache" \
 --alsologtostderr \
 --pipeline_options="--runner=DataflowRunner,--project=$PROJECT,--region=$REGION,--job_name=$TASK_NAME,--staging_location=$BUCKET/binaries,--temp_location=$BUCKET/tmp,--experiments=shuffle_mode=appliance"

it fails with ModuleNotFoundError: No module named 'seqio'.

If seqio is added with

echo seqio > /tmp/beam_requirements.txt

# and run the same, adding to `--pipeline_options`
--requirements_file=/tmp/beam_requirements.txt

it fails with

subprocess.CalledProcessError: Command '['.../.venv/bin/python', '-m', 'pip', 'download', '--dest', '..../pip-tmp/dataflow-requirements-cache', '-r', '/tmp/beam_requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.

 Pip install failed for package: -r
 Output from execution of subprocess: b"ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)\nERROR: No matching distribution found for tensorflow-text

This seems to be caused by seqio depending on tensorflow-text, which does not publish any source release artifacts.

But the requirements cache in Apache Beam seems to be populated with `--no-binary :all:` before being made available to the workers.

Trying the same in a clean venv reproduces it:

pip3 install --no-binary :all: --no-deps tensorflow-text==2.6.0
ERROR: Could not find a version that satisfies the requirement tensorflow-text==2.6.0 (from versions: none)
ERROR: No matching distribution found for tensorflow-text==2.6.0

Am I doing something wrong, or how does everyone work around this? I would appreciate a hand here.

bzz commented

In case anyone else stumbles upon this or lands here through search: the kind people in the Apache Beam community pointed me to https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython

Adding `pip install seqio` as a custom command in setup.py and passing it through `--setup_file=$PWD/setup.py` did the trick.
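
A minimal sketch of such a setup.py, following the custom-commands pattern from the Beam docs linked above (the project name and the exact pip command are placeholders to adapt):

# setup.py, staged to the workers via --setup_file
import subprocess
import setuptools
from distutils.command.build import build as _build

# Commands executed on each worker as part of the build step.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'seqio'],
]

class build(_build):
  # Hook the custom commands into the regular build.
  sub_commands = _build.sub_commands + [('CustomCommands', None)]

class CustomCommands(setuptools.Command):
  user_options = []

  def initialize_options(self):
    pass

  def finalize_options(self):
    pass

  def run(self):
    for command in CUSTOM_COMMANDS:
      subprocess.check_call(command)

setuptools.setup(
    name='seqio_cache_tasks_workaround',  # placeholder name
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)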

I'll be happy to submit a doc patch with instructions if someone points me to the right place to put it.

@bzz please do add these details to the README

Just to add another possible solution: after some time trying, what finally worked for me was a combination of setup.py and a custom Docker image for the Dataflow workers (https://cloud.google.com/dataflow/docs/guides/using-custom-containers). setup.py is used only to package the code needed for the task and preprocessor definitions, and the other requirements (including seqio and t5) can be specified in the Dockerfile.
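
For reference, a rough sketch of such a worker image (the base image tag and the package list are illustrative; pick the ones matching your Beam SDK and TensorFlow versions):

# Dockerfile for the Dataflow worker image (sketch)
FROM apache/beam_python3.8_sdk:2.35.0
RUN pip install --no-cache-dir seqio t5

The image is then passed to Dataflow via the `--sdk_container_image` pipeline option (`--worker_harness_container_image` on older SDKs), while `--setup_file` keeps pointing at the small setup.py that packages the task/preprocessor modules.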