seqio_cache_tasks fails on DataflowRunner
bzz opened this issue · 3 comments
When trying to cache a dataset that is too big for the DirectRunner (e.g. google-research/text-to-text-transfer-transformer#323 (comment)) on Cloud Dataflow without any `requirements.txt`, like

```sh
python -m seqio.scripts.cache_tasks_main \
  --module_import="..." \
  --tasks="${TASK_NAME}" \
  --output_cache_dir="${BUCKET}/cache" \
  --alsologtostderr \
  --pipeline_options="--runner=DataflowRunner,--project=$PROJECT,--region=$REGION,--job_name=$TASK_NAME,--staging_location=$BUCKET/binaries,--temp_location=$BUCKET/tmp,--experiments=shuffle_mode=appliance"
```

it fails with `ModuleNotFoundError: No module named 'seqio'`.
If `seqio` is added with

```sh
echo seqio > /tmp/beam_requirements.txt
# then run the same command, adding to --pipeline_options:
#   --requirements_file=/tmp/beam_requirements.txt
```

it fails with

```
subprocess.CalledProcessError: Command '['.../.venv/bin/python', '-m', 'pip', 'download', '--dest', '..../pip-tmp/dataflow-requirements-cache', '-r', '/tmp/beam_requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.
Pip install failed for package: -r
Output from execution of subprocess: b"ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)\nERROR: No matching distribution found for tensorflow-text
```
This seems to be caused by `seqio` depending on `tensorflow-text`, which does not publish any source release artifacts, while the requirements cache in Apache Beam is populated with `--no-binary :all:` before being made available to the workers.
A try in a clean venv reproduces the same error:

```sh
$ pip3 install --no-binary :all: --no-deps tensorflow-text==2.6.0
ERROR: Could not find a version that satisfies the requirement tensorflow-text==2.6.0 (from versions: none)
ERROR: No matching distribution found for tensorflow-text==2.6.0
```
Am I doing something wrong, or how does everyone else work around this? I would appreciate a hand here.
In case anyone else stumbles upon this or lands here through search: kind people in the Apache Beam community pointed out https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython. Adding `pip install seqio` as a custom command in `setup.py` and passing it via `--setup_file=$PWD/setup.py` did the trick.
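For reference, a minimal sketch of such a `setup.py`, adapted from the Beam "non-Python dependencies" recipe linked above; the package name, version, and command list are placeholders:

```python
# setup.py -- sketch adapted from the Apache Beam custom-commands recipe;
# package name/version below are placeholders.
import subprocess
from distutils.command.build import build as _build

import setuptools

# Commands to run on each Dataflow worker while the package is built.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'seqio'],
]


class build(_build):
    """Adds the custom commands to the default build step."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    """Runs CUSTOM_COMMANDS and fails the build if any of them fails."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            print('Running command: %s' % command)
            subprocess.check_call(command)


setuptools.setup(
    name='seqio-cache-tasks-workflow',  # placeholder
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    },
)
```

The cache script is then launched exactly as before, with `--setup_file=$PWD/setup.py` added to the pipeline options.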
I'll be happy to submit a doc patch with instructions in case anyone points me to the right place to put it.
Just to add another possible solution: after some time trying, what finally worked for me was a combination of `setup.py` and a custom Docker image for the Dataflow workers (https://cloud.google.com/dataflow/docs/guides/using-custom-containers). `setup.py` is used only to package the code that defines the tasks and preprocessors, while the other requirements (including `seqio` and `t5`) are specified in the Dockerfile.
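For illustration, a hypothetical sketch of such a worker image, assuming the Beam SDK base images described in the custom-containers guide; the base image tag and package list are placeholders and should match your launch environment:

```dockerfile
# Dockerfile -- hypothetical sketch of a custom Dataflow worker image.
# The base image tag must match the Beam SDK version used to launch the job.
FROM apache/beam_python3.8_sdk:2.35.0

# Heavy dependencies live in the image, so the workers never try to build
# tensorflow-text from source via --no-binary.
RUN pip install --no-cache-dir seqio t5
```

The image is built and pushed to a container registry, the workers are pointed at it with the `--sdk_container_image` pipeline option, and `--setup_file` still ships the task and preprocessor definitions.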