C4 fails on Google Dataflow
WenqiJiang opened this issue · 1 comment
Describe the bug
I experienced problems (worker failures) when reproducing the C4 experiments. I rented an instance on GCP and followed the instructions in the readme for running C4 on Google Dataflow. Here, instead of using tfds-nightly, I used tensorflow-datasets, because the former caused another problem that I show at the end of this report.
Environment setup (I followed the instructions, except that I used tensorflow-datasets):
pip install tensorflow
pip install google-apitools
pip install tensorflow-datasets
pip install tensorflow-datasets[c4]
echo 'tensorflow-datasets[c4]' > /tmp/beam_requirements.txt
# I also installed Rust and gcld3, not showing here
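For completeness, a minimal sanity check of the environment (not part of the readme instructions) that just confirms the installed versions match the ones listed under Versions below:
# Minimal environment sanity check (not part of the readme instructions).
import tensorflow as tf
import tensorflow_datasets as tfds
import apache_beam as beam

print("tensorflow:", tf.__version__)              # 2.8.0 in my case
print("tensorflow_datasets:", tfds.__version__)   # 4.5.2 in my case
print("apache_beam:", beam.__version__)           # pulled in by tensorflow-datasets[c4]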
Running the experiment (following the instructions):
MY_REGION=us-east1
MY_BUCKET=c4_dataset
MY_PROJECT=my_project_name
python3 -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=c4/en \
--data_dir=gs://$MY_BUCKET/tensorflow_datasets \
--beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,region=$MY_REGION"
I also tried the tfds command line, but the result was the same:
tfds build c4/en --data_dir=gs://$MY_BUCKET/tensorflow_datasets --beam_pipeline_options="runner=DataflowRunner,project=$MY_PROJECT,job_name=c4-gen,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,requirements_file=/tmp/beam_requirements.txt,region=$MY_REGION"
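For reference, the same job can also be launched programmatically rather than through the CLI. Below is a rough, untested sketch based on the TFDS Beam-dataset documentation; the option values simply mirror the flags used above.
# Rough sketch of launching the same generation job from Python (untested);
# option values mirror the --beam_pipeline_options flags above.
import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    runner="DataflowRunner",
    project="my_project_name",                      # $MY_PROJECT
    job_name="c4-gen",
    region="us-east1",                              # $MY_REGION
    staging_location="gs://c4_dataset/binaries",    # gs://$MY_BUCKET/binaries
    temp_location="gs://c4_dataset/temp",           # gs://$MY_BUCKET/temp
    requirements_file="/tmp/beam_requirements.txt",
)

builder = tfds.builder("c4/en", data_dir="gs://c4_dataset/tensorflow_datasets")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options),
)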
Here is what I got. It seems the first worker started but made no progress, and the job failed due to a timeout after 1 hour.
Here is the log (only showing the bottom portion that may be related to the problem):
INFO[dataflow_runner.py]: 2022-02-22T17:43:19.975Z: JOB_MESSAGE_DEBUG: Value "ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Session" materialized.
INFO[dataflow_runner.py]: 2022-02-22T17:43:20.046Z: JOB_MESSAGE_BASIC: Executing operation create_wet_path_urls/Read+ReadAllFromText/ReadAllFiles/ExpandIntoRanges+ReadAllFromText/ReadAllFiles/Reshard/AddRandomKeys+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/Map(reify_timestamps)+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Reify+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Write
INFO[dataflow_runner.py]: 2022-02-22T17:43:43.898Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently running stage(s).
INFO[dataflow_runner.py]: 2022-02-22T17:44:17.947Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO[dataflow_runner.py]: 2022-02-22T17:44:17.970Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO[transport.py]: Refreshing due to a 401 (attempt 1/2)
INFO[transport.py]: Refreshing due to a 401 (attempt 1/2)
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.001Z: JOB_MESSAGE_ERROR: Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.069Z: JOB_MESSAGE_BASIC: Cancel request is committed for workflow job: 2022-02-22_09_42_58-4859348570047340925.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.097Z: JOB_MESSAGE_BASIC: Finished operation create_wet_path_urls/Read+ReadAllFromText/ReadAllFiles/ExpandIntoRanges+ReadAllFromText/ReadAllFiles/Reshard/AddRandomKeys+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/Map(reify_timestamps)+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Reify+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Write
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.104Z: JOB_MESSAGE_ERROR: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.555Z: JOB_MESSAGE_DETAILED: Cleaning up.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.628Z: JOB_MESSAGE_DEBUG: Starting worker pool teardown.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.671Z: JOB_MESSAGE_BASIC: Stopping worker pool...
INFO[dataflow_runner.py]: 2022-02-22T18:45:30.603Z: JOB_MESSAGE_DETAILED: Autoscaling: Resized worker pool from 1 to 0.
INFO[dataflow_runner.py]: 2022-02-22T18:45:30.647Z: JOB_MESSAGE_BASIC: Worker pool stopped.
INFO[dataflow_runner.py]: 2022-02-22T18:45:30.689Z: JOB_MESSAGE_DEBUG: Tearing down pending resources...
INFO[dataflow_runner.py]: Job 2022-02-22_09_42_58-4859348570047340925 is in state JOB_STATE_FAILED
Traceback (most recent call last):
File "/home/contact_ds3lab/c4-tensorflow-datasets/bin/tfds", line 8, in <module>
sys.exit(launch_cli())
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/main.py", line 102, in launch_cli
app.run(main, flags_parser=_parse_flags)
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/main.py", line 97, in main
args.subparser_fn(args)
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/build.py", line 192, in _build_datasets
_download_and_prepare(args, builder)
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/build.py", line 345, in _download_and_prepare
download_config=dl_config,
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 464, in download_and_prepare
download_config=download_config,
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1197, in _download_and_prepare
split_info_futures.append(future)
File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/core/split_builder.py", line 173, in maybe_beam_pipeline
self._beam_pipeline.__exit__(None, None, None)
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/apache_beam/pipeline.py", line 597, in __exit__
self.result.wait_until_finish()
File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1639, in wait_until_finish
self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I tried several times, but every attempt failed with the same 1-hour timeout.
Versions
Python=3.7.3
Pip=22.0.3
Tensorflow=2.8.0
Tensorflow_datasets=4.5.2
Expected behavior
The Dataflow job runs to completion without failing.
Additional Info when using tfds-nightly
This is what I got by following the instructions:
pip install tfds-nightly[c4]
echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt
python3 -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=c4/en \
--data_dir=gs://$MY_BUCKET/tensorflow_datasets \
--beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,region=$MY_REGION"
Error logs (complaining about "inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'"):
Requested tensorflow-datasets from https://files.pythonhosted.org/packages/2e/6d/71c2cf8a583da35cb93b183330654dce9a57c59fa8622b9aa778059de822/tfds-nightly-0.0.1.dev201811110013.tar.gz#sha256=2e9fdb002da9aa79375c6448839a69cfce66a26f64120312f7fc98f6167f1352 (from -r /tmp/tmpthuohvb2/tmp_requirements.txt (line 1)) has inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'\n Using cached tfds-nightly-0.0.1.dev201811100013.tar.gz (47 kB)\n Preparing metadata (setup.py): started\n Preparing metadata (setup.py): finished with status 'done'\n WARNING: Generating metadata for package tfds-nightly produced metadata for project name tensorflow-datasets. Fix your #egg=tfds-nightly fragments.\nDiscarding https://files.pythonhosted.org/packages/2a/34/088cf3df80424abcb92227acab7267358ba600b0d8a47ae53aa960384683/tfds-nightly-0.0.1.dev201811100013.tar.gz#sha256=fee1ced83141e36ee52eac245f0ce250b46f68e66a3a02f6093e18b96f274123 (from https://pypi.org/simple/tfds-nightly/):
Requested tensorflow-datasets from https://files.pythonhosted.org/packages/2a/34/088cf3df80424abcb92227acab7267358ba600b0d8a47ae53aa960384683/tfds-nightly-0.0.1.dev201811100013.tar.gz#sha256=fee1ced83141e36ee52eac245f0ce250b46f68e66a3a02f6093e18b96f274123 (from -r /tmp/tmpthuohvb2/tmp_requirements.txt (line 1)) has inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'\n Using cached tfds-nightly-0.0.1.dev201811091944.tar.gz (47 kB)\n Preparing metadata (setup.py): started\n Preparing metadata (setup.py): finished with status 'done'\n WARNING: Generating metadata for package tfds-nightly produced metadata for project name tensorflow-datasets. Fix your #egg=tfds-nightly fragments.\nDiscarding https://files.pythonhosted.org/packages/cf/1c/df05d1cc9ca9fc5c0b0b87f33d4be82f1e618d240260dfe5dd60fdf36362/tfds-nightly-0.0.1.dev201811091944.tar.gz#sha256=5a701264bad4e6b38b22f3aca18fbc103e0bef3378b5e1505170e236deb80fa6 (from https://pypi.org/simple/tfds-nightly/): Requested tensorflow-datasets from https://files.pythonhosted.org/packages/75/a9/08b4119be5f3f611a13107c79656d017e54191c777b000554a648635fcbd/tfds-nightly-0.0.1.dev20181106.tar.gz#sha256=1749bd45c0cde5764786c9217a2d145ab3274543052702fa277d9a5eb188c71d (from -r /tmp/tmpthuohvb2/tmp_requirements.txt (line 1)) has inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'\nERROR: Could not find a version that satisfies the requirement tfds-nightly[c4] (from versions: 0.0.1.dev20181106, 0.0.1.dev201811070013, 0.0.1.dev201811080013, 0.0.1.dev201811090014, 0.0.1.dev201811091944, 0.0.1.dev201811100013,
...
skipping a lot of versions here in the log
...
4.5.2.dev202202150045, 4.5.2.dev202202160044, 4.5.2.dev202202170043, 4.5.2.dev202202180044, 4.5.2.dev202202190044, 4.5.2.dev202202200044, 4.5.2.dev202202210043, 4.5.2.dev202202220043)\nERROR: No matching distribution found for tfds-nightly[c4]\n"
INFO[stager.py]: Executing command: ['/home/contact_ds3lab/c4-tfds-nightly/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/tmp/tmpthuohvb2/tmp_requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']
Problem solved after switching to Ubuntu 20.04. It seems this is related to the OS.
I used Debian 10 to run the experiment. When running pip install tensorflow-datasets, an error appeared saying gcld3 could not be installed, so I installed it manually from source. This might be what caused the failure later on.
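For what it's worth, a quick way to check whether a source-built gcld3 actually works is the snippet below (usage taken from the gcld3 README; not part of the C4 instructions):
# Sanity check that the source-built gcld3 imports and classifies text
# (usage per the gcld3 README; not part of the C4 instructions).
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="This is an English sentence.")
print(result.language, result.is_reliable)  # expected: en True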