google-research/text-to-text-transfer-transformer

C4 fails on Google Dataflow

WenqiJiang opened this issue · 1 comment

Describe the bug

I ran into a problem (a worker failure) when reproducing the C4 experiments. I rented an instance on GCP and followed the instructions in the README for running C4 on Google Dataflow. Here, instead of using tfds-nightly, I used tensorflow-datasets because the former caused another problem that I will show later.

Environment setup (I followed the instructions, except that I used tensorflow-datasets):

pip install tensorflow 
pip install google-apitools

pip install tensorflow-datasets 
pip install tensorflow-datasets[c4] 
echo 'tensorflow-datasets[c4]' > /tmp/beam_requirements.txt

# I also installed Rust and gcld3; not shown here (rough commands below)
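For reference, the Rust and gcld3 setup was roughly the following (reconstructed from memory, so the exact package names on Debian 10 may differ; gcld3 needs the protobuf compiler and headers to build from source):

# Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# protobuf compiler and headers, needed to build gcld3 from source
sudo apt-get install -y protobuf-compiler libprotobuf-dev
pip install gcld3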

Running the experiment (following the instructions):

MY_REGION=us-east1
MY_BUCKET=c4_dataset
MY_PROJECT=my_project_name

python3 -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=c4/en \
--data_dir=gs://$MY_BUCKET/tensorflow_datasets \
--beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,region=$MY_REGION"

I also tried the tfds command line, but the result was the same:

tfds build c4/en --data_dir=gs://$MY_BUCKET/tensorflow_datasets --beam_pipeline_options="runner=DataflowRunner,project=$MY_PROJECT,job_name=c4-gen,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,requirements_file=/tmp/beam_requirements.txt,region=$MY_REGION"

Here is what I got. It seems the first worker started, but it failed to make progress and the job stopped due to a timeout after 1 hour.

[screenshot of the Dataflow job]

Here is the log (showing only the bottom portion that may be related to the problem):

INFO[dataflow_runner.py]: 2022-02-22T17:43:19.975Z: JOB_MESSAGE_DEBUG: Value "ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Session" materialized.
INFO[dataflow_runner.py]: 2022-02-22T17:43:20.046Z: JOB_MESSAGE_BASIC: Executing operation create_wet_path_urls/Read+ReadAllFromText/ReadAllFiles/ExpandIntoRanges+ReadAllFromText/ReadAllFiles/Reshard/AddRandomKeys+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/Map(reify_timestamps)+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Reify+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Write
INFO[dataflow_runner.py]: 2022-02-22T17:43:43.898Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently running stage(s).
INFO[dataflow_runner.py]: 2022-02-22T17:44:17.947Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO[dataflow_runner.py]: 2022-02-22T17:44:17.970Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO[transport.py]: Refreshing due to a 401 (attempt 1/2)
INFO[transport.py]: Refreshing due to a 401 (attempt 1/2)
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.001Z: JOB_MESSAGE_ERROR: Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.069Z: JOB_MESSAGE_BASIC: Cancel request is committed for workflow job: 2022-02-22_09_42_58-4859348570047340925.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.097Z: JOB_MESSAGE_BASIC: Finished operation create_wet_path_urls/Read+ReadAllFromText/ReadAllFiles/ExpandIntoRanges+ReadAllFromText/ReadAllFiles/Reshard/AddRandomKeys+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/Map(reify_timestamps)+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Reify+ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Write
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.104Z: JOB_MESSAGE_ERROR: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.555Z: JOB_MESSAGE_DETAILED: Cleaning up.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.628Z: JOB_MESSAGE_DEBUG: Starting worker pool teardown.
INFO[dataflow_runner.py]: 2022-02-22T18:43:20.671Z: JOB_MESSAGE_BASIC: Stopping worker pool...
INFO[dataflow_runner.py]: 2022-02-22T18:45:30.603Z: JOB_MESSAGE_DETAILED: Autoscaling: Resized worker pool from 1 to 0.
INFO[dataflow_runner.py]: 2022-02-22T18:45:30.647Z: JOB_MESSAGE_BASIC: Worker pool stopped.
INFO[dataflow_runner.py]: 2022-02-22T18:45:30.689Z: JOB_MESSAGE_DEBUG: Tearing down pending resources...
INFO[dataflow_runner.py]: Job 2022-02-22_09_42_58-4859348570047340925 is in state JOB_STATE_FAILED
Traceback (most recent call last):
  File "/home/contact_ds3lab/c4-tensorflow-datasets/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/main.py", line 102, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/main.py", line 97, in main
    args.subparser_fn(args)
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/build.py", line 192, in _build_datasets
    _download_and_prepare(args, builder)
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/scripts/cli/build.py", line 345, in _download_and_prepare
    download_config=dl_config,
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 464, in download_and_prepare
    download_config=download_config,
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1197, in _download_and_prepare
    split_info_futures.append(future)
  File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/tensorflow_datasets/core/split_builder.py", line 173, in maybe_beam_pipeline
    self._beam_pipeline.__exit__(None, None, None)
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/apache_beam/pipeline.py", line 597, in __exit__
    self.result.wait_until_finish()
  File "/home/contact_ds3lab/c4-tensorflow-datasets/lib/python3.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1639, in wait_until_finish
    self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
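The error message says to check the worker logs in Stackdriver Logging; for reference, those should be queryable with something like the following (the job ID is taken from the log above, and the filter may need adjusting):

gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="2022-02-22_09_42_58-4859348570047340925" AND severity>=WARNING' \
  --project=$MY_PROJECT --limit=50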

I tried several times, but all attempts failed (after the 1-hour timeout):

[screenshot of the failed Dataflow jobs]

Versions

Python=3.7.3
Pip=22.0.3
Tensorflow=2.8.0
Tensorflow_datasets=4.5.2

Expected behavior
The Dataflow job runs smoothly without failing.

Additional Info when using tfds-nightly

This is what I got by following the instructions:

pip install tfds-nightly[c4] 
echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt

python3 -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=c4/en \
--data_dir=gs://$MY_BUCKET/tensorflow_datasets \
--beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,region=$MY_REGION"

Error logs (complaining about an inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'):


Requested tensorflow-datasets from https://files.pythonhosted.org/packages/2e/6d/71c2cf8a583da35cb93b183330654dce9a57c59fa8622b9aa778059de822/tfds-nightly-0.0.1.dev201811110013.tar.gz#sha256=2e9fdb002da9aa79375c6448839a69cfce66a26f64120312f7fc98f6167f1352 (from -r /tmp/tmpthuohvb2/tmp_requirements.txt (line 1)) has inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'\n  Using cached tfds-nightly-0.0.1.dev201811100013.tar.gz (47 kB)\n  Preparing metadata (setup.py): started\n  Preparing metadata (setup.py): finished with status 'done'\n  WARNING: Generating metadata for package tfds-nightly produced metadata for project name tensorflow-datasets. Fix your #egg=tfds-nightly fragments.\nDiscarding https://files.pythonhosted.org/packages/2a/34/088cf3df80424abcb92227acab7267358ba600b0d8a47ae53aa960384683/tfds-nightly-0.0.1.dev201811100013.tar.gz#sha256=fee1ced83141e36ee52eac245f0ce250b46f68e66a3a02f6093e18b96f274123 (from https://pypi.org/simple/tfds-nightly/): 
Requested tensorflow-datasets from https://files.pythonhosted.org/packages/2a/34/088cf3df80424abcb92227acab7267358ba600b0d8a47ae53aa960384683/tfds-nightly-0.0.1.dev201811100013.tar.gz#sha256=fee1ced83141e36ee52eac245f0ce250b46f68e66a3a02f6093e18b96f274123 (from -r /tmp/tmpthuohvb2/tmp_requirements.txt (line 1)) has inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'\n  Using cached tfds-nightly-0.0.1.dev201811091944.tar.gz (47 kB)\n  Preparing metadata (setup.py): started\n  Preparing metadata (setup.py): finished with status 'done'\n  WARNING: Generating metadata for package tfds-nightly produced metadata for project name tensorflow-datasets. Fix your #egg=tfds-nightly fragments.\nDiscarding https://files.pythonhosted.org/packages/cf/1c/df05d1cc9ca9fc5c0b0b87f33d4be82f1e618d240260dfe5dd60fdf36362/tfds-nightly-0.0.1.dev201811091944.tar.gz#sha256=5a701264bad4e6b38b22f3aca18fbc103e0bef3378b5e1505170e236deb80fa6 (from https://pypi.org/simple/tfds-nightly/):  Requested tensorflow-datasets from https://files.pythonhosted.org/packages/75/a9/08b4119be5f3f611a13107c79656d017e54191c777b000554a648635fcbd/tfds-nightly-0.0.1.dev20181106.tar.gz#sha256=1749bd45c0cde5764786c9217a2d145ab3274543052702fa277d9a5eb188c71d (from -r /tmp/tmpthuohvb2/tmp_requirements.txt (line 1)) has inconsistent name: filename has 'tfds-nightly', but metadata has 'tensorflow-datasets'\nERROR: Could not find a version that satisfies the requirement tfds-nightly[c4] (from versions: 0.0.1.dev20181106, 0.0.1.dev201811070013, 0.0.1.dev201811080013, 0.0.1.dev201811090014, 0.0.1.dev201811091944, 0.0.1.dev201811100013, 

...
skipping a lot of versions here in the  log
...

4.5.2.dev202202150045, 4.5.2.dev202202160044, 4.5.2.dev202202170043, 4.5.2.dev202202180044, 4.5.2.dev202202190044, 4.5.2.dev202202200044, 4.5.2.dev202202210043, 4.5.2.dev202202220043)\nERROR: No matching distribution found for tfds-nightly[c4]\n"
INFO[stager.py]: Executing command: ['/home/contact_ds3lab/c4-tfds-nightly/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/tmp/tmpthuohvb2/tmp_requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']
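As the stager command above shows, Beam downloads the packages from the requirements file with pip download --no-binary :all:, i.e. as source distributions only, and pip ends up falling back to the very old 2018 tfds-nightly sdists whose metadata still says tensorflow-datasets (presumably because recent tfds-nightly releases ship wheels only). One possible workaround, which I have not verified, would be to skip the requirements file for tfds and instead ship a locally built sdist to the workers via the Beam extra_package option (I have not checked whether the tfds CLI forwards that option):

# hypothetical workaround, not verified: build an sdist locally and ship it to workers
git clone https://github.com/tensorflow/datasets.git
(cd datasets && python3 setup.py sdist)
# then add it to --beam_pipeline_options instead of the requirements file, e.g.
#   extra_package=datasets/dist/tensorflow-datasets-<version>.tar.gz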

Problem solved after switching to Ubuntu 20.04. It seems this is related to the OS.

I used Debian 10 to run the experiment. When running pip install tensorflow-datasets, an error appeared saying gcld3 could not be installed, so I installed it manually from source. This might be what caused the failure later on.
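For what it's worth, a quick sanity check of a manually built gcld3 (this is just the usage example from the gcld3 README) would be:

python3 -c "import gcld3; \
d = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000); \
print(d.FindLanguage(text='This text is written in English.').language)"

This should print en if the build works.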