sign-language-processing/datasets

Add download size information to documentation


À la tensorflow/datasets#120, it would be helpful to have an estimate of how large each dataset is before downloading. Ideally there would also be a breakdown by feature.

Currently taking a crack at the following:

  • on a machine that has a very large hard drive, try downloading everything in the example notebook
  • run the builder "size in bytes" function mentioned in the tfds issue linked above (see the sketch after this list).
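
A minimal sketch of the second option (the dataset name is just an example; download_size and dataset_size are only populated once download_and_prepare has run):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the SL datasets with tfds

builder = tfds.builder("aslg_pc12")  # any registered dataset name works here
builder.download_and_prepare()       # sizes are computed during preparation
print(builder.info.download_size)    # human-readable, e.g. "1.92 MiB"
print(builder.info.dataset_size)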

What I'm trying:

  1. clone the tensorflow_datasets repo
  2. activate an environment with sign_language_datasets installed
  3. run the documentation scripts.

Had to pip install pyyaml and pandas, then ran build_catalog.py, which complained about a missing "stable_versions.txt".

That seems to come from https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/freeze_dataset_versions.py

When I run THAT, it writes 5812 dataset versions to a file in my conda env:

/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt
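
(For reference, the installed copy can presumably be run directly as a module, something like:)

python -m tensorflow_datasets.scripts.freeze_dataset_versions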

So of course I want it to also register the sign language datasets, right? I edited the script to import them as well, like so:

from absl import app

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # the added import: registers the SL datasets


def main(_):
  tfds.core.visibility.set_availables([
      tfds.core.visibility.DatasetType.TFDS_PUBLIC,
  ])

  registered_names = tfds.core.load.list_full_names()
  version_path = tfds.core.utils.tfds_write_path() / 'stable_versions.txt'
  version_path.write_text('\n'.join(registered_names))
  print(f'{len(registered_names)} datasets versions written to {version_path}.')


if __name__ == '__main__':
  app.run(main)

When I run it THEN, it writes 5858 dataset versions instead (46 more than before). Opening up stable_versions.txt, I see a few SL datasets, including autsl.

Attached: tfds_stable_versions_no_sl.txt and tfds_stable_versions_sl.txt (the two versions of the .txt file, copied and renamed).

Apparently the comm utility lets you find diffs easily: comm -23 FILE1 FILE2 prints the lines unique to FILE1 (-23 suppresses the lines unique to FILE2 and the lines common to both).

output:

comm -23 tfds_stable_versions_sl.txt tfds_stable_versions_no_sl.txt > sl_stable_versions.txt
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: input is not in sorted order

OK, let's sort then.

List the filenames, pipe to GNU parallel (yes, I will cite it, don't worry), and sort each one into a _sorted.txt file ({.} is parallel's replacement string for the input filename minus its extension):

ls tfds_stable_versions* | parallel sort --output {.}_sorted.txt {}

NOW:

comm -23 tfds_stable_versions_sl_sorted.txt tfds_stable_versions_no_sl_sorted.txt > tfds_stable_versions_sl_only.txt

Which gives us:

asl_citizen/default/1.0.0
aslg_pc12/0.0.1
asl_lex/annotations/2.0.0
asl_lex/default/2.0.0
asl_signs/default/1.0.0
autsl/default/1.0.0
autsl/holistic/1.0.0
autsl/openpose/1.0.0
bsl_corpus/annotations/1.0.0
bsl_corpus/default/1.0.0
chicago_fs_wild/default/2.0.0
dgs_corpus/annotations/3.0.0
dgs_corpus/default/3.0.0
dgs_corpus/holistic/3.0.0
dgs_corpus/openpose/3.0.0
dgs_corpus/sentences/3.0.0
dgs_corpus/videos/3.0.0
dgs_types/annotations/3.0.0
dgs_types/default/3.0.0
dgs_types/holistic/3.0.0
dicta_sign/annotations/1.0.0
dicta_sign/default/1.0.0
dicta_sign/poses/1.0.0
how2_sign/default/1.0.0
mediapi_skel/default/1.0.0
ngt_corpus/annotations/3.0.0
ngt_corpus/default/3.0.0
ngt_corpus/videos/3.0.0
rwth_phoenix2014_t/annotations/3.0.0
rwth_phoenix2014_t/default/3.0.0
rwth_phoenix2014_t/poses/3.0.0
rwth_phoenix2014_t/videos/3.0.0
sem_lex/default/1.0.0
sign2_mint/annotations/1.0.0
sign2_mint/default/1.0.0
sign_bank/default/1.0.0
sign_suisse/default/1.0.0
sign_suisse/holistic/1.0.0
sign_typ/default/1.0.0
sign_wordnet/default/0.2.0
spread_the_sign/default/1.0.0
swojs_glossario/annotations/1.0.0
swojs_glossario/default/1.0.0
wlasl/default/0.3.0
wmtslt/annotations/1.2.0
wmtslt/default/1.2.0

Which, I'm just gonna overwrite the stable_versions.txt with that...

cat tfds_stable_versions_sl_only.txt > /home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt

Sigh:
(screenshot: assertion error)

Offending assertion:
(screenshot)

Note also that it's using the document_datasets.py from site-packages, not the one in the cloned repo:
/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/document_datasets.py
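
(An editable install of the clone, i.e. pip install -e . from the repo root, would presumably make Python pick up the repo copy instead.)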

Just gonna comment that bit out and try again... which gets me:

FileNotFoundError: Error for asl_citizen: [Errno 2] No such file or directory: '/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/tfds_to_pwc_links.json'

Digging into the code, that file is the "mapping between TFDS datasets and PapersWithCode entries."
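
(Presumably an empty mapping would get past that check, something like:)

echo '{}' > /home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/tfds_to_pwc_links.json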

OK, so dataset_markdown_builder has a bunch of sections we don't care about. What if we comment those out?


Still no luck. Getting weird auth-token errors; tried a few datasets.


I give up. This seems like a dead end.

Set up a script to simply loop through the available datasets and tfds.load every builder config; then I can read download and dataset size from the returned ds_info (sketch below).
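
Roughly like this (a minimal sketch, not the exact script; the dataset list is truncated and the error handling is simplified):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the SL datasets with tfds

results = {}
for name in ["autsl", "dicta_sign", "sign_bank"]:  # ...and so on for every dataset
    for config in tfds.builder_cls(name).BUILDER_CONFIGS:
        key = f"{name}/{config.name}"
        try:
            # with_info=True returns (datasets, ds_info); the sizes live on ds_info
            ds, ds_info = tfds.load(name, builder_kwargs={"config": config}, with_info=True)
            results[key] = {"download_size": ds_info.download_size,
                            "dataset_size": ds_info.dataset_size}
        except Exception as e:  # record the failure and keep going
            results[key] = {"download_result": e}
print(results)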

DGS Corpus is the one holdout, because the download process crashes very consistently. Even when passing process_video=False, I have not figured out any way to download the various configs other than "annotations"; I spent two hours trying. And tfds has no method to download only, without preparing.
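
For reference, roughly the shape of what I was passing (a sketch; the config name is made up and the exact flag combinations I tried varied):

from sign_language_datasets.datasets.config import SignDatasetConfig
import tensorflow_datasets as tfds

# keep the videos but skip re-encoding them
config = SignDatasetConfig(name="sizes-only", include_video=True, process_video=False)
dgs = tfds.load("dgs_corpus", builder_kwargs={"config": config})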

Who decided that download_and_prepare was a good idea for a function? Functions should do one thing!

Managed to download many of the datasets and check the sizes, or log the error. (The raw dict keyed each config twice, once by builder class name and once by registered name, with identical values; collapsed here to the registered names.)

Sizes (download_size / dataset_size):

asl_lex/default                 1.92 MiB   / 14.79 MiB
asl_lex/annotations             1.92 MiB   / 14.79 MiB
autsl/default                   22.66 GiB  / 577.97 GiB
autsl/holistic                  13.80 GiB  / 22.40 GiB
autsl/openpose                  1.03 GiB   / 3.35 GiB
dgs_corpus/openpose             46.23 GiB  / 27.56 GiB
dgs_types/annotations           336.11 MiB / 1.72 MiB
dicta_sign/default              2.58 GiB   / 3.34 GiB
dicta_sign/poses                2.18 GiB   / 3.34 GiB
dicta_sign/annotations          7.09 MiB   / 1.15 MiB
ngt_corpus/default              185.58 GiB / 1.43 MiB
ngt_corpus/videos               185.58 GiB / 1.43 MiB
ngt_corpus/annotations          76.40 MiB  / 389.65 KiB
rwth_phoenix2014_t/poses        5.14 GiB   / 7.67 GiB
rwth_phoenix2014_t/annotations  806.71 KiB / 1.90 MiB
sign_bank/default               113.86 MiB / 140.10 MiB
sign_suisse/default             2.77 MiB   / 4.97 MiB
sign_suisse/holistic            33.57 GiB  / 9.96 GiB
swojs_glossario/default         352.28 KiB / 79.99 KiB
swojs_glossario/annotations     352.28 KiB / 79.99 KiB

Errors:

chicago_fs_wild/default         ExtractError while extracting the downloaded ChicagoFSWildPlus .tgz
dgs_corpus/default, videos, holistic, annotations, sentences
                                crashed; logged as 'DGS CORPUS IS GARBAGE' (my placeholder message, see above)
dgs_types/default               DownloadError: HTTP 404 for https://www.sign-lang.uni-hamburg.de/korpusdict/clips/3252569_1.mp4
dgs_types/holistic              TypeError while serializing feature `views/pose/data` (TensorInfo(shape=(None, None, 1, 576, 3), dtype=float32)): 'NoneType' object cannot be interpreted as an integer
how2_sign/default               DownloadError: HTTP 404 for https://drive.usercontent.google.com/download?id=1dYey1F_SeHets-UO8F9cE3VMhRBO-6e0&export=download
rwth_phoenix2014_t/default, videos
                                SSLError: CERTIFICATE_VERIFY_FAILED fetching https://www-i6.informatik.rwth-aachen.de/ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz
sign2_mint/default, annotations JSONDecodeError: Expecting value: line 1 column 1 (char 0)
sign_typ/default                ConnectionError: could not resolve signtyp.uconn.edu
sign_wordnet/default            ImportError: Please install nltk with: pip install nltk
wlasl/default                   Exception('die')