cfoster0/CLAP

Add Dataset Downloading, Info, and Checksums

cfoster0 opened this issue · 12 comments

We want to go for the largest datasets we can for this. They are listed in a Google doc. Not all of them will be downloadable via public links, so we want to provide checksums in this repo so that folks know they're working with the same data once they acquire it. It would also be nice to include the dataset info in the repo.

EDIT: Google doc is linked here.

Sounds good. I haven't heard back from Spotify yet. Mozilla Common Voice checksum:

https://commonvoice.mozilla.org/en/datasets

sha256 checksum: 0f8fdfc4fe715738be94ee49c4fb63d5f1608d2e6a43a2bed80f6cb871171c36
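
If you end up grabbing the archive from a mirror or an older download, a quick sanity check against this digest could look like the following; the filename is just a placeholder for whatever Common Voice archive you actually downloaded:

```python
import hashlib

# Digest posted above; the filename below is a placeholder, not the official one.
EXPECTED = "0f8fdfc4fe715738be94ee49c4fb63d5f1608d2e6a43a2bed80f6cb871171c36"

def sha256sum(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large archives don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256sum("common_voice_en.tar.gz") == EXPECTED, "checksum mismatch"
```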

Some stats on Common Voice English version 6.1.

1,224,864 validated clips, of which 1,224,858 have valid UTF-8 captions; 596,665 unique sentences.
Captions average 52 characters (Python `len`, i.e. Unicode code points) and 52.9 bytes.

Quantiles of caption byte lengths:

| Quantile | Bytes |
| --- | --- |
| 10% | 23 |
| 20% | 32 |
| 30% | 39 |
| 40% | 45 |
| 50% | 52 |
| 60% | 60 |
| 70% | 67 |
| 80% | 74 |
| 90% | 83 |
| 95% | 90 |
| 98% | 97 |

The maximum byte length is 210.

From a quick sample of the audio data, the average clip length is just under 6 seconds.
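
For reference, here is a sketch of how the caption-length numbers above could be reproduced from the corpus's `validated.tsv`; the `sentence` column name follows the Common Voice TSV layout, and the clip-duration figure would still need a separate pass over the audio files:

```python
import csv
import pandas as pd

# Column name assumed from the Common Voice TSV layout; adjust if your release differs.
df = pd.read_csv("validated.tsv", sep="\t", usecols=["sentence"],
                 quoting=csv.QUOTE_NONE)

captions = df["sentence"].astype(str)
chars = captions.map(len)                                 # Python-level (code point) length
nbytes = captions.map(lambda s: len(s.encode("utf-8")))   # UTF-8 byte length

print("validated clips:", len(df), "unique sentences:", captions.nunique())
print("mean chars:", chars.mean(), "mean bytes:", nbytes.mean())
print(nbytes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9, .95, .98]))
print("max bytes:", nbytes.max())
```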

@cfoster0 what's the expected format here? Images of spectrograms?

@afiaka87 Good question. For now, the plan is to preprocess the data in two steps:

  1. Trimmed 15-second .wav files, padded with silence if the original audio clip was shorter, plus a dataframe mapping filenames to their text captions.
  2. Mel spectrograms of the audio saved as .pt files, and an lm_dataformat archive of the captions. (This step may change to TFRecords in the future.) A rough sketch of both steps is below.
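
Not a settled pipeline, but here is a minimal sketch of those two steps with torchaudio; the sample rate, mono mixdown, mel parameters, and output paths are all illustrative assumptions rather than decisions from this thread:

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000            # assumed target rate; not decided here
CLIP_SECONDS = 15
N_SAMPLES = SAMPLE_RATE * CLIP_SECONDS

# Mel parameters are placeholders, not agreed-upon values.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=256, n_mels=80
)

def preprocess(in_path, wav_out, pt_out):
    wav, sr = torchaudio.load(in_path)
    wav = wav.mean(dim=0, keepdim=True)  # mix down to mono
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    # Step 1: trim to 15 s, padding with silence if the clip is shorter.
    wav = wav[:, :N_SAMPLES]
    if wav.shape[1] < N_SAMPLES:
        wav = torch.nn.functional.pad(wav, (0, N_SAMPLES - wav.shape[1]))
    torchaudio.save(wav_out, wav, SAMPLE_RATE)
    # Step 2: mel spectrogram of the trimmed audio saved as a .pt tensor.
    torch.save(mel(wav), pt_out)
```

The caption side of each step (the filename-to-caption dataframe and the lm_dataformat archive) is left out of the sketch.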

Why not just the images? We've got working code for that over in DALLE-pytorch right now, which has loaded well in excess of 5 million image-text pairs without bottlenecking. We could get started a bit faster that way and leave this issue open to implement a more efficient storage solution if the dataloader becomes a bottleneck.

> We want to go for the largest datasets we can for this. They are listed in a Google doc. Not all of them will be downloadable via public links, so we want to provide checksums in this repo so that folks know they're working with the same data once they acquire it. It would also be nice to include the dataset info in the repo.

Can you at least list them without the download links? Or share a link to said Google doc?

For sure. Give me a minute and I'll list them here for starters.

And I don't quite know what you mean by images. Spectrograms aren't really images, even though you can look at them as if they were. For small-scale tests, I don't think the current code will bottleneck us.

> And I don't quite know what you mean by images. Spectrograms aren't really images, even though you can look at them as if they were. For small-scale tests, I don't think the current code will bottleneck us.

Ah my apologies - you're correct, there's no reason to store them as a visual representation.

Largest English speech datasets:

* Spotify Podcasts Dataset https://podcastsdataset.byspotify.com/

* MLS http://www.openslr.org/94/

* Common Voice https://commonvoice.mozilla.org/en

* SPGISpeech https://datasets.kensho.com/datasets/spgispeech

The code within this repo should be agnostic to language and to speech vs. non-speech audio. For a larger list of datasets, see the Google doc here.

Fantastic, that's quite the list! Thanks!