cfoster0/CLAP

Add Dataset Downloading, Info, and Checksums

cfoster0 opened this issue · 12 comments

We want to go for the largest datasets we can for this. They are listed in a Google doc. Not all of them will be downloadable via public links, so we want to provide checksums in this repo so that folks know they're working with the same data once they acquire it. It would also be nice to include the dataset info in the repo.

EDIT: Google doc is linked here.

Sounds good. I haven't heard back from Spotify yet. Mozilla Common Voice checksum:

https://commonvoice.mozilla.org/en/datasets

sha256 checksum: 0f8fdfc4fe715738be94ee49c4fb63d5f1608d2e6a43a2bed80f6cb871171c36
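
If you end up grabbing the archive from a mirror or an older download, a quick sanity check against this digest could look like the following; the filename is just a placeholder for whatever Common Voice archive you actually downloaded:

```python
import hashlib

# Digest posted above; the filename below is a placeholder, not the official one.
EXPECTED = "0f8fdfc4fe715738be94ee49c4fb63d5f1608d2e6a43a2bed80f6cb871171c36"

def sha256sum(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large archives don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256sum("common_voice_en.tar.gz") == EXPECTED, "checksum mismatch"
```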

Some stats on Common Voice English version 6.1.

1,224,864 validated clips, of which 1,224,858 have valid UTF-8 captions; 596,665 unique sentences.
Captions average 52 characters (Python `len`, i.e. Unicode code points) and 52.9 bytes.

Quantiles of caption byte lengths:

| Quantile | Bytes |
| --- | --- |
| 10% | 23 |
| 20% | 32 |
| 30% | 39 |
| 40% | 45 |
| 50% | 52 |
| 60% | 60 |
| 70% | 67 |
| 80% | 74 |
| 90% | 83 |
| 95% | 90 |
| 98% | 97 |

The maximum byte length is 210.

From a quick sample of the audio data, the average clip length is just under 6 seconds.
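
For reference, here is a sketch of how the caption-length numbers above could be reproduced from the corpus's `validated.tsv`; the `sentence` column name follows the Common Voice TSV layout, and the clip-duration figure would still need a separate pass over the audio files:

```python
import csv
import pandas as pd

# Column name assumed from the Common Voice TSV layout; adjust if your release differs.
df = pd.read_csv("validated.tsv", sep="\t", usecols=["sentence"],
                 quoting=csv.QUOTE_NONE)

captions = df["sentence"].astype(str)
chars = captions.map(len)                                 # Python-level (code point) length
nbytes = captions.map(lambda s: len(s.encode("utf-8")))   # UTF-8 byte length

print("validated clips:", len(df), "unique sentences:", captions.nunique())
print("mean chars:", chars.mean(), "mean bytes:", nbytes.mean())
print(nbytes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9, .95, .98]))
print("max bytes:", nbytes.max())
```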

@cfoster0 what's the expected format here? Images of spectrograms?

@afiaka87 Good question. For now, the plan is to preprocess the data in two steps:

  1. Trimmed 15-second .wav files, padded with silence if the original audio clip was shorter, plus a dataframe mapping filenames to their text captions.
  2. Mel spectrograms of the audio saved as .pt files, and an lm_dataformat archive of the captions. (This step may change to TFRecords in the future.) A rough sketch of both steps is below.
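
Not a settled pipeline, but here is a minimal sketch of those two steps with torchaudio; the sample rate, mono mixdown, mel parameters, and output paths are all illustrative assumptions rather than decisions from this thread:

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000            # assumed target rate; not decided here
CLIP_SECONDS = 15
N_SAMPLES = SAMPLE_RATE * CLIP_SECONDS

# Mel parameters are placeholders, not agreed-upon values.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=256, n_mels=80
)

def preprocess(in_path, wav_out, pt_out):
    wav, sr = torchaudio.load(in_path)
    wav = wav.mean(dim=0, keepdim=True)  # mix down to mono
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    # Step 1: trim to 15 s, padding with silence if the clip is shorter.
    wav = wav[:, :N_SAMPLES]
    if wav.shape[1] < N_SAMPLES:
        wav = torch.nn.functional.pad(wav, (0, N_SAMPLES - wav.shape[1]))
    torchaudio.save(wav_out, wav, SAMPLE_RATE)
    # Step 2: mel spectrogram of the trimmed audio saved as a .pt tensor.
    torch.save(mel(wav), pt_out)
```

The caption side of each step (the filename-to-caption dataframe and the lm_dataformat archive) is left out of the sketch.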

Why not just the images? We've got working code for that over in DALLE-pytorch right now, which has loaded well in excess of 5 million image-text pairs without bottlenecking. We could get started a bit faster that way and leave this issue open to implement a more efficient storage solution if the dataloader becomes a bottleneck.

> We want to go for the largest datasets we can for this. They are listed in a Google doc. Not all of them will be downloadable via public links, so we want to provide checksums in this repo so that folks know they're working with the same data once they acquire it. It would also be nice to include the dataset info in the repo.

Can you at least list them without the download links? Or share a link to said Google doc?

For sure. Give me a minute and I'll list them here for starters.

And I don't quite know what you mean by images. Spectrograms aren't really images, even though you can look at them as if they were. For small-scale tests, I don't think the current code will bottleneck us.

> And I don't quite know what you mean by images. Spectrograms aren't really images, even though you can look at them as if they were. For small-scale tests, I don't think the current code will bottleneck us.

Ah my apologies - you're correct, there's no reason to store them as a visual representation.

Largest English speech datasets:

* Spotify Podcasts Dataset https://podcastsdataset.byspotify.com/

* MLS http://www.openslr.org/94/

* Common Voice https://commonvoice.mozilla.org/en

* SPGISpeech https://datasets.kensho.com/datasets/spgispeech

The code within this repo should be agnostic to language and to speech vs. non-speech audio. For a larger list of datasets, see the Google doc here.

Fantastic, that's quite the list! Thanks!