DataImporters

A collection of data importers and clean-up scripts for various sources


Audio Data Importers

A collection of data importers for various audio sources. A loose manual data pipeline.

Install

pip install dataimporters

Downloading Audio Sources

The audio sources have to be provided manually (for now).
The scripts expect a data directory containing the audio folders:

root
 |- data/
      |- original/ (where you have to place the soundbanks)
      |- intermediate/ (generated)
      |- dataset/ (generated)
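
As a convenience, the expected layout can be created up front. Here is a minimal sketch; the folder names are taken from the tree above, and original must then be filled by hand (the generated folders would otherwise be created by the scripts):

import os

# Create the folder layout the scripts expect; "original" must be
# populated manually with the soundbanks, the others are generated.
for sub in ("original", "intermediate", "dataset"):
    os.makedirs(os.path.join("data", sub), exist_ok=True)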

Contributing

We use nbdev, which compiles all notebooks into a package. The source is in the nbs folder.

How to Use

To create a new dataset package, we simply:

  1. Define and process all sources,
  2. Import the Dataset,
  3. Give it the sources we'd like to include and the path to our data,
  4. Call Dataset.compile.

This will process all sources and build a final dataset.zip file.

The library is flexible, but here's the simplest and most common action we perform:

For annotations, see the nbs/12_review.ipynb notebook.

#hide_output
from DataImporters.core import load_version

DATA_DIR = "data/"
VERSION = load_version()
VERSION
18
from DataImporters.sources.core import process
from DataImporters.sources.space_divers_mini import SpaceDiversMini
from DataImporters.sources.footsteps_one_ppsfx import FootstepsOnePpsfx
from DataImporters.sources.footsteps_two_ppsfx import FootstepsTwoPpsfx
from DataImporters.sources.edward import Edward
from DataImporters.sources.barefoot_metal_sonniss import BarefootMetalSonniss
from DataImporters.sources.custom_fsd import CustomFsd

all_sources = [
    SpaceDiversMini(),
    FootstepsOnePpsfx(),
    FootstepsTwoPpsfx(),
    Edward(),
    BarefootMetalSonniss(),
    CustomFsd()
]

for source in all_sources:
    process(source, DATA_DIR, VERSION)
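
As a quick sanity check (not part of the library's API), you can peek at the intermediate folder after processing to confirm each source produced output; this assumes the folder layout shown earlier:

import os

# List what the sources wrote into the intermediate folder.
print(os.listdir(os.path.join(DATA_DIR, "intermediate")))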

Below are two examples: one creates a large dataset with the automatic processors, the other a smaller, more balanced dataset that is manually annotated.
Choose one to run, then jump to Verify Output.

Larger Dataset

from DataImporters.dataset import Dataset, DatasetPaths

DATASET_NAME = "large"

# Same as `all_sources` excluding SpaceDiversMini
sources = [
    FootstepsOnePpsfx(),
    FootstepsTwoPpsfx(),
    Edward(),
    BarefootMetalSonniss(),
    CustomFsd()
]

paths = DatasetPaths(DATA_DIR, DATASET_NAME)
metadata = Dataset(sources, paths).compile()
metadata.shape[0]
Warning: 206 duplicate rows found. Some rows were dropped (all files copied).

1646

Smaller and Annotated

import os
from DataImporters.dataset import Dataset, DatasetPaths

DATASET_NAME = "small_balanced"
ANNOTATION_PATH = os.path.join(DATA_DIR, "annotations", DATASET_NAME + ".csv")

sources = [
    CustomFsd()
]

paths = DatasetPaths(DATA_DIR, DATASET_NAME, ANNOTATION_PATH)
metadata = Dataset(sources, paths).compile()
metadata.shape[0]
Warning: 207 duplicate rows found. Some rows were dropped (all files copied).

284

Verify Output

Dataset.compile will return the newly created metadata (which has already been saved under DATA_DIR).

We can use it to confirm that all files were indeed copied. Since the metadata aggregates all the source metadata, a file that failed to copy will still appear in the metadata, so the counts will not match.
Conversely, this also lets us know when a file has been deleted from a source but still exists in the dataset folder.

import os
assert len(os.listdir(paths.audio_output_path)) == len(metadata)

If everything looks good, we bump the version.

#hide_output

from DataImporters.core import bump_version
bump_version()

If the assertion fails, this could be due to:

  • Genuine failure to copy
  • Some files in the target folder need deleting
    • Please delete them manually; there is no code for this yet
  • Hash conflict (same content from different sources)
    • In this case, we must debug the sources and make sure there are no duplicates
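
To narrow down which of these happened, a minimal sketch that compares the metadata against the files on disk can help. It assumes metadata is a pandas DataFrame with a filename column, as in the example further below:

# Compare the compiled metadata against the audio files actually on disk.
disk_files = set(os.listdir(paths.audio_output_path))
meta_files = set(metadata["filename"])

print("In metadata but missing on disk:", meta_files - disk_files)
print("On disk but not in metadata:", disk_files - meta_files)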

Dataset Structure

dataset/
  |- README.md
  |- metadata.csv
  |- audio/
       |- Long list of audio files; filenames are the xxhash64 of the file content.
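
If you need to verify or regenerate a filename, here is a minimal sketch using the xxhash package, assuming the filename is the plain xxh64 hex digest of the raw bytes (no seed); the path below is a placeholder:

import xxhash

def content_hash(path):
    # Hash the raw file bytes; the hex digest should match the filename stem.
    with open(path, "rb") as f:
        return xxhash.xxh64(f.read()).hexdigest()

content_hash("path/to/some_file.wav")  # should equal the file's name (without extension)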

Metadata

metadata.csv contains a list of all the files in the dataset and their labels.

filename: File name; all files are assumed to be inside the audio folder
category: Single major category name
label: Escaped (“”), comma-separated list of labels, in snake_case
extra: Extra text/details available for this row (unstructured)
source: Name of the original sound library, in snake_case
version: Version of the last change. Limited to the last change only

version is a simple incremental integer. If you need to check whether a file was changed or added, simply check whether the row's version is higher than the last version you ran. Deletes are not supported yet.
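
For example, here is a minimal sketch of an incremental consumer, assuming metadata.csv is loaded into a pandas DataFrame; LAST_SEEN_VERSION is a hypothetical value you would persist between runs, and the path is assumed from the structure above:

import pandas as pd

metadata = pd.read_csv("data/dataset/metadata.csv")  # path assumed from the structure above

LAST_SEEN_VERSION = 17  # hypothetical: the version you processed last time
changed_or_added = metadata[metadata["version"] > LAST_SEEN_VERSION]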

Here's an example from the sample code run earlier:

filename category label extra source version
0 7620671de38cc6d1.wav Wood Creaky,Door,Close,Wooden,Squeaky,Squeaking,Woo... NaN custom_fsd 18
3 12193cedf99e9427.wav Wood Knock,Wood,Knocking,Knock NaN custom_fsd 18

Pipeline

flowchart TD
    sa[(Source A)] --> pa([Normalise data and create CSV]);
    pa --> ia[(Intermediate A)];
    sb[(Source B)] --> pb([Normalise data and create CSV]);
    pb --> ib[(Intermediate B)];
    ia & ib & a(WIP: Manual annotations by hash) --> c([Compile])
    c-- Some rows can be rejected at this stage --> d[(Dataset)];

Loading

Each loader outputs:

  • a CSV, which is then compiled into a single metadata.csv
  • the audio files, copied into an intermediate folder

The process above is done so that:

  • Each source is independent
  • We can easily compile a final dataset with different sources
  • The split is easier to keep consistent across runs
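
On the last point, content-hashed filenames make a deterministic split straightforward. The sketch below only illustrates the idea and is not necessarily how the library splits data:

# Assign a file to train/valid purely from its (already hashed) filename,
# so the assignment never changes between runs or machines.
def split_of(filename, valid_fraction=0.2):
    bucket = int(filename.split(".")[0], 16) % 100
    return "valid" if bucket < valid_fraction * 100 else "train"

split_of("7620671de38cc6d1.wav")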