Generate soundscapes from images.
To run the project, make sure that Docker is correctly installed on the machine. If it is not already installed, follow these instructions: Docker installation
The project uses Docker and Docker-Compose to provide easy-to-use prototypes. If Docker-Compose is not already installed on the machine, follow these instructions: Docker-Compose installation
The sound generation module was developed using Scaper. Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single probabilistically defined specification.
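To illustrate what such a probabilistic specification looks like, here is a minimal Scaper sketch; the soundbank layout, durations, and distributions are assumptions for illustration and are not taken from this project.

```python
import scaper

# Assumed soundbank layout (not the project's actual one):
# one subfolder per label under foreground/ and background/.
FG_PATH = 'soundbank/foreground'
BG_PATH = 'soundbank/background'

# A 10-second soundscape specification.
sc = scaper.Scaper(duration=10.0, fg_path=FG_PATH, bg_path=BG_PATH)
sc.ref_db = -20  # reference loudness used for SNR computations

# Background: pick any file from any background label.
sc.add_background(label=('choose', []),
                  source_file=('choose', []),
                  source_time=('const', 0))

# Foreground event: label, timing, and SNR are all distributions,
# so every call to generate() samples a new concrete soundscape.
sc.add_event(label=('choose', []),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 8),
             event_duration=('truncnorm', 2.0, 1.0, 0.5, 4.0),
             snr=('normal', 10, 3),
             pitch_shift=None,
             time_stretch=None)

# Render the audio plus a JAMS annotation describing what was sampled.
sc.generate('soundscape.wav', 'soundscape.jams')
```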
Follow the instructions given in the following link:
The project can be cloned by running the following commands. The second command retrieves the contents of all submodules in the project (e.g. the soundbank), and the third installs the Python dependencies.
git clone https://github.com/hslu-abiz/soundscape-generation.git
git submodule update --init --recursive
pip install -r requirements.txt
To download the dataset, a Cityscapes account is required for authentication; such an account can be created on www.cityscapes-dataset.com. After registering, run the download_data.sh script. During the download, it will ask for your email and password for authentication.
./scripts/download_data.sh
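For readers curious what the script does, authenticated Cityscapes downloads typically follow a login-then-fetch pattern over a cookie session. The sketch below is a hypothetical Python equivalent; the endpoint, form fields, package ID, and output filename are assumptions, not the script's actual contents.

```python
import getpass
import requests

# Hypothetical re-implementation of a login-then-download flow;
# the endpoint names and form fields are assumptions about the site.
LOGIN_URL = 'https://www.cityscapes-dataset.com/login/'
FILE_URL = 'https://www.cityscapes-dataset.com/file-handling/?packageID=1'

session = requests.Session()
session.post(LOGIN_URL, data={
    'username': input('Cityscapes email: '),
    'password': getpass.getpass('Password: '),
    'submit': 'Login',
})

# The session cookie set at login authorizes the actual download.
with session.get(FILE_URL, stream=True) as resp:
    resp.raise_for_status()
    with open('cityscapes_package.zip', 'wb') as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```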
For the object detection module, a pre-trained ERFNet is used, which is then fine-tuned on the Cityscapes dataset.
To train the network, run the following command. The epoch and batch-size hyperparameters can be configured in the docker-compose.yml file. To load a pre-trained model, specify its path in the MODEL_TO_LOAD variable; if the variable is None, the model is trained from scratch.
docker-compose up train_object_detection
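Inside the container, the training entrypoint presumably branches on MODEL_TO_LOAD. The sketch below illustrates one plausible interpretation; apart from MODEL_TO_LOAD and the epoch/batch-size hyperparameters named above, every name in it is hypothetical, and the stubs stand in for the repository's actual code.

```python
import os

# Hypothetical entrypoint sketch: only MODEL_TO_LOAD and the epoch and
# batch-size hyperparameters come from the README; the variable names
# and the training function are stand-ins, not the repository's code.
EPOCHS = int(os.environ.get('EPOCHS', '70'))
BATCH_SIZE = int(os.environ.get('BATCH_SIZE', '8'))
MODEL_TO_LOAD = os.environ.get('MODEL_TO_LOAD', 'None')

def train(checkpoint_path=None, epochs=EPOCHS, batch_size=BATCH_SIZE):
    """Stub for the actual training loop: load weights from
    checkpoint_path if given, otherwise initialize from scratch."""

if MODEL_TO_LOAD == 'None':
    train()                               # train from scratch
else:
    train(checkpoint_path=MODEL_TO_LOAD)  # fine-tune from a checkpoint
```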
Run the following command to predict the semantic segmentation of every image in the --test_images directory (note: predictions are saved under the same name with a _pred.jpg suffix). Ensure that you specify the correct image file type in --test_images_type.
docker-compose up predict_object_detection
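The naming convention above implies a simple mapping from input to output paths, sketched here; the example path is made up.

```python
from pathlib import Path

def prediction_path(image_path: str) -> Path:
    """Map an input image to its prediction file, following the
    convention above: same directory, same stem, _pred.jpg suffix."""
    p = Path(image_path)
    return p.with_name(p.stem + '_pred.jpg')

# e.g. data/test_images/munich_01.png -> data/test_images/munich_01_pred.jpg
print(prediction_path('data/test_images/munich_01.png'))
```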
To evaluate the segmentation network, run the command below.
docker-compose up evaluation
To generate soundscapes for every image in the --test_images directory, run the following command. The generated audio files will be saved in data/soundscapes. Ensure that you specify the correct image file type in --test_images_type.
docker-compose up sound_generation
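Conceptually, this step links the two modules: classes detected in the segmentation mask determine which sound events enter the Scaper specification. The sketch below illustrates that idea only; the class-to-label mapping and the area threshold are invented, not the repository's actual logic.

```python
import numpy as np

# Invented mapping from Cityscapes train IDs to soundbank labels.
CLASS_TO_SOUND = {11: 'person', 13: 'car', 17: 'motorcycle'}

def detected_sound_labels(mask: np.ndarray, min_area: float = 0.01):
    """Return a soundbank label for every class that covers at least
    min_area of the predicted segmentation mask (threshold invented)."""
    return [label for class_id, label in CLASS_TO_SOUND.items()
            if (mask == class_id).mean() >= min_area]

# Each returned label would then drive one sc.add_event(...) call in a
# Scaper specification like the one sketched earlier.
```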
The predictions above are produced by a network trained for 70 epochs on a single Tesla P100 GPU; after training, the checkpoint with the highest validation IoU was selected, namely epoch 67, which achieves a mean class IoU of 0.7084 on the validation set. Inference takes around 0.2 seconds per image on a Tesla P100. The progression of the IoU metric during training is shown below.
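For reference, the mean class IoU reported above averages per-class intersection over union; here is a minimal NumPy sketch computing it from a confusion matrix.

```python
import numpy as np

def mean_class_iou(conf: np.ndarray) -> float:
    """Mean class IoU from a (num_classes x num_classes) confusion
    matrix where conf[t, p] counts pixels of true class t predicted
    as class p: IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as c but actually other
    fn = conf.sum(axis=1) - tp   # actually c but predicted as other
    # Guard against empty classes; a class absent from both prediction
    # and ground truth contributes an IoU of 0 in this simple sketch.
    iou = tp / np.maximum(tp + fp + fn, 1)
    return float(iou.mean())
```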
- J. Salamon, D. MacConnell, M. Cartwright, P. Li and J. P. Bello, "Scaper: A library for soundscape synthesis and augmentation," 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 344-348, DOI: 10.1109/WASPAA.2017.8170052.
- E. Romera, J. M. Álvarez, L. M. Bergasa and R. Arroyo, "ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263-272, 2018, DOI: 10.1109/TITS.2017.2750080.
- Official PyTorch implementation of ERFNet