justinsalamon/scaper

Speeding up Scaper with in-memory operations

pseeth opened this issue · 9 comments

I've done some investigation into speeding up Scaper quite a bit. The key idea is to make everything happen in-memory and most importantly reduce the number of exec calls. I'm writing out the roadmap that we came up with offline here. The milestones we should hit to make this happen are:

  • Write a profiling script that reliably tells us the speed of Scaper right now. This shouldn't touch any existing code but should be a good benchmarking tool that we can use to report progress on this issue. (PR: #106)
  • We have soundfile now, so let's use soundfile.info instead of sox.file_info. This avoids a costly exec call. (PR #110)
  • The current pysox master now can accept and return numpy arrays when doing tfm.build. This API may change slightly, however. See this issue and the associated PR. By updating to this, we will be able to do everything in-memory, but will still have exec calls. (PR #111)
  • Once we use soundfile to load the audio files, we can get rid of sox.Combiner and should do all the summing of mixtures in-memory. (PR #111)
  • Since we have soundfile which can read from an offset and a duration, we should use that instead of making calls to tfm.trim, which is not as efficient. tfm.trim decodes the entire file into memory, then trims it, which can be costly for large files, I believe. The same goes for tfm.fade which is slow compared to just multiplying in-memory with a quarter-sine ramp. (PR #117)
  • I have been working on making Python bindings for Sox with the goal of it being a drop-in replacement for PySox's transformer API. The project is soxbindings. In it, I subclass sox.Transformer to have the same API with respect to Scaper's use case. The bindings are generated via pybind11 and avoid all exec calls. The change at this point should be as simple as import sox -> import soxbindings as sox, after making soxbindings pip-installable with cross-platform wheels. At the moment soxbindings does not work on Windows, but we can make it an optional import in Scaper: if it's installed, use it, otherwise fall back to sox.
  • We decided that tfm.trim replacement should happen later, as it changes the regression data.
  • Now we make use of pyloudnorm, which gets around another exec call that's in the current LUFS calculation, which uses ffmpeg. This breaks the regression data as pyloudnorm and ffmpeg LUFS don't match exactly (they're close enough). Everything up to this point should match the current regression data and pass tests. After this, we need to change the regression data.
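
To illustrate the in-memory summing step (replacing sox.Combiner), here's a minimal sketch. `mix_events` is an illustrative helper, not part of Scaper's API; it assumes all event buffers have already been trimmed/padded to a common length, which holds once everything is done at the sample level:

```python
import numpy as np

def mix_events(event_arrays, weights=None):
    """Sum per-event audio buffers into one soundscape, in-memory.

    event_arrays: list of arrays shaped (n_samples, n_channels),
    all the same shape. weights: optional per-event gains.
    """
    if weights is None:
        weights = [1.0] * len(event_arrays)
    stacked = np.stack(event_arrays)                 # (n_events, n_samples, n_channels)
    gains = np.asarray(weights).reshape(-1, 1, 1)    # broadcast one gain per event
    return (stacked * gains).sum(axis=0)             # (n_samples, n_channels)
```

Because the sum happens on numpy arrays rather than through a subprocess, there are no exec calls and no temp files for the mixing step.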

This issue is related to #76 and #3.

Update: the Scaper.generate returns milestone has been moved to a separate issue: #114.

  • We can now have the Scaper.generate function return numpy arrays. This is useful for on-the-fly mixing. At this stage, we should also make it so that the Scaper object doesn't have to save jams and audio to disk but can instead just return them. Possibly this could be done via jams_file=None and audio_file=None when passing it into Scaper as arguments.

Looking ahead, @justinsalamon, can you weigh in on API design for this bullet point when you have a second?

We can now have the Scaper.generate function return numpy arrays. This is useful for on-the-fly mixing. At this stage, we should also make it so that the Scaper object doesn't have to save jams and audio to disk but can instead just return them. Possibly this could be done via jams_file=None and audio_file=None when passing it into Scaper as arguments.

@pseeth good call, let's hash this out.

I agree that we should change the default behavior to:

  • return audio and annotation in memory
  • don't save to disk

How about:

  • We make audio_path and jams_path optional, and set to None by default. If a path is provided, it's used to save to disk. This will make these identical to txt_path, and we should probably bunch them together, i.e. place txt_path immediately after jams_path in the function call, and re-order the docstrings accordingly.
  • no_audio will control whether audio is generated or not (overall), so if it's set to True we (1) return None in memory for the audio array and (2) do not save audio to disk, even if a path is specified.
  • We'll have to update docs everywhere to reflect this API change.
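
A hypothetical sketch of how the proposal above could look; the signature and semantics follow the bullets in this comment, not any released Scaper API, and the bodies are placeholders:

```python
import numpy as np

def generate(audio_path=None, jams_path=None, txt_path=None, no_audio=False):
    """Return results in memory; write to disk only when a path is given."""
    # Placeholder soundscape / annotation, standing in for the real synthesis.
    audio = None if no_audio else np.zeros((44100, 1))
    jam = {"annotations": []}
    if audio_path is not None and audio is not None:
        pass  # e.g. soundfile.write(audio_path, audio, sr)
    if jams_path is not None:
        pass  # e.g. jam.save(jams_path)
    return audio, jam
```

With this shape, no_audio=True returns None for the array and suppresses any audio write even if audio_path is set, matching bullet two.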

This will break the API, so again we need to determine if we're bumping to 1.4.0 or 2.0.0.

Awesome, thanks. That all sounds good to me.

What should the return type of generate be when save_isolated_events is set to True? Or should we always return like this:

# types: (np.ndarray, list of np.ndarray, JAMS)
return soundscape_audio_array, event_audio_arrays, jam

This set of changes may result in some deprecated functions in Scaper. I wanted to list them here as well as why they are being deprecated for us to keep in mind:

  • match_sample_length: We will not need this anymore, as the length of different events in the soundscape will match exactly, thanks to doing everything in-memory at the sample level.
  • get_integrated_lufs: The behavior of this function will change, as we will switch it to using pyloudnorm rather than ffmpeg. It currently takes a file, but in the future will take a numpy array.
  • r128stats: Related to get_integrated_lufs, this makes a call to ffmpeg on the command line, then interprets the stdout output. Won't be needed after pyloudnorm.
  • _close_temp_files: This was a useful context manager when we needed temporary files everywhere. However, we may not need temporary files in the near future! We don't need it in _generate_audio, but it is also called in trim and slice. We can likely remove it there as well, opting for build_array + soundfile operations.

Regarding the return API: it should be the same regardless of how the options are set.

Focusing on the most common use case, I'd start with:

  • soundscape_audio, soundscape_jam

But we also have the text file... I wonder if we should return the list that's used to create it. Then we'd return:

  • soundscape_audio, soundscape_jam, soundscape_list

Only then would I return isolated events, so now we get:

  • soundscape_audio, soundscape_jam, soundscape_list, event_audio_list

thoughts @pseeth ?

Per offline discussion, let's shift the return API discussion to a separate issue and a separate PR.

So replacing tfm.fade operations with the following snippet works great:

scaper/scaper/core.py

Lines 1820 to 1828 in dfb50f5

# Apply short fade in and out
# (avoid unnatural sound onsets/offsets)
fade_in_samples = int(self.fade_in_len * self.sr)
fade_out_samples = int(self.fade_out_len * self.sr)
fade_in_window = np.sin(np.linspace(0, np.pi / 2, fade_in_samples))
fade_out_window = np.sin(np.linspace(np.pi / 2, 0, fade_out_samples))
event_audio[:fade_in_samples] *= fade_in_window
event_audio[-fade_out_samples:] *= fade_out_window
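
The ramp above can be exercised standalone on a dummy signal (the sample rate and 10 ms fade lengths here are illustrative, not Scaper defaults):

```python
import numpy as np

sr = 44100
fade_in_len = fade_out_len = 0.01      # 10 ms fades, illustrative values
event_audio = np.ones(sr)              # dummy 1-second event

# Quarter-sine ramps, same construction as the snippet above
fade_in_samples = int(fade_in_len * sr)
fade_out_samples = int(fade_out_len * sr)
fade_in_window = np.sin(np.linspace(0, np.pi / 2, fade_in_samples))
fade_out_window = np.sin(np.linspace(np.pi / 2, 0, fade_out_samples))
event_audio[:fade_in_samples] *= fade_in_window
event_audio[-fade_out_samples:] *= fade_out_window
```

The signal now starts and ends at exactly zero while the middle is untouched, which is all tfm.fade was doing, minus the exec call.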

However, when trying to replace tfm.trim operations in a similar way by removing:

scaper/scaper/core.py

Lines 1781 to 1783 in dfb50f5

tfm.trim(e.value['source_time'],
e.value['source_time'] +
e.value['event_duration'])

and replacing it with (using SoundFile to read from disk at specified start and stop according to the EventSpec):

# synthesize edited foreground sound event
event_sr = soundfile.info(e.value['source_file']).samplerate
start_sample = int(event_sr * e.value['source_time'])
stop_sample = start_sample + int(event_sr * e.value['source_duration'])

event_audio, event_sr = soundfile.read(
  e.value['source_file'], always_2d=True,
  start=start_sample, stop=stop_sample)

results in test failures only when resampling is needed. This is because sox.Transformer resamples the entire audio file first, then trims it. In the workflow above, where the trim is done when the audio is loaded in, the audio is trimmed first then resampled. These operations can't be swapped (resample -> trim != trim -> resample).

However, for long source audio files, it'd be better to trim before resampling. Otherwise, Scaper would incur the cost of resampling the entire audio file first before trimming it every time it accessed that audio in a soundscape. Making this change would require updating the regression data so I suggest we leave it till the end.
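
The non-commutativity is easy to demonstrate with scipy (the rates, trim window, and use of resample_poly here are illustrative; sox uses its own resampler):

```python
import numpy as np
from scipy.signal import resample_poly

rng = np.random.default_rng(0)
sr_in, sr_out = 8000, 16000            # 2x upsampling
x = rng.standard_normal(sr_in)         # 1 s of noise at 8 kHz
start, stop = 2000, 4000               # trim window, in input samples

# sox-style: resample the whole signal, then trim (indices scale by 2)
a = resample_poly(x, sr_out, sr_in)[2 * start:2 * stop]
# soundfile-style: trim on load, then resample
b = resample_poly(x[start:stop], sr_out, sr_in)

# Same length, but samples near the trim boundaries differ because the
# resampling filter sees different context on each side of the cut.
```

The outputs agree away from the edges but not at them, which is exactly the kind of difference that shows up as a regression-test failure.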

Sounds good, yes, let's avoid anything that touches regression data for now, everything else merged, and then in a separate PR add modifications that touch regression data.

All of this happened!