slp-rl/aero

Logic for determining number of examples is off if segment and stride numbers differ


You'll get torchaudio loading errors if the segment and stride numbers don't match. This is because the number of examples is being calculated incorrectly here:

aero/src/data/audio.py

Lines 29 to 32 in a76210b

elif pad:
    examples = int(math.ceil((file_length - self.length) / self.stride) + 1)
else:
    examples = (file_length - self.length) // self.stride + 1

self.length represents the number of samples retrieved per segment, while self.stride is the number of samples between the start of one segment and the start of the next (the values are specified in seconds in the experiment YAML file, but here they are a number of samples).
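To make the two quantities concrete, here is a minimal sketch of how seconds map to samples and where a given segment would start and end. The helper names (`seconds_to_samples`, `segment_bounds`) and the fixed 44.1 kHz rate are assumptions for illustration, not code from the repo.

```python
SAMPLE_RATE = 44100  # Hz; assumed here purely for illustration


def seconds_to_samples(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in seconds to a whole number of samples."""
    return int(seconds * sample_rate)


def segment_bounds(index: int, length: int, stride: int) -> tuple[int, int]:
    """Return (start, end) sample offsets of segment `index`.

    `length` is the number of samples per segment and `stride` is the
    distance in samples between consecutive segment starts.
    """
    start = index * stride
    return start, start + length


length = seconds_to_samples(2)   # segment size of 2 seconds
stride = seconds_to_samples(3)   # stride of 3 seconds
print(length, stride)                     # 88200 132300
print(segment_bounds(1, length, stride))  # (132300, 220500)
```

With a stride larger than the segment length, consecutive segments leave a gap between them (here, samples 88200 to 132299 are never read).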

Just to give an example, suppose I have a 44.1 kHz file with 3552634 samples (approximately 80.6 seconds). With a stride of 3 seconds (132300 samples) and a segment size of 2 seconds (88200 samples), we would calculate 28 examples with padding and 27 without (I think it's technically one less in each case because of how the loop indexes, but you get the idea). With padding, the last example starts outside the sound file: (28-1)*132300=3572100, which is exactly 81 seconds in. The non-padding scenario actually should have enough room to work with in this case (the last example ends at sample 3528000), but there could be some cases where it fails (perhaps on a sound file with length exactly the same as what's calculated). Using the larger of the two numbers for all calculations should prevent any loading errors, but could leave some potential examples unused.
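The arithmetic above can be reproduced with a short standalone sketch. `count_examples` mirrors the two formulas quoted from audio.py. `last_segment` is a hypothetical helper that assumes example `i` starts at sample `i * stride`, which is my reading of the loader and not something I have confirmed against the loop.

```python
import math


def count_examples(file_length: int, length: int, stride: int, pad: bool) -> int:
    """Mirror of the two formulas quoted above (assumes file_length >= length)."""
    if pad:
        return int(math.ceil((file_length - length) / stride) + 1)
    return (file_length - length) // stride + 1


def last_segment(file_length: int, length: int, stride: int, pad: bool):
    """Return (count, start, end) for the last example, assuming start = index * stride."""
    count = count_examples(file_length, length, stride, pad)
    start = (count - 1) * stride
    return count, start, start + length


file_length = 3552634  # samples in the example file
length = 88200         # 2 seconds at 44.1 kHz
stride = 132300        # 3 seconds at 44.1 kHz

for pad in (True, False):
    count, start, end = last_segment(file_length, length, stride, pad)
    # A start offset at or beyond file_length means there is nothing left to load.
    print(f"pad={pad}: examples={count}, last start={start}, "
          f"last end={end}, starts past EOF={start >= file_length}")
```

For these values the padded case gives 28 examples with the last one starting at 3572100 (past the end of the file), while the unpadded case gives 27 examples with the last one ending at 3528000.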

I drafted a quick spreadsheet that should allow you to experiment with different values: https://docs.google.com/spreadsheets/d/1Fb4jawXv-MRwmp9usFhPs61199-qpo3nwp_8OkPmlJQ/edit?usp=sharing

In case you're wondering why I tried this, I wanted to see whether switching segment lengths on the same model (A) works and (B) helps "diversify" the data points (image upscalers typically pick a random region of the image on each epoch of training, and I figured that might help here). A is definitely true, and B is promising: I've only run about 5 epochs, but it may have reduced some of the buzzy artifacts that I, and I think other folks, have had issues with.