maxrmorrison/pyfoal

Splicing beginning and end silence

Closed this issue · 4 comments

I'm trying to cut the silence out of the beginning and end of my files, but I'm running into some issues with the splicing. My steps are the following:

  1. Run the alignment to produce an alignment.json file for a specific label.json and wav file (already provided in the Drive folder below)
  2. Splice out the beginning and end of the wav file if the aligned word there is "sp"
  3. Run the forced alignment again on the new spliced wav file (without the silence) with the original label.json
  4. The alignment should now have no silence at the beginning or end, but I've observed that it still does

Following is the code I used with the attached methods. I was hoping you could try to reproduce this and point out why I can't splice out the silence.

Here is the Drive folder containing the files:
https://drive.google.com/drive/folders/1tj6nHljdxZbyghsUm7WNNqI6kLW_SMZV?usp=sharing

import glob
import json
import os

from pydub import AudioSegment

import pyfoal


def remove_silence(directory, ext='.wav', json_name='alignment.json'):
    wav_paths = sorted(os.path.abspath(path) for path in glob.glob(os.path.join(directory, f'**/*{ext}'), recursive=True))
    json_paths = sorted(os.path.abspath(path) for path in glob.glob(os.path.join(directory, f'**/*{json_name}'), recursive=True))

    for json_file, wav_file in zip(json_paths, wav_paths):
        with open(json_file) as f:
            json_object = json.load(f)

        # Each word entry holds (word, start time, end time) in seconds
        first_value = list(json_object['words'][0].values())
        last_value = list(json_object['words'][-1].values())

        final_sound = AudioSegment.from_file(wav_file)

        # pydub slices in milliseconds; alignment times are in seconds
        new_starting_time, new_ending_time = None, None
        if first_value[0] == 'sp':
            print('Beginning frame end sp ' + str(first_value[2]))
            new_starting_time = first_value[2] * 1000
        if last_value[0] == 'sp':
            print('Last frame start ' + str(last_value[1]))
            new_ending_time = last_value[1] * 1000

        # Trim and overwrite the original file if either edge was silence
        if new_starting_time is not None or new_ending_time is not None:
            print(len(final_sound))
            print(wav_file)
            final_sound = final_sound[new_starting_time:new_ending_time]
            final_sound.export(wav_file, format='wav')


def clean_adjusted_alignment(directory, wav_ext='.wav', json_name='label.json'):
    wav_paths = sorted(os.path.abspath(path) for path in glob.glob(os.path.join(directory, f'**/*{wav_ext}'), recursive=True))

    missing_wav_files = []
    missing_json_files = []
    missing_alignment_files = []

    # Re-align only the wav files that don't yet have a second alignment
    for wav_file in wav_paths:
        if not os.path.exists(os.path.join(os.path.dirname(wav_file), 'alignment2.json')):
            missing_wav_files.append(wav_file)
            missing_json_files.append(os.path.join(os.path.dirname(wav_file), json_name))
            missing_alignment_files.append(os.path.join(os.path.dirname(wav_file), 'alignment2.json'))

    print(missing_wav_files)

    pyfoal.from_files_to_files(missing_json_files, missing_wav_files, missing_alignment_files)



if __name__ == '__main__':

    remove_silence('AK2/')
    clean_adjusted_alignment('AK2/')

After the above runs, there should be no silence at the beginning or end of the file, but for some reason that's not the case. As we can see in 'alignment2.json', there is still silence.

TL;DR:

The first forced alignment ran perfectly. When I tried to splice the beginning and end silence out of the wav file, I got an issue with the alignment when I reran it. I'm not sure why this happens with some files and not others.

I wouldn't recommend using forced alignment for removing silences. The quantization step of P2FA is ~30 ms, which is too coarse for that kind of task. You could try MFA, which I have in the dev branch. Or, better yet, use a loudness curve or voice activity detection. Another library I made, torchcrepe, has a function, torchcrepe.loudness.a_weighted, which is fast to run and gives per-frame loudness values at a hop size you can specify (so you control the granularity). Personally, I'd start with voice activity detection (VAD) for silence removal (e.g., torchaudio.functional.vad). This will also be much faster than forced alignment.
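For reference, here's a minimal sketch of the VAD route, assuming a mono wav file at the hypothetical path speech.wav. torchaudio.functional.vad only trims leading silence (it mirrors the SoX vad effect), so the usual trick is to flip the waveform and run it a second time to trim the end:

import torch
import torchaudio

# Hypothetical input path; waveform has shape (channels, samples)
waveform, sample_rate = torchaudio.load('speech.wav')

# Trim leading silence
trimmed = torchaudio.functional.vad(waveform, sample_rate)

# Flip, trim the (now leading) trailing silence, then flip back
trimmed = torch.flip(trimmed, [-1])
trimmed = torchaudio.functional.vad(trimmed, sample_rate)
trimmed = torch.flip(trimmed, [-1])

torchaudio.save('speech-trimmed.wav', trimmed, sample_rate)

The default sensitivity parameters (e.g., trigger_level) may need tuning for your data.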

Isn't forced alignment better than almost any other method, since it tells you the timestamps where each letter in the transcript occurs? That way you can trim as soon as the first letter starts being spoken, instead of guessing what volume the file starts at. Controlling the granularity does sound interesting, but I'm just curious about precision.

Forced alignment is not better. It's much slower, and the timestamps aren't as accurate. That is, the precision is worse.

Thank you for the help!