1- pick one entry from the subtitle data whose start time equals t
2- select from the ASR data the candidate entries whose start times fall in the interval [t - 2 min, t + 2 min] (the delay between the ASR data and the subtitles is usually small)
3- for each (subtitle, ASR candidate) pair, calculate a matching score (I used the FuzzyWuzzy Python library, which does fuzzy string matching)
4- keep the pair(s) with the highest score
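
The four steps above can be sketched roughly as follows. This is only an illustration, not the code in the scripts: the dict layout (`start`, `text`), the function names `score` and `best_match`, and the sample sentences are all hypothetical, and the score is computed with the standard library's `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy so the sketch runs without extra dependencies.

```python
from datetime import datetime, timedelta
from difflib import SequenceMatcher

# Assumption: a +/- 2 minute window around the subtitle start time (step 2).
WINDOW = timedelta(minutes=2)


def score(a, b):
    """Similarity in [0, 100]. Stand-in for a FuzzyWuzzy scorer (assumption)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(subtitle, asr_entries, window=WINDOW):
    """Return (best ASR entry, score) for one subtitle, or None if no candidate."""
    t = subtitle["start"]
    # Step 2: keep ASR entries whose start time lies in [t - window, t + window].
    candidates = [e for e in asr_entries if abs(e["start"] - t) <= window]
    if not candidates:
        return None
    # Steps 3 and 4: score every pair and keep the highest-scoring one.
    best = max(candidates, key=lambda e: score(subtitle["text"], e["text"]))
    return best, score(subtitle["text"], best["text"])


# Hypothetical sample data, for illustration only.
subtitle = {"start": datetime(2020, 1, 1, 13, 0, 0),
            "text": "bonjour et bienvenue dans le journal"}
asr = [
    {"start": datetime(2020, 1, 1, 13, 0, 20),
     "text": "bonjour et bienvenue dans ce journal"},
    {"start": datetime(2020, 1, 1, 13, 1, 30),
     "text": "la météo de demain"},
    # Identical text, but 10 minutes later, so it falls outside the window.
    {"start": datetime(2020, 1, 1, 13, 10, 0),
     "text": "bonjour et bienvenue dans le journal"},
]

match = best_match(subtitle, asr)
print(match)
```

Filtering by time before scoring keeps the number of string comparisons small and avoids matching a sentence that is repeated much later in the broadcast.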
python3 preprocess_data.py --pa_subtitles 'pa_subtitles.csv' --INA_subtitles 'INA_subtitles.json'
python3 alignement_task.py
python3 alignement_task_13h_delay.py