/yo


Approach

  1. Clustering
  2. Optimization
  3. Smoothening
Clustering
  1. The Gap Threshold (G):
    This is a user-defined parameter (gap_param, default value 1.3) that acts as a threshold for deciding whether two consecutive dialogues belong to the same cluster or to different clusters. It is a temporal distance measured in seconds.

  2. Clustering Logic:
    The algorithm sorts voiceover dialogues into groups (clusters) where each cluster consists of dialogues that follow each other with a time gap smaller than or equal to the gap threshold (G).

Mathematically, the clustering can be defined as follows:

  • Let D = [d0, d1, ..., dn] be a list of dialogues, where each di has start_time and end_time.
  • Two consecutive dialogues di and dj (with j = i + 1) belong to the same cluster if dj.start_time - di.end_time <= G.
  • If dj.start_time - di.end_time > G, then dj starts a new cluster.
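
The clustering rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the Dialogue class and the cluster_dialogues function are hypothetical names, and only gap_param with its default of 1.3 comes from the description.

```python
from dataclasses import dataclass


@dataclass
class Dialogue:
    # Hypothetical container; times are in seconds.
    start_time: float
    end_time: float


def cluster_dialogues(dialogues, gap_param=1.3):
    """Group consecutive dialogues whose gap is <= gap_param seconds."""
    clusters = []
    for d in sorted(dialogues, key=lambda x: x.start_time):
        # Compare against the end of the last dialogue in the current cluster.
        if clusters and d.start_time - clusters[-1][-1].end_time <= gap_param:
            clusters[-1].append(d)
        else:
            clusters.append([d])
    return clusters


dialogues = [Dialogue(0.0, 2.0), Dialogue(2.5, 4.0), Dialogue(6.0, 7.0)]
clusters = cluster_dialogues(dialogues)
print([len(c) for c in clusters])  # [2, 1]
```

The first two dialogues are 0.5 s apart and share a cluster; the third starts 2.0 s after the second ends, which exceeds G, so it opens a new cluster.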
Optimization

Objective Function:
The objective function to be minimized is defined as the sum of the squares of the differences between subsequent variables in a list. For a list of variables vars with N elements, the objective function f(vars) can be written as:
$$f(vars) = \sum_{i=0}^{N-2} (vars[i+1] - vars[i])^2$$
The goal of the optimization is to find the values of vars that minimize this objective function.
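
As a small illustration, the objective can be evaluated directly in Python. The function name objective and the sample values are assumptions made for this sketch; only the formula comes from the text.

```python
def objective(vars_):
    """Sum of squared differences between consecutive entries of vars_."""
    return sum((vars_[i + 1] - vars_[i]) ** 2 for i in range(len(vars_) - 1))


uneven = objective([1.0, 1.5, 1.0])      # (0.5)^2 + (-0.5)^2
uniform = objective([1.25, 1.25, 1.25])  # no change between neighbours
print(uneven, uniform)  # 0.5 0.0
```

A list with equal entries gives 0, so the objective favours values that change as little as possible from one element to the next.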

Equality Constraint:
The problem includes one equality constraint: the sum of the fractions L[i]/vars[i] minus the sum of the constant fractions L[i]/A[i] must be zero. The equality constraint g(vars) is:
$$g(vars) = \sum_{i=0}^{N-1} \frac{L[i]}{vars[i]} - \sum_{i=0}^{N-1} \frac{L[i]}{A[i]} = 0$$
This means that the sum of L[i]/vars[i] over all elements must equal the sum of L[i]/A[i] over all elements.

Inequality Constraints:
There are also inequality constraints which are expressed as a sequence of inequalities that relate consecutive variables based on the parameters L, A, and R. The inequality constraints h(vars) must be such that:
$$
h_i(vars) = \left\{ \begin{array}{ll}
R + \left( \frac{L[0]}{vars[0]} - \frac{L[0]}{A[0]} \right) \geq 0 & \text{for } i=0 \\
R - \left( \frac{L[0]}{vars[0]} - \frac{L[0]}{A[0]} \right) \geq 0 & \text{for } i=1 \\
\text{Subsequent constraints build upon the previous ones} & \text{for } i \geq 2
\end{array} \right.
$$

These constraints enforce a maximum allowable deviation (R) between the scaled inverse L[i]/vars[i] of each variable and the corresponding reference value L[i]/A[i].
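
The constraints can be checked numerically with a short Python sketch. The function names and the sample values of L, A, vars and R are hypothetical. Only the two inequalities written out explicitly (i = 0 and i = 1) are covered, because the text does not spell out the form of the constraints for i ≥ 2.

```python
def equality_constraint(vars_, L, A):
    """g(vars) = sum(L[i]/vars[i]) - sum(L[i]/A[i]); feasible when this is 0."""
    return sum(l / v for l, v in zip(L, vars_)) - sum(l / a for l, a in zip(L, A))


def first_inequalities(vars_, L, A, R):
    """Return (h_0, h_1) for the first element; both must be >= 0."""
    deviation = L[0] / vars_[0] - L[0] / A[0]
    return R + deviation, R - deviation


# Hypothetical sample values, chosen so that the equality constraint holds.
L = [2.0, 6.0]
A = [1.0, 2.0]
vars_ = [2.0, 1.5]
R = 1.5

g = equality_constraint(vars_, L, A)       # (1 + 4) - (2 + 3)
h0, h1 = first_inequalities(vars_, L, A, R)  # deviation is 1 - 2 = -1
print(g, h0, h1)  # 0.0 0.5 2.5
```

Both h values are non-negative, so the deviation of -1 stays within the ±R band of 1.5.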

Smoothing

This step follows the approach of duration manipulation with the Praat software.

Praat automatically interpolates linearly between DurationTier points, so we only have to specify which speed to apply at which position. Our algorithm decides those positions and speeds.

Mathematically, for a given speed-up factor s for a segment, setting a DurationTier point to 1/s means that the segment will be played back at a rate s times its original speed. If s > 1, the playback is faster; if s < 1, the playback is slower. By adjusting these points gradually, the change in playback speed can be made smoother, so it's less jarring to listeners. The script mathematically calculates the positions of the DurationTier points to create a linear transition in playback speed between adjacent audio segments.

The position of the points in the DurationTier is calculated based on the relaxation value (relaxation), speed-up factors (s1, s2, s3), and the total duration of the original sound (total_duration). The relaxation value is used to determine a transition region where the speed changes gradually.

For the first sound file:

  • We calculate relaxation based on the current and next duration.
  • r is calculated as the minimum of relaxation / s1 and relaxation / s2, ensuring that the transition region does not exceed the designated relaxation period and remains proportional to the speed-up factors.
  • Two points are then added to the DurationTier to hold the speed at s1 until the transition begins:
    • Start point: call(duration_tier, "Add point", 0.00, 1/s1)
    • Start of transition: call(duration_tier, "Add point", total_duration - r * s1, 1/s1)
  • A third point is added at the end of the audio (total_duration) with value 1/speed, where speed is an intermediate value between s1 and s2:
    • speed = s1 + s1 * (s2 - s1) / (s1 + s2), which simplifies to the harmonic mean 2 * s1 * s2 / (s1 + s2)
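
The point placement for the first sound file can be sketched in plain Python. This is an illustration under assumptions: first_file_points is a hypothetical helper, relaxation is taken as an input because its derivation from the current and next duration is not given here, and the function only returns (time, value) pairs instead of calling Praat. Each pair could then be passed to call(duration_tier, "Add point", time, value) as in the bullets above.

```python
def first_file_points(total_duration, s1, s2, relaxation):
    """Return the three DurationTier points (time, value) for the first file."""
    # Transition length, limited by both speed-up factors.
    r = min(relaxation / s1, relaxation / s2)
    # Intermediate speed reached at the end of the file.
    speed = s1 + s1 * (s2 - s1) / (s1 + s2)
    return [
        (0.0, 1 / s1),                      # start of the file
        (total_duration - r * s1, 1 / s1),  # speed s1 is held up to here
        (total_duration, 1 / speed),        # end of the file
    ]


points = first_file_points(total_duration=10.0, s1=1.0, s2=2.0, relaxation=0.5)
for time, value in points:
    print(round(time, 3), round(value, 3))
```

With these sample values r is 0.25, so the speed stays at s1 until 9.75 s and the last 0.25 s ramps towards the intermediate speed of about 1.333.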

For the middle sound files, similar calculations are performed but with additional consideration for transitions at both the beginning and the end.

For the last sound file:

  • The starting point for the DurationTier is set to an intermediary speed (1/speed) instead of starting at 1/s2, and the transition end uses the same logic as above.

Mathematically, we are defining a linear interpolation for the speed change over the transition region. Let's define:

  • t0: Start time of the transition region.
  • t1: End time of the transition region within the total_duration.
  • s(t): Speed function over time.

For the first sound file, we set t0 = total_duration - r * s1 and s(t0) = s1. We want to find s(t1) at t1 = total_duration.

We can then define a linear interpolation of speed over time as follows:

$$s(t) = s(t0) + \frac{(s(t1) - s(t0))}{(t1 - t0)} \cdot (t - t0)$$

Where $t$ falls between $t0$ and $t1$, which defines a straight line between the points $(t0, s(t0))$ and $(t1, s(t1))$ on a graph of speed over time.

By setting points at $t0$ and $t1$, we're effectively dictating the slope of our line, creating a gradual transition from one speed to another over the duration of the relaxation period, avoiding sudden changes in playback speed.
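
The interpolation formula can be tried out with a short Python sketch. The function name interpolate_speed and the sample numbers are assumptions for illustration only.

```python
def interpolate_speed(t, t0, t1, s_t0, s_t1):
    """Linear interpolation of speed between (t0, s_t0) and (t1, s_t1)."""
    if not t0 <= t <= t1:
        raise ValueError("t must lie within [t0, t1]")
    if t1 == t0:
        return s_t0  # degenerate transition region
    return s_t0 + (s_t1 - s_t0) / (t1 - t0) * (t - t0)


# Halfway through a 1 s transition from speed 1.0 to speed 1.5.
mid_speed = interpolate_speed(9.5, t0=9.0, t1=10.0, s_t0=1.0, s_t1=1.5)
print(mid_speed)  # 1.25
```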

Audio Outputs

Issue 3

  • Audio quality seems to have dropped
  • Tested with no smoothening
  • Using FFmpeg for speed-up

Samples

| Sample | With Smoothening | Without Smoothening |
|---|---|---|
| Id (1656) | Video | Video |

All Algorithm Comparison

Samples

Semantic Shortening Our New Algo Our New Algo + Semantic Shortening ILP ILP + Semantic Shortening
Id (1695) Video Video Video

Issue 2

  • Playback rate issue: the previous algorithm did not handle the case where the playback rate is less than 1x
  • Handled this in a preprocessing step (add silence if the output audio is shorter than the source audio)
  • Changed the optimization algorithm to include constraints on playback rates below 1x
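
The silence-padding step can be sketched as follows. This is an assumption-heavy illustration: pad_with_silence is a hypothetical helper, the audio is modelled as a plain list of PCM samples, and the silence is appended at the end because the description does not say where it is inserted.

```python
def pad_with_silence(samples, target_length):
    """Append zero-valued samples until the audio reaches target_length."""
    missing = target_length - len(samples)
    if missing <= 0:
        return list(samples)  # already long enough, nothing to add
    return list(samples) + [0] * missing


output_audio = [3, -2, 5]  # hypothetical output samples
padded = pad_with_silence(output_audio, target_length=5)
print(padded)  # [3, -2, 5, 0, 0]
```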

Samples

| Sample | Previous Algo | Fixed Algo | Timestamps |
|---|---|---|---|
| Id (1398) | Audio | Audio | 0:31 - 0:35 |

Issue 1

  • Cut-off issue: sometimes 10 ms to 1 s of a dialogue is cut off due to precision errors
  • Handled this in a post-processing step
  • If a cut-off is detected, we speed up the dialogue a little more so that it fits inside its dialogue frame
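
One way to compute the additional speed-up is shown below. It is a sketch under assumptions: extra_speedup is a hypothetical helper, and the ratio of audio duration to frame duration is one plausible choice, not necessarily what the post-processing step does.

```python
def extra_speedup(audio_duration, frame_duration):
    """Factor by which to speed up the audio so it fits inside its frame."""
    if frame_duration <= 0:
        raise ValueError("frame_duration must be positive")
    # Never slow down here: a dialogue that already fits keeps factor 1.0.
    return max(1.0, audio_duration / frame_duration)


# A 2.5 s dialogue that must fit into a 2.0 s frame.
factor = extra_speedup(2.5, 2.0)
print(factor)  # 1.25
```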

Samples

| Sample | Baseline | Previous Algo | Fixed Algo | Timestamps |
|---|---|---|---|---|
| Id (1411) | Audio | Audio | Audio | 2:04 - 2:15 |

Approach 3

  • Audio smoothening + cluster based
  • Here we assume that the start and end times of each cluster are preserved
| Sample | Baseline | Clustering + Relaxation | Clustering + Smoothening | Clustering + Relaxation + Smoothening |
|---|---|---|---|---|
| Id(4410) | Audio | Audio | Audio | Audio |
| Id(4514)_id | Audio, Video | Audio | Audio | Audio, Video |
| Id(4514)_hi | Audio | Audio | Audio | Audio |

SpeedUpOptimisation

Baseline

Clustering + Relaxation + Smoothening

Clustering + Relaxation

Clustering + Smoothening

Approach 2

  • Audio smoothening + video smoothening
  • Here we assume that every dialogue's start and end times are preserved
  • The gap parameter means that a dialogue whose gap to the previous one is smaller than this value is added to the previous cluster segment
Baseline
Only Audio Smoothening
Video Smooth + Audio Smooth (Gap Parameter 1.2)
Video Smooth + Audio Smooth (Gap Parameter 3)

Approach 1

  • Only audio smoothening
  • Here we assume that every dialogue's start and end times are preserved
  • The gap parameter (G) means that a dialogue whose gap to the previous one is smaller than this value is added to the previous cluster segment
  • The relaxation parameter (R) is the amount of relaxation applied when smoothening

Baseline

Audio

| G \ R | R(0.1) | R(0.2) | R(0.4) | R(0.5) | R(0.7) | R(0.9) | R(1.0) | R(1.2) | R(1.5) |
|---|---|---|---|---|---|---|---|---|---|
| G(0.5) | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio |
| G(0.8) | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio |
| G(1.0) | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio |
| G(1.2) | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio |
| G(1.5) | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio |
| G(1.8) | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio |
| G(2.0) | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio | Audio |