anidl/multi-downloader-nx

[Feedback]: Enhancement ideas for multi-audio Crunchyroll (or any video)


Type: Both

Suggestion

I found this project recently because of the DRM drama. Since it works while yt-dlp does not, I started using it and noticed a few issues that I had already solved in my custom app. One issue I noticed right away is the lack of proper multi-audio support.

I have only tried a few shows, but I noticed that sometimes both languages (English and Japanese, in my case) are downloaded, yet only the first language is kept because their lengths differ. My assumption (without checking the code) is that if both videos are the same length they are merged, but if the lengths differ, the merge silently fails.

In reality, CR videos often have different lengths between languages (except for some of the latest ones). The reason is that different languages have different logos at the start of the video, and the rest is the same. Sometimes the same happens at the end, but mostly it is at the beginning.

I have some code (in Python, not TypeScript) that can combine around 95% of CR videos automagically and 99% with a little manual tweaking (there are always some with extra clips in the middle of the video, and I never figured those out).

If this is of interest (and I would like to help bring this app to the level my personal yt-dlp-based app was at before the DRM change), I am willing to share the general logic or even some Python code.

Please let me know if that might be interesting.

P.S. I have noticed another big issue with duplicate subtitles in the same language, but I will create a separate ticket about that.

You can actually use --syncTiming, and for most videos it will sync the audio timing to the video (assuming I remembered to implement it for MPDs). I would like to hear how your Python script worked and the general ideas behind it though, as it may help improve the syncTiming flag.

Alternatively, if you want to keep all the videos, that's also a flag (--keepAllVideos). You can also choose to just not download the extra videos with the --dlVideoOnce flag. I definitely recommend checking over the available flags in the documentation: https://github.com/anidl/multi-downloader-nx/blob/master/docs/DOCUMENTATION.md

It is good to know about those commands. I started with the GUI, but I need to check out the CLI as well.

Short version of audio sync.

Because the audio comes in different languages, I decided it would be hard to sync by audio, so I do it all by video. My assumption is: as long as I can match the video frames, the audio will be fine.

With that in mind, using ffmpeg, I process the beginning and the end of the video (usually 30, 60, or 120 seconds; longer might be more precise but is a little more work). For that duration I extract scene-change keyframes along with their timestamps, from which the frame number can be calculated (frame = time × fps). Then I look for a frameset (2 or 3 consecutive frames) that has the same content in both files. For example, if the Japanese video starts with the opening song while the English video starts with a company logo followed by the opening song, you will eventually find a pair of frames from the opening song with the same content in both. For the comparison, I used an image hash.
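To illustrate the hash comparison, here is a minimal sketch using the Python imagehash and Pillow packages (the frame file names are hypothetical). Subtracting two perceptual hashes gives the Hamming distance between them; my comparison code below uses a threshold of 1, i.e. an effectively exact match:

import imagehash
from PIL import Image

# Perceptual hashes of one candidate frame from each language's video
h_jpn = imagehash.phash(Image.open("frame_jpn_0412.png"))  # hypothetical file
h_eng = imagehash.phash(Image.open("frame_eng_0431.png"))  # hypothetical file

# Subtracting two ImageHash objects yields the Hamming distance:
# 0 means identical, small values mean visually similar frames
if h_jpn - h_eng < 1:
    print("Same content - candidate frame pair")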
After the pair is found in each video, I calculate the difference between them (usually the length of the extra logo, etc.).
After this, I repeat the same procedure at the end of the video.

The next step is to compare the offsets (the differences found between the frame pairs at the beginning and at the end). In the ideal case the two agree to within 0.05s, which is perfect (e.g., a start offset of 4.00s and an end offset of 4.03s differ by only 0.03s and pass). Sometimes the difference is around 0.8s, which is still good. In rare cases it is > 1s, which is not good and is a candidate for a manual check. And sometimes something goes wrong entirely: no match is found, or the difference is something like 20s, and then nothing will work. For these bad cases I had an option in the code to create a giant JPG of the keyframes with their timestamps so I could check them manually, with the ability to override the offset.

After all this, I just apply the calculated offset of each language to both its audio and its subtitles. (There is still a subtitle issue I have not mentioned: for example, a Japanese video that comes with English subs, plus English audio that also comes with English subs, where the latter cover signs and other on-screen text rather than dialogue.)
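To show how a calculated offset can be applied at mux time, here is a minimal ffmpeg-python sketch (the file names and the 1.5s offset are made up, and my actual code wires this differently):

import ffmpeg

# Keep the base (Japanese) video and audio as-is, and delay the English
# audio by the calculated offset via -itsoffset (negative values advance it)
base = ffmpeg.input("video_jpn.mkv")                # hypothetical file
dub = ffmpeg.input("audio_eng.m4a", itsoffset=1.5)  # hypothetical offset in seconds
(
    ffmpeg
    .output(base.video, base.audio, dub.audio, "merged.mkv", c="copy")
    .run()
)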

Technical version
I used Claude 3 Opus to convert the Python code to TypeScript; the original Python code is also included below.

Getting frame data code
import * as fs from 'fs';
import * as path from 'path';
import ffmpeg from 'fluent-ffmpeg';
import { Image } from 'image-js';
// NOTE: the perceptual-hash calls below are placeholders from the automatic
// conversion; substitute a pHash library of your choice (the Python original
// uses imagehash.phash())
import * as imagehash from 'image-hash';

interface StartFrame {
  time: number;
  frame: number;
  hash: string;
}

class VideoProcessor {
  private video_file_path: string;
  private video_height: number;
  private video_width: number;
  private fps: number;
  private duration: number;
  private check_seconds: number;
  private use_third: boolean;
  private debug: boolean;
  private temp: string;
  private language: { code3: string };
  private start_frames: StartFrame[];
  private end_frames: StartFrame[];

  constructor(/* ... */) {
    // Initialize the properties in the constructor
    // ...
  }

  public async processing_frames(): Promise<void> {
    console.log("Processing Frames");
    const process_time = this.check_seconds;
    const offset_time = Math.round(this.duration * 10000) / 10000 - 120 - process_time;

    const process = async (duration: number, offset: number, prefix: string): Promise<StartFrame[]> => {
      // Start time for measuring processing duration
      const start = Date.now();

      // Extract frames from the video using ffmpeg
      const { stdout, stderr } = await new Promise<{ stdout: Buffer; stderr: string }>((resolve, reject) => {
        ffmpeg(this.video_file_path)
          .seekInput(offset)
          .videoFilters([
            {
              filter: 'select',
              options: 'gt(scene\\,0.1)'
            },
            {
              filter: 'showinfo'
            }
          ])
          .output('pipe:')
          .outputOptions([
            '-t', duration.toString(),
            '-fps_mode', 'vfr',
            '-frame_pts', '1',
            '-f', 'rawvideo',
            '-pix_fmt', 'rgb24'
          ])
          .on('error', (err, stdout, stderr) => {
            reject(err);
          })
          .on('end', (stdout, stderr) => {
            resolve({ stdout, stderr });
          })
          .run();
      });

      // End time for measuring processing duration
      const end = Date.now();
      if (this.debug) {
        console.log(`Time to extract frame data: ${end - start}ms`);
      }

      // Decode the error output from ffmpeg
      const err = stderr.toString();

      // Initialize lists to store timeframes, start frames, and start video frames
      const timeframes: number[] = [];
      const start_frames: StartFrame[] = [];
      const start_video: Uint8Array[] = [];

      // Split the raw RGB24 output into one Uint8Array per frame
      const frame_size = this.video_width * this.video_height * 3;
      const raw = new Uint8Array(stdout);
      const temp_start_video: Uint8Array[] = [];
      for (let pos = 0; pos + frame_size <= raw.length; pos += frame_size) {
        temp_start_video.push(raw.subarray(pos, pos + frame_size));
      }

      // If using the middle third of the frames, crop the frames accordingly
      if (this.use_third) {
        for (const frame of temp_start_video) {
          // Wrap the raw RGB pixels directly; Image.load() expects an encoded
          // image (PNG/JPEG), not a bare pixel buffer
          const im_pil = new Image(this.video_width, this.video_height, frame, { kind: 'RGB' });
          const crop_y1 = Math.floor(im_pil.height * 0.333);
          const crop_y2 = Math.floor(im_pil.height * 0.666);
          const third_image = im_pil.crop({ x: 0, y: crop_y1, width: im_pil.width, height: crop_y2 - crop_y1 });
          start_video.push(new Uint8Array(third_image.data));
        }
      } else {
        start_video.push(...temp_start_video);
      }

      // Extract timeframes from the ffmpeg error output
      const regex = /pts_time(.*?)duration/g;
      let match;
      while ((match = regex.exec(err)) !== null) {
        const timeStr = match[1].trim();
        const time = parseFloat(timeStr);
        timeframes.push(time);
      }

      // Create a list of start frames with their corresponding time, frame
      // number, and perceptual hash. NOTE: the hash call below is a leftover
      // from the automatic conversion; the Python original uses
      // imagehash.phash(), so substitute any pHash implementation whose
      // hashes can be compared by Hamming distance.
      for (let i = 0; i < start_video.length; i++) {
        const time = timeframes[i] + offset;
        const frame = Math.floor(time * this.fps);
        const hash = await imagehash.hash(start_video[i], 8, 'binary');
        start_frames.push({ time, frame, hash });
      }

      // If in debug mode, create a collage of the start frames and save it
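      // NOTE: this debug block is a rough automatic conversion: Image.load()
      // cannot decode raw pixel buffers, and drawImage()/drawText() do not
      // appear in the image-js API; a node-canvas based port of the PIL
      // collage code would be needed for a working version.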
      if (this.debug) {
        const details = path.join(this.temp, `v_${this.language.code3}_${prefix}.png`);
        if (fs.existsSync(details)) {
          fs.unlinkSync(details);
        }
        const base_image = await Image.load(start_video[0]);
        const new_height = 320;
        const new_width = Math.floor(new_height / base_image.height * base_image.width);
        const horizontal_images = 5;
        const vertical_images = Math.ceil(start_video.length / horizontal_images);
        const collage = new Image(new_width * horizontal_images, new_height * vertical_images);
        let image_id = 0;

        // Iterate over the start frames and paste them onto the collage
        for (let y = 0; y < vertical_images; y++) {
          for (let x = 0; x < horizontal_images; x++) {
            if (image_id < start_frames.length) {
              const frame_image = await Image.load(start_video[image_id]);
              const resized_image = frame_image.resize({ width: new_width, height: new_height });
              collage.drawImage(resized_image, { x: x * new_width, y: y * new_height });
              collage.drawText(`${start_frames[image_id].time} (${start_frames[image_id].frame})`, { x: x * new_width, y: y * new_height, color: [255, 0, 0] });
              image_id++;
            }
          }
        }
        await collage.save(details);
      }

      // Return the list of start frames
      return start_frames;
    };

    // Process the start frames
    this.start_frames = await process(process_time, 0, 'start');

    // Process the end frames
    this.end_frames = await process(process_time, offset_time, 'end');
  }
}
Getting frame data code (Python version)
def processing_frames(self):
    print(f"Processing Frames")
    process_time = self.check_seconds
    offset_time = (round(self.duration, 4) - 120 - process_time)

    def process(duration, offset, prefix):
        # Start time for measuring processing duration
        start = time.time()
        
        # Extract frames from the video using ffmpeg
        out, err = (
            ffmpeg
            .input(self.video_file_path, ss=offset)
            .filter('select', 'gt(scene, 0.1)')
            .filter('showinfo')
            .output('pipe:', t=duration, fps_mode='vfr', frame_pts=True, format='rawvideo', pix_fmt='rgb24')
            .run(capture_stdout=True, capture_stderr=True)
        )
        
        # End time for measuring processing duration
        end = time.time()
        if self.debug:
            print("Time to extra frame data: {}".format(end - start))
        
        # Decode the error output from ffmpeg
        err = err.decode("utf-8")
        
        # Initialize lists to store timeframes, start frames, and start video frames
        timeframes = []
        start_frames = []
        start_video = []
        
        # Convert the raw video data to a numpy array
        temp_start_video = (
            np
            .frombuffer(out, np.uint8)
            .reshape([-1, self.video_height, self.video_width, 3])
        )
        
        # If using the middle third of the frames, crop the frames accordingly
        if self.use_third:
            for frame in temp_start_video:
                im_pil = Image.fromarray(frame)
                crop_x1 = 0
                crop_x2 = im_pil.width
                crop_y1 = int(im_pil.height * 0.333)
                crop_y2 = int(im_pil.height * 0.666)
                third_image = im_pil.crop((crop_x1, crop_y1, crop_x2, crop_y2))
                start_video.append(np.array(third_image))
        else:
            start_video = temp_start_video

        # Extract timeframes from the ffmpeg error output
        for line in iter(err.splitlines()):
            if 'pts_time' in line and 'duration' in line:
                timeframes.append(
                    float(re.findall(r"\d+(?:\.\d+)?", re.search(r'pts_time(.*)duration', line).group(1))[0]))
        
        # Create a list of start frames with their corresponding time, frame number, and image hash
        for i in range(len(start_video)):
            start_frames.append({'time': timeframes[i] + offset, 'frame': int((timeframes[i] + offset) * self.fps),
                                 'hash': imagehash.phash(Image.fromarray(start_video[i]))})
        
        # If in debug mode, create a collage of the start frames and save it
        if self.debug:
            details = f'{self.temp}{os.path.sep}v_{self.language.code3}_{prefix}.png'
            if os.path.exists(details):
                os.remove(details)
            base_image = Image.fromarray(start_video[0])
            new_height = 320
            new_width = int(new_height / base_image.height * base_image.width)
            horizontal_images = 5
            vertical_images = math.ceil(len(start_video) / horizontal_images)
            collage = Image.new("RGBA", (new_width * horizontal_images, new_height * vertical_images))
            image_id = 0
            font = ImageFont.truetype(r"C:\Windows\Fonts\arial.ttf", 24)
            draw = ImageDraw.Draw(collage)

            # Iterate over the start frames and paste them onto the collage
            for y in range(vertical_images):
                for x in range(horizontal_images):
                    if len(start_frames) > image_id:
                        collage.paste(Image.fromarray(start_video[image_id]).convert("RGBA").resize((new_width, new_height)), (x * new_width, y * new_height))
                        draw.text((x * new_width, y * new_height), text=f'{start_frames[image_id]["time"]} ({start_frames[image_id]["frame"]})', font=font, align="left",
                                  fill="red")
                        image_id += 1
            collage.save(details)
        
        # Return the list of start frames
        return start_frames

    # Process the start frames
    self.start_frames = process(process_time, 0, 'start')
    
    # Process the end frames
    self.end_frames = process(process_time, offset_time, 'end')

In the provided code, the offset calculation compares the current video against a provided base video; if the base passed in is the current video itself, nothing is calculated.

Comparing frame data
interface FrameData {
    hash: number;
    time: number;
}

interface VideoData {
    start_frames: FrameData[];
    end_frames: FrameData[];
    language: {
        english_name: string;
    };
    threshold: number;
    forced_offset: boolean;
    forced_offset_value: number;
    // Results written by calculate_offset()
    have_offset: boolean;
    offset: number;
}

// Conceptually a method of a VideoData-like class; the `this` parameter
// below only provides typing for the instance the function is attached to
function calculate_offset(this: VideoData, base_data: VideoData): void {
    /**
     * Calculate the offset between base_data and self.
     * @param base_data The base video data to compare against.
     */
    if (base_data === this) {
        return;
    }

    function compare_frames(base_frames: FrameData[], current_frames: FrameData[], reverse = false): [boolean, number] {
        /**
         * Compare frames between base_frames and current_frames to find the offset.
         * @param base_frames Frames from the base video data.
         * @param current_frames Frames from the current video data.
         * @param reverse Whether to compare frames in reverse order.
         * @return A tuple indicating if an offset is found and the offset value.
         */
        let have_offset = false;
        let offset = 0;
        const hash_threshold = 1;
        // Hoisted out of the loop so the matched indices are still in scope
        // after the `break` below (the original conversion declared them
        // inside the loop body, which does not compile)
        let base_index = 0;
        let pair_index = 0;

        // Iterate over frames in the specified order
        for (let i = reverse ? base_frames.length - 2 : 0; reverse ? i >= 0 : i < base_frames.length - 1; reverse ? i-- : i++) {
            base_index = i;
            const base_second_index = i + 1;
            let base_check_value = 64;
            let pair_check_value = 64;
            pair_index = 0;
            let pair_second_index = 0;

            // Compare hash values of frames between base_frames and current_frames.
            // NOTE: in the Python original these hashes are ImageHash objects whose
            // subtraction yields a non-negative Hamming distance; the plain numeric
            // subtraction here should be replaced by a Hamming-distance helper.
            for (let j = reverse ? current_frames.length - 1 : 0; reverse ? j >= 0 : j < current_frames.length; reverse ? j-- : j++) {
                let hash_diff = base_frames[base_index].hash - current_frames[j].hash;

                if (hash_diff < hash_threshold && hash_diff < base_check_value) {
                    base_check_value = hash_diff;
                    pair_index = j;
                }

                hash_diff = base_frames[base_second_index].hash - current_frames[j].hash;

                if (hash_diff < hash_threshold && hash_diff < pair_check_value) {
                    pair_check_value = hash_diff;
                    pair_second_index = j;
                }
            }

            // Check if consecutive frames match
            if (pair_index + 1 === pair_second_index) {
                have_offset = true;
                break;
            }
        }

        if (have_offset) {
            offset = base_frames[base_index].time - current_frames[pair_index].time;
        }

        return [have_offset, offset];
    }

    // Calculate offset using start frames if not forced
    if (!this.forced_offset) {
        [this.have_offset, this.offset] = compare_frames(base_data.start_frames, this.start_frames, true);

        // Calculate offset using end frames and check tolerance
        const [have_end_offset, end_offset] = compare_frames(base_data.end_frames, this.end_frames, true);
        const end_tolerance = this.threshold;
        const check_end = Math.abs(Math.abs(this.offset) - Math.abs(end_offset)) < end_tolerance;
        this.have_offset = this.have_offset && check_end;
        this.offset = Math.round(this.offset * 10000000) / 10000000;

        console.log(`${this.language.english_name} offset ${this.offset} is used - ${this.have_offset ? "\x1b[32m" : "\x1b[31m"}${this.have_offset}\x1b[0m. End tolerance ${Math.abs(Math.abs(this.offset) - Math.abs(end_offset))} (needed ${end_tolerance} to pass)`);
    } else {
        // Use forced offset value
        this.have_offset = true;
        this.offset = this.forced_offset_value;
        console.log(`Using \x1b[34mForced offset\x1b[0m \x1b[33m${this.offset}\x1b[0m`);
    }
}
Comparing frame data in Python
def calculate_offset(self, base_data):
    """
    Calculate the offset between base_data and self.
    :param VideoData base_data: The base video data to compare against.
    """
    if base_data == self:
        return

    def compare_frames(base_frames, current_frames, reverse=False):
        """
        Compare frames between base_frames and current_frames to find the offset.
        :param base_frames: Frames from the base video data.
        :param current_frames: Frames from the current video data.
        :param reverse: Whether to compare frames in reverse order.
        :return: A tuple indicating if an offset is found and the offset value.
        """
        have_offset = False
        offset = 0
        hash_threshold = 1

        # Iterate over frames in the specified order
        for i in reversed(range(len(base_frames) - 1)) if reverse else range(len(base_frames) - 1):
            base_index = i
            base_second_index = i + 1
            base_check_value = 64
            pair_check_value = 64
            pair_index = 0
            pair_second_index = 0

            # Compare hash values of frames between base_frames and current_frames
            for j in reversed(range(len(current_frames))) if reverse else range(len(current_frames)):
                hash_diff = base_frames[base_index]['hash'] - current_frames[j]['hash']

                if hash_diff < hash_threshold and hash_diff < base_check_value:
                    base_check_value = hash_diff
                    pair_index = j

                hash_diff = base_frames[base_second_index]['hash'] - current_frames[j]['hash']

                if hash_diff < hash_threshold and hash_diff < pair_check_value:
                    pair_check_value = hash_diff
                    pair_second_index = j

            # Check if consecutive frames match
            if pair_index + 1 == pair_second_index:
                have_offset = True
                break

        if have_offset:
            offset = base_frames[base_index]['time'] - current_frames[pair_index]['time']

        return have_offset, offset

    # Calculate offset using start frames if not forced
    if not self.forced_offset:
        self.have_offset, self.offset = compare_frames(base_data.start_frames, self.start_frames, True)

        # Calculate offset using end frames and check tolerance
        have_end_offset, end_offset = compare_frames(base_data.end_frames, self.end_frames, True)
        end_tolerance = self.threshold
        check_end = abs(abs(self.offset) - abs(end_offset)) < end_tolerance
        self.have_offset = self.have_offset and check_end
        self.offset = round(self.offset, 7)

        # Backslashes inside f-string expressions are a SyntaxError before
        # Python 3.12, so build the rich markup tag outside the f-string
        color = "green" if self.have_offset else "red"
        console.print(f'{self.language.english_name} offset {self.offset} is used - '
                      f'[{color}]{self.have_offset}[/{color}]. '
                      f'End tolerance {abs(abs(self.offset) - abs(end_offset))} (needed {end_tolerance} to pass)')
    else:
        # Use forced offset value
        self.have_offset = True
        self.offset = self.forced_offset_value
        console.print(f'Using [blue]Forced offset[/blue] [yellow]{self.offset}[/yellow]')
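To show how the two pieces fit together, here is a hypothetical driver (the class name, constructor arguments, and file names are made up for illustration; the real code wires this up differently):

# Hypothetical glue code: extract frames for both languages, then align the
# English track against the Japanese base. All names here are illustrative.
jpn = VideoData("video_jpn.mkv", language=jpn_lang, check_seconds=60)
eng = VideoData("video_eng.mkv", language=eng_lang, check_seconds=60)

jpn.processing_frames()
eng.processing_frames()

# Comparing the base against itself is a no-op, as noted above
eng.calculate_offset(jpn)
if eng.have_offset:
    print(f"Shift English audio/subtitles by {eng.offset}s")
else:
    print("No reliable offset found - manual override needed")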

I am sure I forgot some details about the code and will be happy to answer anything.