[Feedback]: Enhancement ideas for the multi-audio Crunchyroll (or any video)
Opened this issue · 2 comments
Type: Both / Suggestion
I found this project recently because of the DRM drama. Since it works while yt-dlp does not, I started using it, noticed a few issues, and fixed them in my custom app. The one I noticed right away is the lack of proper multi-audio support.
I have only tried a few shows, but I noticed that sometimes both languages (Eng and Jap, in my case) were downloaded, yet only the first language was kept, because the two files do not have the same length. My assumption (without checking the code) is that if both videos are the same length they are merged, but if the lengths differ, the merge silently fails.
In reality, CR videos often differ in length between languages (except for some of the latest ones). The reason is that different languages have different logos at the start of the video; after that, the rest lines up. Sometimes the same happens at the end, but mostly it is at the beginning.
I have some code (in Python, not TypeScript) that can combine around 95% of CR videos automagically, and 99% with a little manual tweaking (there are always a few with extra clips in the middle of the video, and I never figured those out).
If any of this is of interest (and I would like to help bring this app to the level my personal yt-dlp-based app was at before DRM), I am willing to share the general logic or even some Python code.
Please let me know if that might be interesting.
P.S. Another big issue I have noticed involves duplicate subtitles in the same language, but I will create a separate ticket about that.
You can actually use --syncTiming, and for most videos it will sync the audio timing to the video (assuming I remembered to implement it for MPDs). I would like to hear how it was working/the general ideas in your python script though, as it may help to improve the syncTiming flag.
Alternatively, if you want to keep all the videos, that's also a flag (--keepAllVideos). You can also choose to just not download the extra videos with the --dlVideoOnce flag. I definitely recommend checking over the available flags in the documentation: https://github.com/anidl/multi-downloader-nx/blob/master/docs/DOCUMENTATION.md
Good to know about those flags. I started with the GUI, but I will check the CLI as well.
Short version of audio sync.
Because the audio comes in different languages, matching by audio is hard, so I decided to do it all by video. My assumption: as long as I can match video frames, the audio will be fine.
With that in mind, I use ffmpeg to process the beginning and the end of the video (usually 30, 60, or 120 seconds; longer might be more precise but is a little more work). For that duration I extract keyframes, which come with the actual frame number, and the time can be calculated from the frame. Then I look for a frameset (2 or 3 consecutive frames) with the same content in both files. For example, if the Japanese video starts with the opening song while the English video starts with a company logo and then the opening song, you will eventually find a pair of frames from the opening song with the same content in both. For the comparison I used an image hash.
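The matching idea can be sketched in a few lines of Python. This is a simplified illustration with synthetic integer hashes, not the actual code (which uses `imagehash.phash` on extracted keyframes); `hamming` and `find_matching_pair` are names made up for the sketch:

```python
# Each keyframe is reduced to an integer perceptual hash. Two consecutive
# frames in the base video must match two consecutive frames in the other
# video before the pair is trusted.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def find_matching_pair(base, other, max_dist=1):
    """base/other: lists of {'time': float, 'hash': int}, one entry per keyframe.
    Returns (base_time, other_time) for the first consecutive pair of frames
    that matches in both videos, or None."""
    for i in range(len(base) - 1):
        for j in range(len(other) - 1):
            if (hamming(base[i]['hash'], other[j]['hash']) <= max_dist and
                    hamming(base[i + 1]['hash'], other[j + 1]['hash']) <= max_dist):
                return base[i]['time'], other[j]['time']
    return None

# Synthetic example: the second video has a 4.5 s logo before the shared content.
base = [{'time': 0.0, 'hash': 0xAAAA}, {'time': 2.0, 'hash': 0xBBBB},
        {'time': 3.5, 'hash': 0xCCCC}]
other = [{'time': 1.0, 'hash': 0x1234},  # logo frame, no match in base
         {'time': 6.5, 'hash': 0xBBBB},
         {'time': 8.0, 'hash': 0xCCCC}]

pair = find_matching_pair(base, other)   # (2.0, 6.5)
offset = pair[0] - pair[1]               # -4.5 s
```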
Once the pair is found in each video, I calculate the time difference between them (usually the length of the extra logo, etc.).
After this, I repeat the same procedure at the end of the video.
The next step is to compare the offsets (the differences between the frame pairs at the beginning and at the end). In the ideal case the two offsets agree to within 0.05 s, which is perfect. Sometimes the difference is around 0.8 s, which is still good. In rare cases it is > 1 s, which is not good and is a candidate for a manual check; sometimes something goes wrong and no match is found, or the difference is something like 20 s, and then nothing will work. For these bad cases I had an option in the code to create a giant JPG of the keyframes with their times, so I could check manually and override the offset.
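That start-versus-end sanity check is simple to express. A sketch (the function name and the 1 s default are an illustration of the thresholds described above, not the app's actual values):

```python
def offset_is_reliable(start_offset: float, end_offset: float,
                       tolerance: float = 1.0) -> bool:
    """True when the offsets measured at the start and at the end of the
    episode agree to within `tolerance` seconds; otherwise the match should
    be flagged for a manual check."""
    return abs(abs(start_offset) - abs(end_offset)) < tolerance

offset_is_reliable(-4.50, -4.52)  # True: 0.02 s drift, well within tolerance
offset_is_reliable(-4.5, -6.1)    # False: 1.6 s drift, needs a manual look
```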
After all this, the calculated offset is applied to each audio track and its subtitles. (There is still a subtitle issue I have not mentioned: a Japanese video may come with full English subs, while the English audio comes with its own English subs that only cover signs and other on-screen text, not dialogue.)
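One way to apply a computed offset at mux time is ffmpeg's `-itsoffset` flag, which shifts the timestamps of the input that follows it. A sketch of building such a command (file names are placeholders, and this is not necessarily how the app itself invokes ffmpeg):

```python
def build_mux_command(video: str, extra_audio: str, offset: float,
                      output: str) -> list[str]:
    """Build an ffmpeg command that muxes a second audio track, delayed
    (or advanced, for a negative offset) to line up with the base video."""
    return [
        "ffmpeg",
        "-i", video,                    # base video with its own audio
        "-itsoffset", f"{offset:.7f}",  # applies to the *next* input only
        "-i", extra_audio,              # secondary-language audio
        "-map", "0",                    # keep everything from the base file
        "-map", "1:a",                  # plus the shifted audio track
        "-c", "copy",                   # no re-encode
        output,
    ]

cmd = build_mux_command("base.mkv", "eng_audio.mkv", -4.5, "merged.mkv")
```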
Technical version
I used Claude 3 Opus to convert the Python code to TypeScript; the original Python code is also included below.
Getting frame data code (TypeScript version)
import * as fs from 'fs';
import * as path from 'path';
import * as imagehash from 'image-hash';
import { Image } from 'image-js';
import * as math from 'mathjs';
import ffmpeg from 'fluent-ffmpeg'; // missing from the original conversion; the ffmpeg(...) calls below assume fluent-ffmpeg
interface StartFrame {
time: number;
frame: number;
hash: string;
}
class VideoProcessor {
private video_file_path: string;
private video_height: number;
private video_width: number;
private fps: number;
private duration: number;
private check_seconds: number;
private use_third: boolean;
private debug: boolean;
private temp: string;
private language: { code3: string };
private start_frames: StartFrame[];
private end_frames: StartFrame[];
constructor(/* ... */) {
// Initialize the properties in the constructor
// ...
}
public async processing_frames(): Promise<void> {
console.log("Processing Frames");
const process_time = this.check_seconds;
const offset_time = (Math.round(this.duration * 10000) / 10000 - 120 - process_time);
const process = async (duration: number, offset: number, prefix: string): Promise<StartFrame[]> => {
// Start time for measuring processing duration
const start = Date.now();
// Extract frames from the video using ffmpeg
const { stdout, stderr } = await new Promise<{ stdout: Buffer; stderr: string }>((resolve, reject) => {
ffmpeg(this.video_file_path)
.seekInput(offset)
.videoFilters([
{
filter: 'select',
options: 'gt(scene\\,0.1)'
},
{
filter: 'showinfo'
}
])
.output('pipe:')
.outputOptions([
'-t', duration.toString(),
'-fps_mode', 'vfr',
'-frame_pts', '1',
'-f', 'rawvideo',
'-pix_fmt', 'rgb24'
])
.on('error', (err, stdout, stderr) => {
reject(err);
})
.on('end', (stdout, stderr) => {
resolve({ stdout, stderr });
})
.run();
});
// End time for measuring processing duration
const end = Date.now();
if (this.debug) {
console.log(`Time to extract frame data: ${end - start}ms`);
}
// ffmpeg writes the showinfo log to stderr
const err = stderr;
// Initialize lists to store timeframes, start frames, and start video frames
const timeframes: number[] = [];
const start_frames: StartFrame[] = [];
const start_video: Uint8Array[] = [];
// Convert the raw video data to a Uint8Array
const temp_start_video = new Uint8Array(stdout).reduce((resultArray, item, index) => {
const chunkIndex = Math.floor(index / (this.video_width * this.video_height * 3));
if (!resultArray[chunkIndex]) {
resultArray[chunkIndex] = [];
}
resultArray[chunkIndex].push(item);
return resultArray;
}, [] as number[][]);
// If using the middle third of the frames, crop the frames accordingly
if (this.use_third) {
for (const frame of temp_start_video) {
const im_pil = await Image.load(frame);
const crop_x1 = 0;
const crop_x2 = im_pil.width;
const crop_y1 = Math.floor(im_pil.height * 0.333);
const crop_y2 = Math.floor(im_pil.height * 0.666);
const third_image = im_pil.crop({ x: crop_x1, y: crop_y1, width: crop_x2 - crop_x1, height: crop_y2 - crop_y1 });
start_video.push(await third_image.toUint8Array());
}
} else {
start_video.push(...temp_start_video);
}
// Extract timeframes from the ffmpeg error output
const regex = /pts_time(.*?)duration/g;
let match;
while ((match = regex.exec(err)) !== null) {
const timeStr = match[1].trim();
const time = parseFloat(timeStr);
timeframes.push(time);
}
// Create a list of start frames with their corresponding time, frame number, and image hash
for (let i = 0; i < start_video.length; i++) {
const time = timeframes[i] + offset;
const frame = Math.floor(time * this.fps);
const hash = await imagehash.hash(await Image.load(start_video[i]), 8, 'binary');
start_frames.push({ time, frame, hash });
}
// If in debug mode, create a collage of the start frames and save it
if (this.debug) {
const details = path.join(this.temp, `v_${this.language.code3}_${prefix}.png`);
if (fs.existsSync(details)) {
fs.unlinkSync(details);
}
const base_image = await Image.load(start_video[0]);
const new_height = 320;
const new_width = Math.floor(new_height / base_image.height * base_image.width);
const horizontal_images = 5;
const vertical_images = Math.ceil(start_video.length / horizontal_images);
const collage = new Image(new_width * horizontal_images, new_height * vertical_images);
let image_id = 0;
// Iterate over the start frames and paste them onto the collage
for (let y = 0; y < vertical_images; y++) {
for (let x = 0; x < horizontal_images; x++) {
if (image_id < start_frames.length) {
const frame_image = await Image.load(start_video[image_id]);
const resized_image = frame_image.resize({ width: new_width, height: new_height });
collage.drawImage(resized_image, { x: x * new_width, y: y * new_height });
collage.drawText(`${start_frames[image_id].time} (${start_frames[image_id].frame})`, { x: x * new_width, y: y * new_height, color: [255, 0, 0] });
image_id++;
}
}
}
await collage.save(details);
}
// Return the list of start frames
return start_frames;
};
// Process the start frames
this.start_frames = await process(process_time, 0, 'start');
// Process the end frames
this.end_frames = await process(process_time, offset_time, 'end');
}
}
Getting frame data code (Python version)
def processing_frames(self):
print("Processing Frames")
process_time = self.check_seconds
offset_time = (round(self.duration, 4) - 120 - process_time)
def process(duration, offset, prefix):
# Start time for measuring processing duration
start = time.time()
# Extract frames from the video using ffmpeg
out, err = (
ffmpeg
.input(self.video_file_path, ss=offset)
.filter('select', 'gt(scene, 0.1)')
.filter('showinfo')
.output('pipe:', t=duration, fps_mode='vfr', frame_pts=True, format='rawvideo', pix_fmt='rgb24')
.run(capture_stdout=True, capture_stderr=True)
)
# End time for measuring processing duration
end = time.time()
if self.debug:
print("Time to extract frame data: {}".format(end - start))
# Decode the error output from ffmpeg
err = err.decode("utf-8")
# Initialize lists to store timeframes, start frames, and start video frames
timeframes = []
start_frames = []
start_video = []
# Convert the raw video data to a numpy array
temp_start_video = (
np
.frombuffer(out, np.uint8)
.reshape([-1, self.video_height, self.video_width, 3])
)
# If using the middle third of the frames, crop the frames accordingly
if self.use_third:
for frame in temp_start_video:
im_pil = Image.fromarray(frame)
crop_x1 = 0
crop_x2 = im_pil.width
crop_y1 = int(im_pil.height * 0.333)
crop_y2 = int(im_pil.height * 0.666)
third_image = im_pil.crop((crop_x1, crop_y1, crop_x2, crop_y2))
start_video.append(np.array(third_image))
else:
start_video = temp_start_video
# Extract timeframes from the ffmpeg error output
for line in iter(err.splitlines()):
if 'pts_time' in line and 'duration' in line:
timeframes.append(
float(re.findall(r"\d+(?:\.\d+)?", re.search(r'pts_time(.*)duration', line).group(1))[0]))
# Create a list of start frames with their corresponding time, frame number, and image hash
for i in range(len(start_video)):
start_frames.append({'time': timeframes[i] + offset, 'frame': int((timeframes[i] + offset) * self.fps),
'hash': imagehash.phash(Image.fromarray(start_video[i]))})
# If in debug mode, create a collage of the start frames and save it
if self.debug:
details = f'{self.temp}{os.path.sep}v_{self.language.code3}_{prefix}.png'
if os.path.exists(details):
os.remove(details)
base_image = Image.fromarray(start_video[0])
new_height = 320
new_width = int(new_height / base_image.height * base_image.width)
horizontal_images = 5
vertical_images = math.ceil(len(start_video) / horizontal_images)
collage = Image.new("RGBA", (new_width * horizontal_images, new_height * vertical_images))
image_id = 0
font = ImageFont.truetype(r"C:\Windows\Fonts\arial.ttf", 24)
draw = ImageDraw.Draw(collage)
# Iterate over the start frames and paste them onto the collage
for y in range(vertical_images):
for x in range(horizontal_images):
if len(start_frames) > image_id:
collage.paste(Image.fromarray(start_video[image_id]).convert("RGBA").resize((new_width, new_height)), (x * new_width, y * new_height))
draw.text((x * new_width, y * new_height), text=f'{start_frames[image_id]["time"]} ({start_frames[image_id]["frame"]})', font=font, align="left",
fill="red")
image_id += 1
collage.save(details)
# Return the list of start frames
return start_frames
# Process the start frames
self.start_frames = process(process_time, 0, 'start')
# Process the end frames
self.end_frames = process(process_time, offset_time, 'end')
Offset calculation in the provided code compares the current video against a base video, so if the base video is the same as the current one, nothing is calculated.
Comparing frame data
interface FrameData {
hash: number;
time: number;
}
interface VideoData {
start_frames: FrameData[];
end_frames: FrameData[];
language: {
english_name: string;
};
threshold: number;
forced_offset: boolean;
forced_offset_value: number;
}
function calculate_offset(base_data: VideoData): void {
/**
* Calculate the offset between base_data and self.
* @param base_data The base video data to compare against.
*/
if (base_data === this) {
return;
}
function compare_frames(base_frames: FrameData[], current_frames: FrameData[], reverse = false): [boolean, number] {
/**
* Compare frames between base_frames and current_frames to find the offset.
* @param base_frames Frames from the base video data.
* @param current_frames Frames from the current video data.
* @param reverse Whether to compare frames in reverse order.
* @return A tuple indicating if an offset is found and the offset value.
*/
let have_offset = false;
let offset = 0;
const hash_threshold = 1;
// Iterate over frames in the specified order
// base_index and pair_index are hoisted above the loop so the offset calculation after it can see them (they were block-scoped in the original conversion, which broke that line)
let base_index = 0;
let pair_index = 0;
for (let i = reverse ? base_frames.length - 2 : 0; reverse ? i >= 0 : i < base_frames.length - 1; reverse ? i-- : i++) {
base_index = i;
const base_second_index = i + 1;
let base_check_value = 64;
let pair_check_value = 64;
pair_index = 0;
let pair_second_index = 0;
// Compare hash values of frames between base_frames and current_frames
for (let j = reverse ? current_frames.length - 1 : 0; reverse ? j >= 0 : j < current_frames.length; reverse ? j-- : j++) {
// Python's imagehash overloads '-' to return the Hamming distance between hashes; a direct TS port needs an explicit Hamming-distance function here
let hash_diff = base_frames[base_index].hash - current_frames[j].hash;
if (hash_diff < hash_threshold && hash_diff < base_check_value) {
base_check_value = hash_diff;
pair_index = j;
}
hash_diff = base_frames[base_second_index].hash - current_frames[j].hash;
if (hash_diff < hash_threshold && hash_diff < pair_check_value) {
pair_check_value = hash_diff;
pair_second_index = j;
}
}
// Check if consecutive frames match
if (pair_index + 1 === pair_second_index) {
have_offset = true;
break;
}
}
if (have_offset) {
offset = base_frames[base_index].time - current_frames[pair_index].time;
}
return [have_offset, offset];
}
// Calculate offset using start frames if not forced
if (!this.forced_offset) {
[this.have_offset, this.offset] = compare_frames(base_data.start_frames, this.start_frames, true);
// Calculate offset using end frames and check tolerance
const [have_end_offset, end_offset] = compare_frames(base_data.end_frames, this.end_frames, true);
const end_tolerance = this.threshold;
const check_end = Math.abs(Math.abs(this.offset) - Math.abs(end_offset)) < end_tolerance;
this.have_offset = this.have_offset && check_end;
this.offset = Math.round(this.offset * 10000000) / 10000000;
console.log(`${this.language.english_name} offset ${this.offset} is used - ${this.have_offset ? "\x1b[32m" : "\x1b[31m"}${this.have_offset}\x1b[0m. End tolerance ${Math.abs(Math.abs(this.offset) - Math.abs(end_offset))} (needed ${end_tolerance} to pass)`);
} else {
// Use forced offset value
this.have_offset = true;
this.offset = this.forced_offset_value;
console.log(`Using \x1b[34mForced offset\x1b[0m \x1b[33m${this.offset}\x1b[0m`);
}
}
Comparing frame data in Python
def calculate_offset(self, base_data):
"""
Calculate the offset between base_data and self.
:param VideoData base_data: The base video data to compare against.
"""
if base_data == self:
return
def compare_frames(base_frames, current_frames, reverse=False):
"""
Compare frames between base_frames and current_frames to find the offset.
:param base_frames: Frames from the base video data.
:param current_frames: Frames from the current video data.
:param reverse: Whether to compare frames in reverse order.
:return: A tuple indicating if an offset is found and the offset value.
"""
have_offset = False
offset = 0
hash_threshold = 1
# Iterate over frames in the specified order
for i in reversed(range(len(base_frames) - 1)) if reverse else range(len(base_frames) - 1):
base_index = i
base_second_index = i + 1
base_check_value = 64
pair_check_value = 64
pair_index = 0
pair_second_index = 0
# Compare hash values of frames between base_frames and current_frames
for j in reversed(range(len(current_frames))) if reverse else range(len(current_frames)):
hash_diff = base_frames[base_index]['hash'] - current_frames[j]['hash']
if hash_diff < hash_threshold and hash_diff < base_check_value:
base_check_value = hash_diff
pair_index = j
hash_diff = base_frames[base_second_index]['hash'] - current_frames[j]['hash']
if hash_diff < hash_threshold and hash_diff < pair_check_value:
pair_check_value = hash_diff
pair_second_index = j
# Check if consecutive frames match
if pair_index + 1 == pair_second_index:
have_offset = True
break
if have_offset:
offset = base_frames[base_index]['time'] - current_frames[pair_index]['time']
return have_offset, offset
# Calculate offset using start frames if not forced
if not self.forced_offset:
self.have_offset, self.offset = compare_frames(base_data.start_frames, self.start_frames, True)
# Calculate offset using end frames and check tolerance
have_end_offset, end_offset = compare_frames(base_data.end_frames, self.end_frames, True)
end_tolerance = self.threshold
check_end = abs(abs(self.offset) - abs(end_offset)) < end_tolerance
self.have_offset = self.have_offset and check_end
self.offset = round(self.offset, 7)
color = "green" if self.have_offset else "red"
console.print(f'{self.language.english_name} offset {self.offset} is used - '
f'[{color}]{self.have_offset}[/{color}]. '
f'End tolerance {abs(abs(self.offset) - abs(end_offset))} (needed {end_tolerance} to pass)')
else:
# Use forced offset value
self.have_offset = True
self.offset = self.forced_offset_value
console.print(f'Using [blue]Forced offset[/blue] [yellow]{self.offset}[/yellow]')
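Applying the same offset to text subtitles is just timestamp arithmetic. A minimal SRT sketch (an illustration only; the subtitle-shifting part of my code is not shown above):

```python
import re

_SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def _shift_time(match: re.Match, offset: float) -> str:
    h, m, s, ms = map(int, match.groups())
    total_ms = (h * 3600 + m * 60 + s) * 1000 + ms + int(offset * 1000)
    total_ms = max(0, total_ms)  # clamp instead of going negative
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def shift_srt(text: str, offset: float) -> str:
    """Shift every hh:mm:ss,mmm timestamp in an SRT document by `offset` seconds."""
    return _SRT_TIME.sub(lambda m: _shift_time(m, offset), text)

shift_srt("00:00:10,000 --> 00:00:12,500", -4.5)
# "00:00:05,500 --> 00:00:08,000"
```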
I am sure I forgot some details about the code and will be happy to answer anything.