Video/Audio Sync & Merge — PySide6 Edition

A focused desktop tool to analyze A/V timing and perform a lossless MKV remux with predictable, auditable behavior.
The app discovers delays, applies a positive‑only global shift (so no content gets trimmed), handles chapters and subtitles, and writes a mkvmerge options file you can inspect and replay.

Scope of this README
This document describes the baseline application, before any experimental “expert/advanced repair” features. It is intentionally exhaustive so a new contributor can understand end‑to‑end behavior and implementation details.

Key Ideas
Architecture
Job Lifecycle
Delay Discovery Engines
- Audio Cross‑Correlation (deep dive)
- VideoDiff
Positive‑Only Timing Model
Merge Planning
- Profile‑Driven (Merge Plan)
- Manual Selection
Subtitles
Chapters
Attachments
mkvmerge Options File
Temporary Files, Outputs, Logs
GUI Overview
Configuration Reference
Dependencies
Run It
Troubleshooting
Developer Notes
Design Invariants & Edge Cases
Performance Notes
Known Limitations
Appendix A: Log Line Guide
Appendix B: Demux Extension Map
Appendix C: Configuration Keys (Table)

Key Ideas

Lossless by design: the pipeline never applies negative per‑track delays that could discard leading content. We transform all timings into a non‑negative scheme via a global shift.
Deterministic & auditable: we construct the mkvmerge command as a token array and persist it (opts.json). You can read, diff, and replay it.
Separation of concerns: analysis, planning, extraction, subtitles, chapters, and merging are isolated but connected through explicit data structures.
Language‑aware analysis (optional): reference/target audio stream selection can prefer a specific language tag (e.g., jpn, eng) to improve correlation robustness.

Architecture

repo-root/
├─ main.py                      # App entry point
├─ vsg_core/                    # Headless engine
│  ├─ analysis.py               # Delay discovery (Audio XCorr, VideoDiff)
│  ├─ config.py                 # Settings load/save, defaults, dir creation
│  ├─ job_discovery.py          # Single-file & folder batch discovery
│  ├─ mkv_utils.py              # mkvmerge/mkvextract helpers, demux, chapters
│  ├─ pipeline.py               # Orchestration of a full job
│  ├─ process.py                # Command runner with compact logging
│  └─ subtitle_utils.py         # SRT→ASS, rescale, font-size multiply
└─ vsg_qt/                      # PySide6 GUI
   ├─ main_window.py            # Main window, file pickers, log, actions
   ├─ worker.py                 # Threaded job runner (no UI freeze)
   ├─ manual_selection_dialog.py# Drag/drop track picker + flags
   ├─ options_dialog.py         # Settings tabs & Merge Profile editor
   └─ track_widget.py           # Per-track widget with toggles

Key modules

vsg_core.process.CommandRunner — canonical way to run external processes with uniform logging, progress throttling, error tails.
vsg_core.analysis — two delay engines:
- Audio cross‑correlation (librosa + scipy) with chunked extraction via ffmpeg.
- VideoDiff integration (external tool), with error bounds.
vsg_core.pipeline.JobPipeline — the orchestrator. Calls analysis, converts delays into positive‑only residuals, builds extraction + mkvmerge plans, executes merge, writes artifacts.
vsg_core.mkv_utils — inspects MKVs (mkvmerge -J), extracts tracks (mkvextract + ffmpeg for special cases), processes chapters (XML), and attaches files.
vsg_core.subtitle_utils — SRT→ASS conversion, ASS PlayRes rescale, font size multiplication (style‑aware, line‑by‑line safe).

Job Lifecycle

Processing one Reference file with optional Secondary and Tertiary files:

Analysis
- For each Secondary/Tertiary, compute delay vs Reference using selected engine.
- Log per‑chunk results and final “determined” delay.
Merge Planning
- Convert raw delays to a positive‑only scheme (see below).
- Build a track plan:
  - Merge Plan mode uses the JSON‑like rule set in settings.
  - Manual Selection mode uses the user’s drag/drop layout.
Extraction
- Demux only those tracks that appear in the final plan. Special handling for A_MS/ACM (attempt copy, else decode to PCM with correct bit depth).
Subtitles (optional per track)
- Convert SRT→ASS, rescale PlayRes to match video, multiply text size.
Chapters
- Optional rename to “Chapter NN”.
- Shift all ChapterTimeStart/End by the global shift.
- Optional snap starts (and optionally ends) to keyframes within a threshold.
Merge Execution
- Create opts.json (mkvmerge token array) and (optionally) a pretty dump.
- Run mkvmerge with @opts.json.
Cleanup
- On success (or on analyze‑only), remove the temp job directory.
- Logs are written next to the output MKV (or in the chosen output folder).

Delay Discovery Engines

Audio Cross‑Correlation (deep dive)

Objective: Estimate relative delay (ms) between the Reference audio stream and a target (Secondary/Tertiary) stream.

Even when languages differ, many mixes share music/SFX transients. By correlating short segments that avoid long silences, we locate a robust lag.

Stream selection (language‑aware)

Inspect mkvmerge -J JSON and enumerate tracks of type audio.
If the user provided a language code (e.g., analysis_lang_sec="jpn"), pick the first audio whose properties.language matches.
Otherwise, use the first audio track.
Indexes map to ffmpeg -map 0:a:<index> for extraction.
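
The selection rules above can be sketched as follows. This is a minimal illustration, not the code in vsg_core.analysis: the function name pick_audio_index is hypothetical, and falling back to the first audio track when the requested language is absent is an assumption. The input is the parsed JSON from mkvmerge -J.

```python
from typing import Optional


def pick_audio_index(mkv_info: dict, lang: Optional[str] = None) -> Optional[int]:
    """Return the audio-relative index for `ffmpeg -map 0:a:<index>`, or None."""
    # Keep only audio tracks, in container order.
    audio = [t for t in mkv_info.get("tracks", []) if t.get("type") == "audio"]
    if not audio:
        return None
    if lang:
        wanted = lang.strip().lower()
        for idx, track in enumerate(audio):
            track_lang = (track.get("properties", {}).get("language") or "").lower()
            if track_lang == wanted:
                return idx
    # Assumption: no language given, or no match, falls back to the first audio track.
    return 0
```

The returned value counts audio tracks only, which is why it can be used directly in 0:a:<index> rather than as an absolute track ID.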

For best robustness, prefer like‑for‑like (e.g., JPN vs JPN) when available.

Window extraction

For each scan point tᵢ, extract mono, 48 kHz WAV windows from both files:

ffmpeg -y -v error -ss <tᵢ> -i <file> -map 0:a:<idx> -t <dur> -vn \
       -acodec pcm_s16le -ar 48000 -ac 1 <out.wav>

Default duration dur = 15s (configurable).
Conservative logging: the exact command line is mirrored to the GUI log.
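
As a hedged sketch, the same command can be assembled as an argument list in Python. The helper name build_extract_cmd is hypothetical; the flags are exactly those shown above.

```python
def build_extract_cmd(src, audio_idx, start_s, out_wav, dur_s=15):
    """Build the ffmpeg argument list for one mono 48 kHz analysis window."""
    return [
        "ffmpeg", "-y", "-v", "error",
        "-ss", str(start_s),            # seek before opening the input (fast seek)
        "-i", str(src),
        "-map", f"0:a:{audio_idx}",     # audio-relative stream index
        "-t", str(dur_s),               # window length in seconds
        "-vn",
        "-acodec", "pcm_s16le", "-ar", "48000", "-ac", "1",
        str(out_wav),
    ]
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with file names that contain spaces.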

Scan schedule

Determine program duration D from ffprobe (format duration).
Defaults:
- scan_chunk_count = 10
- scan_chunk_duration = 15
- We analyze a band [0.10*D, 0.90*D) and distribute windows evenly (skip early logos and late credits to avoid silence).

This balances robustness with I/O/runtime.
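
One way to lay out the windows is sketched below. The README only states that windows are spread evenly over [0.10*D, 0.90*D); the exact spacing formula and the name scan_starts are assumptions made for illustration.

```python
def scan_starts(duration_s, count=10, lo=0.10, hi=0.90):
    """Return `count` window start times (seconds) spread evenly over [lo*D, hi*D)."""
    if duration_s <= 0 or count <= 0:
        return []
    start = lo * duration_s
    # Assumption: equal steps, so the last window starts one step before hi*D.
    step = (hi - lo) * duration_s / count
    return [start + i * step for i in range(count)]
```

For a 1000 s program with the defaults, this yields starts at 100 s, 180 s, and so on up to 820 s.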

DSP steps

We load WAVs (mono) with librosa, preserving native rate (48 kHz). Then normalize to z‑scores to remove gain bias:

x = (x - mean(x)) / (std(x) + 1e-9)
y = (y - mean(y)) / (std(y) + 1e-9)

Compute full discrete cross‑correlation (scipy), then find the peak lag:

c = correlate(x, y, mode='full', method='auto')
k* = argmax(c) - (len(y) - 1)     # lag in samples (y vs x)
τ  = k* / fs                       # seconds
delay_ms = round(1000 * τ)

We also compute a match/confidence heuristic:

norm = sqrt( sum(x^2) * sum(y^2) )
match_pct = 100 * (max(|c|) / (norm + 1e-9))

This is a normalized peak height — higher is “sharper” alignment. It is not a probability.
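
The formulas above translate into a short function. This sketch uses numpy.correlate instead of scipy.signal.correlate to keep the example dependency-light; both use the same lag convention, but the direct method is far slower on full 15 s windows, where scipy's FFT path is the practical choice. The function name xcorr_delay is hypothetical.

```python
import numpy as np


def xcorr_delay(x, y, fs=48000):
    """Return (delay_ms, match_pct) for reference x and target y (1-D arrays)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # z-score normalization removes DC offset and gain differences.
    x = (x - x.mean()) / (x.std() + 1e-9)
    y = (y - y.mean()) / (y.std() + 1e-9)
    # Full cross-correlation; index 0 corresponds to lag -(len(y) - 1).
    c = np.correlate(x, y, mode="full")
    lag = int(np.argmax(c)) - (len(y) - 1)
    delay_ms = int(round(1000.0 * lag / fs))
    # Normalized peak height (a heuristic, not a probability).
    norm = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    match_pct = float(100.0 * np.max(np.abs(c)) / (norm + 1e-9))
    return delay_ms, match_pct
```

With this convention, a target that lags the reference by d samples produces a lag of -d, so a late target shows up as a negative delay.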

Pre‑whitening / DC removal

The z‑score step is not spectral whitening in the strict sense. It removes the DC offset and normalizes overall gain, which reduces the impact of level differences between mixes so that edge energy (transients) drives the peak.

Windowing considerations

15s windows capture multiple transients; longer windows (20–30s) increase robustness at the cost of time.
If matches are weak, increase duration or adjust language selection.

Aggregating results across windows

Drop windows with match_pct <= min_match_pct (default 5.0).
Compute the mode (most frequent) delay among remaining windows.
From the modal group, pick the window with max match %. That tuple (delay_ms, match%) is the determined delay.

This favors consistency over any single window’s outlier result.
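
The aggregation can be sketched with the standard library. The name aggregate is hypothetical, and the tie-break (when two delays are equally frequent, the higher match wins) is an assumption, since the README does not specify it.

```python
from collections import Counter


def aggregate(results, min_match_pct=5.0):
    """results: list of (delay_ms, match_pct). Return the determined tuple or None."""
    # Drop weak windows (match at or below the threshold).
    kept = [(d, m) for d, m in results if m > min_match_pct]
    if not kept:
        return None
    counts = Counter(d for d, _ in kept)
    top = max(counts.values())
    # Assumption: if several delays tie for most frequent, consider all of them.
    modal = {d for d, n in counts.items() if n == top}
    # Within the modal group, pick the window with the best match.
    return max((r for r in kept if r[0] in modal), key=lambda r: r[1])
```

Note that a single window with a very high match but a different delay does not override the modal delay.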

Practical tuning

Symptom	What to try
Low match% overall	Increase `scan_chunk_duration` to 20–30s; ensure language selection compares similar mixes (e.g., JPN vs JPN).
Two clusters of delays	Baseline uses one global delay; confirm with Analyze Only and consider whether underlying media truly has a splice (outside baseline scope).
Slow analysis	Reduce `scan_chunk_count`; shrink scan band if needed.

VideoDiff

If analysis_mode="VideoDiff":

Execute external videodiff with (ref, target) and parse the last [Result] line.
Extract either ss: or itsoffset: seconds and error: value.
If kind is ss, invert sign for our delay semantics.
Enforce that error ∈ [videodiff_error_min, videodiff_error_max]; otherwise reject result.

Use this when audio mixes are too divergent for correlation (e.g., commentary tracks).
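
A rough parsing sketch follows. The exact layout of the [Result] line is not documented here, so the regular expressions assume only that the line contains "ss:" or "itsoffset:" followed by a number of seconds, and "error:" followed by a number. The function name parse_videodiff is hypothetical.

```python
import re

_NUM = r"(-?\d+(?:\.\d+)?)"


def parse_videodiff(output, err_min, err_max):
    """Return (delay_ms, error) from videodiff output, or None if rejected."""
    result_lines = [ln for ln in output.splitlines() if "[Result]" in ln]
    if not result_lines:
        return None
    last = result_lines[-1]  # only the last [Result] line is used
    # Assumed layout: "<kind>: <seconds>" and "error: <value>" somewhere on the line.
    m_off = re.search(r"\b(ss|itsoffset)\s*:\s*" + _NUM, last)
    m_err = re.search(r"\berror\s*:\s*" + _NUM, last)
    if not (m_off and m_err):
        return None
    kind, seconds = m_off.group(1), float(m_off.group(2))
    error = float(m_err.group(1))
    if not (err_min <= error <= err_max):
        return None  # outside the configured error bounds
    delay_ms = round(seconds * 1000)
    if kind == "ss":
        delay_ms = -delay_ms  # invert sign for the delay semantics described above
    return delay_ms, error
```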

Positive‑Only Timing Model

Problem: mkvmerge --sync with negative values can drop leading content.

Solution: Convert raw delays to non‑negative residuals by applying a global shift equal to the absolute value of the most negative delay. If no delay is negative, the shift is zero.

Let raw delays (ms) be:

ref = 0
sec = -1001
ter = -1000

Global shift: global_shift = -min(ref, sec, ter) = 1001
Residuals:
- ref_resid = ref + global_shift = 1001
- sec_resid = sec + global_shift = 0
- ter_resid = ter + global_shift = 1
Merge sync flags (per input group): --sync 0:<residual_ms>
Chapters: shift all timestamps by +global_shift so chapters align with delayed streams.

This guarantees that no input is asked to start before t=0 in mkvmerge, eliminating trimming.
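
The arithmetic is small enough to show in full. This is a sketch with a hypothetical function name; the max(0, ...) guard is an added assumption that covers inputs where no delay is negative and ref is not included.

```python
def positive_only(raw_delays_ms):
    """Return (global_shift_ms, residuals) so that every residual is >= 0."""
    # Shift by the magnitude of the most negative delay; never shift backwards.
    shift = max(0, -min(raw_delays_ms.values()))
    residuals = {name: delay + shift for name, delay in raw_delays_ms.items()}
    return shift, residuals


shift, resid = positive_only({"ref": 0, "sec": -1001, "ter": -1000})
# shift is 1001; resid is {"ref": 1001, "sec": 0, "ter": 1}
```

The residuals are what go into --sync, and the same shift is added to every chapter timestamp.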

Merge Planning

After delays are converted, the planner builds a final list of (track, flags) entries and a --track-order to match GUI order.

Profile‑Driven (Merge Plan)

A prioritized list of rules (Settings → Merge Plan) defines what to include. Each rule:

source: REF | SEC | TER
type: Video | Audio | Subtitles
lang: CSV or any (match against properties.language)
exclude_langs: CSV (omit these even if lang=any)
enabled: bool
is_default: bool (first match of this type becomes the default track)
is_forced_display: bool (subs)
swap_first_two: bool (subs; swap first two matches)
apply_track_name: bool (pass the input’s track_name to output)
rescale: bool (ASS/SSA only; rewrite PlayRes to video)

Global codec exclusions (exclude_codecs) filter all matches whose codec id contains any excluded token (e.g., ac3, dts, pcm).

Default logic

The first video is implicitly default.
At most one audio track and one subtitle track are marked default via flags; every other track of that type gets --default-track-flag 0:no.

Per‑track mkvmerge arguments (generated in order):

--language 0:<lang>
--track-name 0:<name>                 # if apply_track_name and input had a name
--sync 0:<residual_ms>                # residual = raw delay + global_shift, always >= 0
--default-track-flag 0:<yes/no>
--forced-display-flag 0:yes           # if is_forced_display
--compression 0:none
--remove-dialog-normalization-gain 0  # if enabled and codec is AC3/E-AC3
( <extracted-file-path> )

Finally we add attachments (if any) and --track-order <inputIdx0>:0,<inputIdx1>:0,....
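
A sketch of how one input group might be turned into tokens, following the order listed above. The function names and parameters are hypothetical; only the mkvmerge flags come from this README.

```python
def track_tokens(path, lang, sync_ms, is_default,
                 name=None, forced=False, remove_dialnorm=False):
    """Build the mkvmerge tokens for one extracted single-track input."""
    if sync_ms < 0:
        # Invariant of the positive-only model: residuals are never negative.
        raise ValueError("sync_ms must be >= 0")
    tokens = ["--language", f"0:{lang}"]
    if name:
        tokens += ["--track-name", f"0:{name}"]
    tokens += ["--sync", f"0:{sync_ms}"]
    tokens += ["--default-track-flag", "0:yes" if is_default else "0:no"]
    if forced:
        tokens += ["--forced-display-flag", "0:yes"]
    tokens += ["--compression", "0:none"]
    if remove_dialnorm:
        tokens += ["--remove-dialog-normalization-gain", "0"]
    tokens += ["(", str(path), ")"]
    return tokens


def track_order(n_inputs):
    """Each input holds one track (ID 0), so the order is <inputIdx>:0 per input."""
    return ["--track-order", ",".join(f"{i}:0" for i in range(n_inputs))]
```

Because every extracted file contains exactly one track, the track ID in each flag is always 0 and the output order is controlled purely by input order.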

Manual Selection

Instead of rules, the user drags tracks into Final Output:

Reorder entries to match desired output.
Per‑entry toggles: Default (A/V/S), Forced (subs), Keep Name, Convert to ASS (SRT), Rescale (ASS/SSA), Size Multiplier (subs).

Batch auto‑apply: If enabled, and the shape signature (counts of [source × type]) of the next file matches the previous, automatically carry over the layout.

All subsequent stages (extraction, chapters, merge) are identical to the rule‑based plan.
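
The shape signature used by batch auto‑apply can be sketched as a counter over (source, type) pairs. The track dictionaries and key names here are assumptions for illustration.

```python
from collections import Counter


def shape_signature(tracks):
    """Count tracks per (source, type) pair, e.g. ('REF', 'Audio') -> 2."""
    return Counter((t["source"], t["type"]) for t in tracks)


prev_file = [
    {"source": "REF", "type": "Video"},
    {"source": "REF", "type": "Audio"},
    {"source": "SEC", "type": "Audio"},
]
next_file = [
    {"source": "SEC", "type": "Audio"},
    {"source": "REF", "type": "Audio"},
    {"source": "REF", "type": "Video"},
]
# Order within the file does not matter, only the counts per pair.
can_auto_apply = shape_signature(prev_file) == shape_signature(next_file)
```

Languages and codecs are deliberately not part of the signature in this sketch, which matches the "counts of [source × type]" description.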

Subtitles

SRT → ASS (optional): conversion runs via ffmpeg; if the converted .ass file is produced, the track plan points at it instead of the original .srt.
Rescale PlayRes (ASS/SSA only): probe reference video width/height via ffprobe and rewrite PlayResX/PlayResY if they differ.
Font size multiplier (ASS/SSA only): parse Style: lines and multiply the font size value, keeping other fields intact. Safe parsing avoids corrupting the file.
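
The font size multiplier can be sketched as a line‑by‑line rewrite. This assumes the standard V4+ style field order, where Fontsize is the third comma‑separated field after "Style:"; a stricter version would read the field order from the Format: line. The function name is hypothetical.

```python
def scale_style_fonts(ass_text, multiplier):
    """Multiply the font size on every 'Style:' line; leave all other lines untouched."""
    out = []
    for line in ass_text.splitlines(keepends=True):
        if line.startswith("Style:"):
            fields = line.split(",")
            # Assumption: fields are Name, Fontname, Fontsize, ... (standard V4+ order).
            if len(fields) > 2:
                try:
                    size = float(fields[2])
                except ValueError:
                    pass  # unexpected layout: keep the line as-is rather than corrupt it
                else:
                    fields[2] = f"{size * multiplier:g}"
                    line = ",".join(fields)
        out.append(line)
    return "".join(out)
```

Dialogue lines are never split, so commas inside subtitle text cannot be disturbed.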

Chapters

Rename (optional): clear ChapterDisplay and write “Chapter NN”.
Shift timestamps: add global_shift (ms → ns) to both ChapterTimeStart and ChapterTimeEnd.
Snap to keyframes (optional): probe keyframes via ffprobe and, for starts (and optionally ends), move within snap_threshold_ms according to mode (previous or nearest).
Normalize ends: ensure each chapter has a valid end; cap ends to next chapter’s start; guarantee strictly increasing intervals.

All chapter edits are written to a temporary XML file and passed to mkvmerge via --chapters.
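
The shift and snap steps can be illustrated with integer nanosecond timestamps. This is a sketch under assumptions: keyframes are a sorted list of nanosecond values, "previous" means the last keyframe at or before the timestamp, and a timestamp is left unchanged when the candidate is farther than the threshold. Function names are hypothetical.

```python
from bisect import bisect_right

NS_PER_MS = 1_000_000


def shift_ns(ts_ns, global_shift_ms):
    """Add the global shift (ms) to a chapter timestamp (ns)."""
    return ts_ns + global_shift_ms * NS_PER_MS


def snap_to_keyframe(ts_ns, keyframes_ns, mode="previous", threshold_ms=250):
    """Snap ts_ns to a keyframe if one lies within the threshold; else return ts_ns."""
    if not keyframes_ns:
        return ts_ns
    i = bisect_right(keyframes_ns, ts_ns)
    prev_kf = keyframes_ns[i - 1] if i > 0 else None
    next_kf = keyframes_ns[i] if i < len(keyframes_ns) else None
    if mode == "previous":
        candidate = prev_kf
    else:  # "nearest": closest of the two neighbours, ties go to the earlier one
        options = [k for k in (prev_kf, next_kf) if k is not None]
        candidate = min(options, key=lambda k: abs(k - ts_ns))
    if candidate is None or abs(candidate - ts_ns) > threshold_ms * NS_PER_MS:
        return ts_ns  # too far: leave the chapter where it is
    return candidate
```

Working in integer nanoseconds avoids floating‑point rounding when the values are written back into the chapter XML.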

Attachments

If the Tertiary file contains attachments (fonts, images), we extract and --attach-file them to the output. These are input‑agnostic artifacts and do not interact with sync timing.

mkvmerge Options File

We emit tokens as JSON (opts.json) and run:

mkvmerge @<opts.json>

Optionally we also write a pretty text dump (opts.pretty.txt) for human inspection:

--output "<out.mkv>" \
  --chapters "<chapters.xml>" \
  --language 0:jpn --sync 0:1001 --default-track-flag 0:yes --compression 0:none ( "<ref_video.h264>" ) \
  ...

This makes the merge reproducible and easy to debug.
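
Writing and replaying the options file can be sketched as follows. mkvmerge accepts a JSON option file whose content is an array of strings when the file name ends in .json; the helper names below are hypothetical.

```python
import json
from pathlib import Path


def write_opts(tokens, job_dir):
    """Persist the mkvmerge token array as opts.json and return its path."""
    path = Path(job_dir) / "opts.json"
    # One string per command-line token; no shell quoting is involved.
    path.write_text(json.dumps(tokens, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def mkvmerge_cmd(opts_path):
    """Command list that replays the persisted options."""
    return ["mkvmerge", f"@{opts_path}"]
```

Since the tokens are stored one per array element, paths with spaces or quotes need no escaping, and two runs can be compared with an ordinary text diff.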

Temporary Files, Outputs, Logs

Each job creates a unique temp dir: temp_root/job_<ref-stem>_<epoch>/ with:

ref_track_*, sec_track_*, ter_track_*: demuxed streams
_chapters_modified.xml: edited chapters
opts.json (+ optional opts.pretty.txt): mkvmerge args
wav_*: short analysis windows for audio correlation
att_*: attachments from TER if present

Output: <output_folder>/<ReferenceFileName>.mkv (same filename as Reference)
Run log: <output_folder>/<ReferenceFileName>.log

Compact logging shows throttled progress and, on error, prints only the tail of stderr, which keeps the signal‑to‑noise ratio of the log high.

GUI Overview

Inputs: Reference, Secondary, Tertiary (files or directories).
Modes:
- Merge Plan (profile rules)
- Manual Selection (drag/drop final list; optional auto‑apply across batch)
Actions:
- Analyze Only → Compute and display delays (no merge).
- Analyze & Merge → Full pipeline.
Settings Tabs:
- Storage: output folder, temp root, optional VideoDiff path
- Analysis: engine choice, chunk count/duration, min match %, VideoDiff error bounds, language prefs (REF/SEC/TER)
- Chapters: rename, snap mode/threshold, starts‑only toggle
- Merge Behavior: remove dialog normalization gain, codec blacklist, disable track statistics tags
- Logging: compact mode, autoscroll, progress step %, error tail lines, pretty/json options dump
- Merge Plan: rule editor with priority ordering

Configuration Reference

Settings are persisted to settings.json. Missing keys are auto‑added with defaults.

Storage & Tools
- output_folder (str) — default sync_output/ under repo root
- temp_root (str) — default temp_work/ under repo root
- videodiff_path (str) — blank uses PATH
Analysis
- analysis_mode (str) — "Audio Correlation" | "VideoDiff"
- scan_chunk_count (int) — default 10
- scan_chunk_duration (int, seconds) — default 15
- min_match_pct (float) — default 5.0
- analysis_lang_ref / analysis_lang_sec / analysis_lang_ter (str, optional ISO like jpn, eng)
- videodiff_error_min / videodiff_error_max (float) — bounds for VideoDiff acceptance
Workflow
- merge_mode (str) — "plan" | "manual"
Chapters
- rename_chapters (bool)
- snap_chapters (bool)
- snap_mode (str) — "previous" | "nearest"
- snap_threshold_ms (int) — default 250
- snap_starts_only (bool) — only snap chapter starts
Merge Behavior
- apply_dialog_norm_gain (bool) — remove dialnorm for AC3/E‑AC3
- exclude_codecs (str) — comma list (e.g., "ac3, dts, pcm")
- disable_track_statistics_tags (bool)
- merge_profile (list[rule]) — see Merge Plan
Logging
- log_compact (bool) — compact stdout
- log_autoscroll (bool) — GUI behavior
- log_progress_step (int %) — progress throttling (e.g., 20 → 0/20/40/60/80/100)
- log_error_tail (int lines) — tail lines printed on error
- log_tail_lines (int lines) — tail lines printed on success
- log_show_options_pretty / log_show_options_json (bool)
Archival
- archive_logs (bool) — after batch, zip per‑file logs and delete the originals
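
The "missing keys are auto‑added with defaults" behavior can be sketched like this. Only a subset of defaults is shown, the function name is hypothetical, and treating an unreadable or corrupt file as empty is an assumption.

```python
import json
from pathlib import Path

# Subset of the defaults listed above, for illustration only.
DEFAULTS = {"scan_chunk_count": 10, "scan_chunk_duration": 15, "min_match_pct": 5.0}


def load_settings(path, defaults=DEFAULTS):
    """Load settings.json, fill in missing keys, and write back if anything changed."""
    path = Path(path)
    try:
        loaded = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        loaded = {}  # assumption: start from defaults when the file is absent or invalid
    if not isinstance(loaded, dict):
        loaded = {}
    merged = {**defaults, **loaded}  # user values win over defaults
    if merged != loaded:
        path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged
```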

Dependencies

Python 3.9+
MKVToolNix: mkvmerge, mkvextract
FFmpeg: ffmpeg, ffprobe
VideoDiff (optional): if using that mode
Python packages: PySide6, librosa, numpy, scipy

Ensure binaries are on PATH (or set explicit paths in Settings).

Run It

python main.py

Select Reference (and optional Secondary/Tertiary). Files or matching folders.
Choose Analyze Only to validate delays, or Analyze & Merge to produce the final MKV.
Watch the log for:
- Per‑chunk XCorr lines and final delays
- Positive‑only global shift
- Chapter processing summary
- mkvmerge options file path
- Success path for the output file

Troubleshooting

Tool not found — make sure mkvmerge, mkvextract, ffmpeg, ffprobe are installed and on PATH.
XCorr unstable/low confidence — increase scan_chunk_duration and/or scan_chunk_count; ensure language selection targets comparable mixes (JPN vs JPN, ENG vs ENG).
Defaults/forced flags not what you expected — In Merge Plan, check rule ordering and flags; in Manual mode, adjust the final list toggles.
Chapters misaligned — Verify global_shift in logs equals the shift applied to chapter XML; if snapping is on, try increasing snap_threshold_ms or switch snap_mode.
mkvmerge failure — Open opts.json, replay via terminal, and examine stderr; the app also prints the error tail.

Developer Notes

Demux strategy: mkvextract for general tracks; for A_MS/ACM we first try stream copy with ffmpeg, else decode to PCM using a bit‑depth‑aware codec (pcm_s16le, pcm_s24le, pcm_s32le, pcm_f64le).
Language selection (analysis only) is independent from merge inclusion rules.
Track ordering is fully deterministic: we append inputs in the exact GUI/plan order and then emit a matching --track-order.
Logging style balances signal/noise; compact mode prints progress and only a tail of verbose output on success/failure.

Design Invariants & Edge Cases

No negative --sync is ever passed to mkvmerge; all per‑input --sync values are ≥ 0 after applying the global shift.
Reference video dictates chapter rescale and subtitle PlayRes.
Manual layout auto‑apply is only used when the shape signature (counts of [source × type]) matches the prior job to prevent accidental mismatches.
Codec exclusions are substring checks against codec_id lowercased (e.g., a_ac3, a_dts, a_pcm).
Chapters normalization ensures strictly increasing intervals and prevents open‑ended atoms from overlapping into the next.
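
The codec exclusion invariant can be sketched as a case‑insensitive substring test. The function name is hypothetical; the input format matches the exclude_codecs comma list.

```python
def is_excluded(codec_id, exclude_codecs):
    """True if any comma-separated token occurs in the lowercased codec_id."""
    tokens = [t.strip().lower() for t in exclude_codecs.split(",") if t.strip()]
    cid = codec_id.lower()
    return any(tok in cid for tok in tokens)
```

One consequence of substring matching is that the token ac3 also matches A_EAC3, so excluding AC3 this way excludes E‑AC3 as well.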

Performance Notes

XCorr windowing is the main cost. The defaults (10 windows of 15s each) trade speed against robustness.
SSD churn is minimized by deleting temp job directories on success.
For faster previews, reduce scan_chunk_count and/or scan_chunk_duration — then confirm with a second run if needed.

Known Limitations

Baseline engine uses a single global delay per Secondary/Tertiary. It does not splice or model time‑varying drift (that would be an “advanced repair” feature outside this README’s scope).
XCorr can be confused by long uniform ambiences; tuning the window schedule usually fixes it.
VideoDiff requires a separate binary and is subject to its error metric semantics.

Appendix A: Log Line Guide

Examples you’ll see in the GUI log:

$ ffprobe -v error -select_streams v:0 -show_entries format=duration -of csv=p=0 "ref.mkv"
Chunk @1278s -> Delay -1001 ms (Match 95.28%)
Secondary delay determined: -1001 ms
[Delay] Raw delays (ms): ref=0, sec=-1001, ter=-1000
[Delay] Applying lossless global shift: +1001 ms
[Chapters] Renamed chapters to "Chapter NN".
[Chapters] Shifted all timestamps by +1001 ms.
[Chapters] Snap result: moved=3, on_kf=5, too_far=1 (kfs=1234, mode=previous, thr=250ms, starts_only=True)
mkvmerge options file written to: temp_work/job_ref_.../opts.json
[SUCCESS] Output file created: sync_output/RefTitle.mkv

Appendix B: Demux Extension Map

Track type	codec_id contains	Demux extension
video	`V_MPEGH/ISO/HEVC`	`.h265`
	`V_MPEG4/ISO/AVC`	`.h264`
	`V_MPEG1/2`	`.mpg`
	`V_VP9`	`.vp9`
	`V_AV1`	`.av1`
	(else)	`.bin`
audio	`A_TRUEHD`	`.thd`
	`A_EAC3`	`.eac3`
	`A_AC3`	`.ac3`
	`A_DTS`	`.dts`
	`A_AAC`	`.aac`
	`A_FLAC`	`.flac`
	`A_OPUS`	`.opus`
	`A_VORBIS`	`.ogg`
	`A_PCM`	`.wav`
	(else)	`.bin`
subs	`S_TEXT/ASS`	`.ass`
	`S_TEXT/SSA`	`.ssa`
	`S_TEXT/UTF8`	`.srt`
	`S_HDMV/PGS`	`.sup`
	`S_VOBSUB`	`.sub`
	(else)	`.sub`

Special case: A_MS/ACM → attempt stream copy; if refused, decode to PCM with bit‑depth‑aware codec.
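
The table can be expressed as an ordered lookup. In this sketch the track type labels follow the table (video, audio, subs), the table's V_MPEG1/2 entry is read as "V_MPEG1 or V_MPEG2", and the fallback for an unknown track type is an assumption. The A_MS/ACM special case is not handled here.

```python
EXT_MAP = {
    "video": [("V_MPEGH/ISO/HEVC", ".h265"), ("V_MPEG4/ISO/AVC", ".h264"),
              # Assumption: the table's "V_MPEG1/2" means either codec id.
              ("V_MPEG1", ".mpg"), ("V_MPEG2", ".mpg"),
              ("V_VP9", ".vp9"), ("V_AV1", ".av1")],
    "audio": [("A_TRUEHD", ".thd"), ("A_EAC3", ".eac3"), ("A_AC3", ".ac3"),
              ("A_DTS", ".dts"), ("A_AAC", ".aac"), ("A_FLAC", ".flac"),
              ("A_OPUS", ".opus"), ("A_VORBIS", ".ogg"), ("A_PCM", ".wav")],
    "subs": [("S_TEXT/ASS", ".ass"), ("S_TEXT/SSA", ".ssa"),
             ("S_TEXT/UTF8", ".srt"), ("S_HDMV/PGS", ".sup"), ("S_VOBSUB", ".sub")],
}
FALLBACK = {"video": ".bin", "audio": ".bin", "subs": ".sub"}


def demux_ext(track_type, codec_id):
    """Return the demux file extension for a track, checking entries in table order."""
    cid = codec_id.upper()
    for needle, ext in EXT_MAP.get(track_type, []):
        if needle in cid:
            return ext
    # Assumption: unknown track types fall back to .bin.
    return FALLBACK.get(track_type, ".bin")
```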

Appendix C: Configuration Keys (Table)

Key	Type	Default	Notes
`output_folder`	str	`sync_output`	Output target for merged MKV & job logs
`temp_root`	str	`temp_work`	Per‑job scratch directory root
`videodiff_path`	str	`""`	If blank, use PATH
`analysis_mode`	str	`Audio Correlation`	or `VideoDiff`
`scan_chunk_count`	int	`10`	Number of windows for XCorr
`scan_chunk_duration`	int	`15`	Seconds per window
`min_match_pct`	float	`5.0`	Discard XCorr results at or below this
`analysis_lang_ref`	str	`""`	ISO (`jpn`, `eng`) or blank for first stream
`analysis_lang_sec`	str	`""`	Same as above
`analysis_lang_ter`	str	`""`	Same as above
`videodiff_error_min`	float	`0.0`	Reject if error < min
`videodiff_error_max`	float	`100.0`	Reject if error > max
`merge_mode`	str	`plan`	or `manual`
`rename_chapters`	bool	`false`	Rename to `Chapter NN`
`snap_chapters`	bool	`false`	Enable keyframe snapping
`snap_mode`	str	`previous`	or `nearest`
`snap_threshold_ms`	int	`250`	Max move distance
`snap_starts_only`	bool	`true`	Only snap starts
`apply_dialog_norm_gain`	bool	`false`	Remove dialnorm for AC3/E‑AC3
`disable_track_statistics_tags`	bool	`false`	mkvmerge flag
`exclude_codecs`	str	`""`	CSV blacklist (`ac3,dts,pcm`)
`merge_profile`	list	(see defaults)	Rule list, priority‑ordered
`log_compact`	bool	`true`	Compact command runner logs
`log_autoscroll`	bool	`true`	GUI behavior
`log_progress_step`	int	`20`	% step for progress lines
`log_error_tail`	int	`20`	stderr tail lines on error
`log_tail_lines`	int	`0`	stdout tail on success
`log_show_options_pretty`	bool	`false`	Dump pretty opts
`log_show_options_json`	bool	`false`	Dump raw JSON opts
`archive_logs`	bool	`true`	Zip logs after batch

License
This project wraps external tools; respect their licenses. The GUI/engine code is released under the project’s chosen license (add a LICENSE file if needed).

wingedonezero/Video-Sync-GUI