ytdl-org/youtube-dl

Cross-platform duplicates (extension duplicates)

Closed this issue · 5 comments

  • I'm reporting a broken site support issue
  • I've verified that I'm running youtube-dl version 2019.11.28
  • I've checked that all provided URLs are alive and playable in a browser
  • I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • I've searched the bugtracker for similar bug reports including closed ones
  • I've read bugs section in FAQ

Verbose log

No errors or warnings generated

Description

Duplicates are generated when a drive is shared between operating systems, e.g. Windows and Linux.

This happens because the same video is saved with a different file extension on each system (e.g. .mkv on one and .mp4 on the other).

For example, if a batch file is executed on Linux, and then resumed on Windows, each file will be duplicated (once with mkv and once with mp4). This results in duplicated videos and wasted download time.

Example

A batch file contains many links:

...
https://www.youtube.com/watch?v=z7aXex18_wg
...

The batch file is downloaded on Linux, resulting in the file:
5 Simple RV Hacks-z7aXex18_wg.mkv

Now the batch file is resumed on Windows. You'd expect to see '... has already been downloaded', but instead you get the following file:
5 Simple RV Hacks-z7aXex18_wg.mp4

Now you have duplicate files, and the videos that had already been downloaded on Linux are downloaded again.

Expected Behavior

Files should not be duplicated, regardless of extension.

Proposed Fix

Before downloading a video, check whether a file for that video already exists in the output directory with any extension.
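
To make the proposal concrete, here is a rough sketch of such a check. It is not youtube-dl code; `find_existing` is a hypothetical helper, and it assumes the file names follow the `<title>-<id>.<ext>` pattern seen in the example above. Partial downloads (`.part`, `.ytdl`) are skipped so an interrupted download is not mistaken for a finished one.

```python
import os


def find_existing(directory, video_id):
    """Return file names in `directory` that look like finished downloads
    of `video_id`, whatever their extension.

    Hypothetical helper: assumes names of the form '<title>-<id>.<ext>'.
    """
    suffix = "-" + video_id
    matches = []
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        # Skip files without an extension and leftovers of unfinished downloads.
        if not ext or ext in (".part", ".ytdl"):
            continue
        if stem.endswith(suffix):
            matches.append(name)
    return sorted(matches)
```

With the file from the example in the directory, `find_existing(directory, "z7aXex18_wg")` would return the `.mkv` file, and a caller could then skip the download.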

--download-archive FILE          Download only videos not listed in the
                                 archive file. Record the IDs of all
                                 downloaded videos in it.

This does not solve the issue because there is no archive file in the above scenario.

Shouldn't the default behavior be to avoid duplicates?

If a video is present within the output directory, shouldn't it be assumed that it would be present in the archive?

No:

  • the user might want to download multiple formats, for example both webm and mp4.
  • the user might use a different output template.
    ...

So either download in the same environment (the same executables, e.g. ffmpeg, the same dependencies, e.g. pycrypto, and the same configuration) or use a download archive.
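
For anyone who already has files on disk but no archive, an archive can be seeded from the existing file names. The following is only a sketch under several assumptions: `archive_lines` is a hypothetical helper, the files are YouTube videos named `<title>-<id>.<ext>` with 11-character IDs, and the archive holds one `youtube <id>` entry per line. Check that last assumption against an archive file written by youtube-dl itself before relying on it.

```python
import os
import re

# Assumption: an 11-character YouTube ID sits at the end of the file stem,
# directly after a hyphen, as in '5 Simple RV Hacks-z7aXex18_wg.mkv'.
ID_AT_END = re.compile(r"-([0-9A-Za-z_-]{11})$")


def archive_lines(filenames):
    """Build archive entries ('youtube <id>') from existing file names.

    Hypothetical helper; names that do not match the pattern are ignored,
    and each ID is listed once even if it exists with several extensions.
    """
    ids = []
    for name in filenames:
        stem, _ext = os.path.splitext(name)
        match = ID_AT_END.search(stem)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ["youtube " + video_id for video_id in ids]
```

Writing the returned lines to a file and passing that file to `--download-archive` on both systems should then cause those videos to be skipped regardless of extension.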

  1. These files are not duplicates per se.
  2. As already pointed out by @remitamine, you must use the download archive feature, which is designed specifically for such scenarios.

Hindsight is always 20/20. I believe avoiding duplicates should be the default, and allowing extension duplicates should be a CLI argument, but that's just me.

Although I don't quite agree, I appreciate the feedback and the awesome software!