FdupesAnalyzer

A utility that analyzes the output of the fdupes Linux utility to find the level of overlap between directories. Written in R. https://github.com/codecliff/FdupesAnalyzer

Why:

fdupes by Adrián López gives you a file-by-file list of duplicates. It works very well with renamed copies and with files exported by image editors and similar tools. However, to clean up a large dump of files accumulated over years by multiple users, I needed directory-level answers such as "70% of the files in dir A are also in dir B" or "dir A has copies of all the files in dir B". This utility script creates a CSV file with all this information.

How To Use:

  • Run fdupes and redirect the results to a file: fdupes -Sr rootpath >> fdupes_output.txt
  • Edit the R script FDupesParser.R: update the path to the fdupes output file and the rootpath.
  • Run the R script (preferably in interactive mode, ideally in RStudio)
  • Review the CSV file generated by the script
  • (Optional) Generate fdupes commands for each directory pair and run them as a batch
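To give a feel for what the analysis step does with the fdupes output, here is a minimal base-R sketch. It is not the code in FDupesParser.R; the sample paths are made up, and it assumes that "matching" means two directories each hold at least one file from the same duplicate set. It reads text in the shape printed by fdupes -Sr (a size header, the duplicate paths, then a blank line) and counts shared duplicate sets per directory pair. The per-directory file totals (acnt, bcnt) are not in the fdupes output and are not computed here.

```r
# Sample text in the shape printed by `fdupes -Sr` (made-up paths).
fdupes_lines <- c(
  "2048 bytes each:",
  "./A/one.jpg",
  "./B/one_copy.jpg",
  "",
  "512 bytes each:",
  "./A/two.txt",
  "./B/two.txt",
  "./C/two_old.txt",
  ""
)

# With -S, every duplicate set starts with a "<n> byte(s) each:" header
# and ends with a blank line, so counting headers numbers the sets.
is_blank  <- fdupes_lines == ""
is_header <- grepl("^[0-9]+ bytes? each:$", fdupes_lines)
set_id    <- cumsum(is_header)

files <- data.frame(
  set  = set_id[!is_blank & !is_header],
  path = fdupes_lines[!is_blank & !is_header],
  stringsAsFactors = FALSE
)
files$dir <- dirname(files$path)

# For every duplicate set, list each unordered pair of distinct directories.
pairs_per_set <- lapply(split(files$dir, files$set), function(dirs) {
  dirs <- sort(unique(dirs))
  if (length(dirs) < 2) return(NULL)  # all copies live in one directory
  pr <- t(combn(dirs, 2))
  data.frame(dir1 = pr[, 1], dir2 = pr[, 2], stringsAsFactors = FALSE)
})
pairs <- do.call(rbind, pairs_per_set)

# Assumption: matchcnt = number of duplicate sets shared by the two directories.
match_counts <- aggregate(matchcnt ~ dir1 + dir2,
                          data = cbind(pairs, matchcnt = 1L), FUN = sum)
print(match_counts)
```

In the real script you would replace the sample vector with readLines() on your own fdupes_output.txt.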

Output file formats:

Generated CSV file

  • "dir1" : directory 1
  • "dir2" : directory 2
  • "matchcnt" : number of files matching between dir1 and dir2
  • "acnt" : file count in dir1
  • "bcnt" : file count in dir2
  • "aprct" : percent of files in dir1 that have a copy in dir2
  • "bprct" : percent of files in dir2 that have a copy in dir1
  • "maxprct" : the larger of aprct and bprct
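As a worked example of how the percentage columns relate to the counts, the sketch below derives them for one directory pair. The helper name overlap_row and the numbers are hypothetical, and the formulas are inferred from the column descriptions above rather than taken from the script itself.

```r
# Hypothetical helper: derives the percentage columns from the three counts.
overlap_row <- function(dir1, dir2, matchcnt, acnt, bcnt) {
  aprct <- 100 * matchcnt / acnt  # share of dir1 files that have a copy in dir2
  bprct <- 100 * matchcnt / bcnt  # share of dir2 files that have a copy in dir1
  data.frame(dir1, dir2, matchcnt, acnt, bcnt,
             aprct, bprct, maxprct = pmax(aprct, bprct),
             stringsAsFactors = FALSE)
}

# 7 matching files; dir1 holds 10 files, dir2 holds 28.
row <- overlap_row("./A", "./B", matchcnt = 7, acnt = 10, bcnt = 28)
print(row)
```

Here aprct is 70 and bprct is 25, so maxprct is 70. Under this reading, a maxprct of 100 means every file in one of the two directories also exists in the other, which makes sorting by maxprct a quick way to spot directories that are fully covered elsewhere.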

Generated script file

sudo fdupes -dN "./imgs/music" "./imgs/2018-03-oldccombk/stuff/"
sudo fdupes -dN "./ntfs/2017-backup/weds" "./IMAGES/Pictures_2017/.mail_downloads"
sudo fdupes -dN "./IMAGES/Picture/weds" "./IMAGES/Pictures_2017/oldlaptop_hdd"
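
Commands like the ones above can be produced from the CSV rows with a few lines of R. This is only a sketch: the sample rows and the threshold of 90 are made up, and the selection rule (keep pairs whose maxprct reaches the threshold) is an assumption, not necessarily what the script does. It uses shQuote() with single quotes rather than the double quotes shown above.

```r
# Made-up rows in the shape of the generated CSV (only the columns used here).
overlap <- data.frame(
  dir1    = c("./imgs/a", "./docs/x"),
  dir2    = c("./backup/a", "./docs/y"),
  maxprct = c(100, 40),
  stringsAsFactors = FALSE
)

threshold <- 90  # hypothetical cut-off; pick your own
keep <- overlap[overlap$maxprct >= threshold, ]

# shQuote() protects paths that contain spaces or shell metacharacters.
cmds <- sprintf("sudo fdupes -dN %s %s",
                shQuote(keep$dir1, type = "sh"),
                shQuote(keep$dir2, type = "sh"))
writeLines(cmds)
```

Note that fdupes -dN deletes duplicates without prompting and keeps only the first file of each set, so it is worth reading the generated commands, and the order of the two directories in each, before running them as a batch.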

Prerequisites

  • R
  • R packages: data.table, tools
  • fdupes

Tested on

  • Ubuntu 18.04
  • R 3.6.2
  • RStudio 1.1.463

License

MIT

© Rahul Singh 2020 https://github.com/codecliff/FdupesAnalyzer