Smash through to find duplicate files super fast by slicing files intelligently!

smash

CLI tool to smash through to find duplicate files efficiently by slicing a file (or blob) into multiple segments and computing a hash using a fast non-cryptographic algorithm such as xxhash or murmur3.

Amongst the highlights of smash:

  • Super fast analysis of large files thanks to slicing.
  • Suited for finding duplicates on bandwidth-constrained networks or devices and among very large files, but plenty capable on smaller ones!
  • Supports a variety of non-cryptographic algorithms (see algorithms supported).
  • Read-only view of the underlying filesystem when analysing.
  • Reports on duplicate files & empty (0 byte) files.
  • Outputs a report in JSON, which you can operate on with tools like jq (see examples below or the vhs tapes).
  • Used to dedupe multiple terabytes of astrophysics datasets, images and video content, and run regularly to report duplicates.

smash does not natively support pruning duplicates or empty files; you are encouraged to vet the output report before pruning with automated tools.

Find duplicates in the linux/drivers source tree with smash (see our 🍿 other demos). Made with vhs!

The name comes from a prototype tool called SmartHash, written many years ago in C/ASM (its source is now lost and it is too hard to modernise). It operated on a similar concept of slicing and hashing, first with CRC32 and later with MD5.

Installation

Operating Systems

You can download the latest binaries from GitHub Releases or use our simple installer script, which currently supports Linux, macOS, FreeBSD & Windows:

bash <(curl -s https://raw.githubusercontent.com/thushan/smash/main/install.sh)

It will download the latest version & extract it to its own folder for you.

Alternatively, you can install it via go:

go install github.com/thushan/smash@latest

smash has been developed on Linux (Pop!_OS & Fedora), tested on macOS, FreeBSD & Windows.

Usage

Important

From v0.9.0 onwards, smash only looks for duplicates in the current folder. To smash sub-folders, use the --recurse or -r switch.

Usage:
  smash [flags] [locations-to-smash]

Flags:
      --algorithm algorithm    Algorithm to use to hash files. Supported: xxhash, murmur3, md5, sha512, sha256 (full list, see readme) (default xxhash)
      --base strings           Base directories to use for comparison Eg. --base=/c/dos,/c/dos/run/,/run/dos/run
      --disable-autotext       Disable detecting text-files to opt for a full hash for those
      --disable-meta           Disable storing of meta-data to improve hashing mismatches
      --disable-slicing        Disable slicing & hash the full file instead
      --exclude-dir strings    Directories to exclude separated by comma Eg. --exclude-dir=.git,.idea
      --exclude-file strings   Files to exclude separated by comma Eg. --exclude-file=.gitignore,*.csv
  -h, --help                   help for smash
      --ignore-empty           Ignore empty/zero byte files (default true)
      --ignore-hidden          Ignore hidden files & folders Eg. files/folders starting with '.' (default true)
      --ignore-system          Ignore system files & folders Eg. '$MFT', '.Trash' (default true)
  -L, --max-size int           Maximum file size to consider for hashing (in bytes)
  -p, --max-threads int        Maximum threads to utilise (default 16)
  -w, --max-workers int        Maximum workers to utilise when smashing (default 16)
  -G, --min-size int           Minimum file size to consider for hashing (in bytes)
      --nerd-stats             Show nerd stats
      --no-output              Disable report output
      --no-progress            Disable progress updates
      --no-top-list            Hides top x duplicates list
  -o, --output-file string     Export analysis as JSON (generated automatically otherwise)
      --profile                Enable Go Profiler - see localhost:1984/debug/pprof
      --progress-update int    Update progress every x seconds (default 5)
  -r, --recurse                Recursively search directories for files
      --show-duplicates        Show full list of duplicates
      --show-top int           Show the top x duplicates (default 10)
  -q, --silent                 Run in silent mode
      --slice-size int         Size of a Slice (in bytes) (default 8192)        
      --slice-threshold int    Threshold to use for slicing (in bytes) - if file is smaller than this, it won't be sliced (default 102400)
      --slices int             Number of Slices to use (default 4)
      --verbose                Run in verbose mode
  -v, --version                Show version information

See the full list of algorithms supported.

Examples

Examples are given in Unix format, but apply to Windows as well.

Tip

To recursively smash through directories, use the --recurse or -r switch.

By default, smash will only look in the current folder (from v0.9+).

Basic

To check for duplicates in a single path (Eg. ~/media/photos) and output the report to report.json:

$ ./smash ~/media/photos -r -o report.json

You can then look at report.json with jq to check duplicates:

$ jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l

Show Empty Files

By default, smash ignores empty files but can report on them with the --ignore-empty=false argument:

$ ./smash ~/media/photos -r --ignore-empty=false -o report.json

You can then look at report.json with jq to check empty files:

$ jq '.analysis.empty[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l

Show Top 50 Duplicates

By default, smash shows the top 10 duplicate files in the CLI and leaves the rest for the report. You can change that with the --show-top=50 argument to show the top 50 instead.

$ ./smash ~/media/photos -r --show-top=50

Multiple Directories

To check across multiple directories - which can be different drives, or mounts (Eg. ~/media/photos and /mnt/my-usb-drive/photos):

$ ./smash -r ~/media/photos /mnt/my-usb-drive/photos

smash will find and report all duplicates within any number of directories passed in.

Exclude Files or Directories

You can exclude certain directories or files with the --exclude-dir and --exclude-file switches, which accept wildcard characters:

$ ./smash -r --exclude-dir=.git,.svn --exclude-file=.gitignore,*.csv ~/media/photos

For example, to exclude specific hidden folders on Unix (those that start with a dot, such as .config or .gnome):

$ ./smash -r --exclude-dir=.config,.gnome ~/media/photos
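To get a feel for how a comma-separated list with wildcards such as `.gitignore,*.csv` can select names, here is a small Go sketch using `filepath.Match` from the standard library. Whether smash uses the same matching rules is an assumption; the `excluded` helper is hypothetical and only illustrates shell-style glob matching against a bare file or folder name.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// excluded reports whether name matches any pattern in a comma-separated
// list, for example ".gitignore,*.csv".
func excluded(name, list string) bool {
	for _, pattern := range strings.Split(list, ",") {
		// filepath.Match only returns an error for a malformed pattern;
		// this sketch treats such a pattern as matching nothing.
		if ok, err := filepath.Match(pattern, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{".gitignore", "sales.csv", "photo.jpg"} {
		fmt.Printf("%-12s excluded=%v\n", name, excluded(name, ".gitignore,*.csv"))
	}
}
```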

Disabling Slicing & Getting Full Hash

By default, smash slices a file into multiple segments and hashes those parts rather than the whole file.

If you prefer not to use slicing for a run, you can disable slicing with:

$ ./smash -r --disable-slicing ~/media/photos

Changing Hashing Algorithms

By default, smash uses xxhash, an extremely fast non-cryptographic hash algorithm. However, you can choose from a variety of algorithms, as documented.

To use another supported algorithm, use the --algorithm switch:

$ ./smash -r --algorithm=murmur3 ~/media/photos

Acknowledgements

This project was made possible thanks to the following projects or folks.

Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH, SpencerB, EmadA, ChrisE, AngelaB, LisaA, YousefI, JeffG, MattP

License

Copyright (c) Thushan Fernando and licensed under Apache License 2.0