duplicate-file-finder

A duplicate file finder.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).


Duplicate file finder:

Yes, another duplicate file finder written in Python...

There are many ways to find duplicate files with Python, but on a large set of files (2400 in our case) a simple script can be slow and can use enough memory to crash the environment. We also tried an existing tool, but its analysis was time-consuming and its user interface was not very efficient.

Here, a duplicate file is one that has the same content as another file, which is determined by comparing size and hash value. Files of different sizes cannot be identical, so filtering by size limits how many files need hashing. The script therefore uses Python to filter the files by size and speeds up the hash step with the xxhash library.
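
The size-then-hash idea described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the names `file_hash` and `find_duplicates` are hypothetical, and the fallback to `hashlib.blake2b` exists only so the sketch runs when xxhash is not installed.

```python
import os
from collections import defaultdict

try:
    import xxhash  # third-party: pip install xxhash

    def new_hasher():
        return xxhash.xxh64()
except ImportError:  # fallback (assumption) so the sketch runs without xxhash
    import hashlib

    def new_hasher():
        return hashlib.blake2b(digest_size=8)


def file_hash(path, blocksize=65536):
    """Hash a file block by block so memory use stays flat."""
    hasher = new_hasher()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


def find_duplicates(root):
    """Return sorted lists of paths that share both size and hash."""
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_hash(path)].append(path)
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Only files that share a size with at least one other file are ever opened, which is where most of the saving comes from on large trees.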

What this script does:

  • file duplicate analysis
  • report in the terminal (for example over an SSH session)
  • dump a JSON file

What this script doesn't do:

  • delete files
  • make the coffee
  • bitcoin analysis

Use:

$ find-duplicate
USAGE: find-duplicate [-h] [--version] [--debug] [--logfile]
                      [--dump] -p PATH

DESCRIPTION:
    This module find duplicate files in a path using "-p <path>" option
    with the command line.

OPTIONS:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --debug               print debug messages to stderr
  --logfile             generate a logfile - "report.log"
  --dump                generate a summary - "summary_<id>.json"

REQUIRED ARGUMENTS:
  -p, --path PATH       define the /path/to/check

COMPATIBILITY:
    Python 3.7+ - https://www.python.org/

EXIT STATUS:
    This script exits 0 on success, and >0 if an error occurs.

Compatibility:

Python 3.7+

Setup:

  • User:

Get the package:

git clone https://github.com/francois-le-ko4la/duplicate-file-finder.git

Change to the folder:

cd duplicate-file-finder

Install with make on Linux/Unix/MacOS or use pip3 otherwise:

make install

  • Dev environment:

Get the package:

git clone https://github.com/francois-le-ko4la/duplicate-file-finder.git

Change to the folder:

cd duplicate-file-finder

Create your environment with all dev prerequisites and install the package:

make venv
source venv/bin/activate
make dev

Test:

This module has been tested and validated on Ubuntu. The tests are available if you set up the package with the dev environment.

make test

License:

This package is distributed under the GPLv3 license.

Dev notes

TOML file:

# -*- coding: utf-8 -*-
[project]
name = "duplicatefile"
version = "0.1.1"
authors = [
  {name = "ko4la" }
]
description = "This module find duplicate files in a path."
license = {file = "LICENSE"}
requires-python = ">=3.7"
classifiers = [
    "Development Status :: 5 - Production/Stable",
    "Environment :: Console",
    "Intended Audience :: Developers",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.7",
    "Programming Language :: Python :: 3.8",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3 :: Only",
    "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",

]
dependencies = [
    'rich>=12.6.0',
    'rich_argparse>=0.6.0',
    'xxhash>=3.1.0'
    ]

[project.optional-dependencies]
dev = [
    "pycodestyle>=2.3.1",
    "pytest>=7.2.0",
    "pylint",
    "mypy",
    "pydocstyle",
    "pytest-pylint",
    "pytest-pycodestyle",
    "pytest-mypy",
    "pytest-pydocstyle",
    "pytest-isort",
    "types-setuptools"]

[project.urls]
"Homepage" = "https://github.com/francois-le-ko4la/duplicate-file-finder"

[project.scripts]
find-duplicate = "duplicatefile.duplicatefile:main"

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.pytest.ini_options]
minversion = "7.2"
addopts = [
    "-v",
    "--pycodestyle",
    "--doctest-modules",
    "--mypy",
    "--pydocstyle",
    "--pylint",
    "--isort",
    "--strict-markers"
]
python_files = ["*.py"]
xfail_strict = true
filterwarnings = [
    "ignore:.*U.*mode is deprecated:DeprecationWarning",
    "ignore::DeprecationWarning"]

[tool.mypy]
disallow_any_generics = true
disallow_untyped_defs = true
warn_redundant_casts = true
strict_equality = true

UML Diagram:

classDiagram
  class deque {
    iterable : list
    maxlen : int
    append(x)
    appendleft(x)
    clear()
    copy()
    count(x)
    extend(iterable)
    extendleft(iterable)
    index(x, start, end)
    insert(i, x)
    pop()
    popleft()
    remove(value)
    reverse()
    rotate(n)
  }
  class DetectDuplicate {
    num_of_files
    get_json() str
  }
  class EventMSG {
    debug : str
    error : str
    info : str
    warning : str
  }
  class ExitStatus {
    name
  }
  class LogMessages {
    args
    dump
    elapse_time
    logfile
    num_of_files
    path
    python
    python_import
    result
  }
  class LoggingSetup {
    default_format : str
    default_level : str
    encoding : str
    file_format : str
    json_dump : str
    log_file : str
    simple_format : str
  }
  class MyFile {
    hash : Optional[str]
    path : str
    size : Optional[int]
    get_hash(blocksize: int) MyFile
    get_size(path: str) MyFile
  }
  class Enum {
    name()
    value()
  }
  class IntEnum {
  }
  class ReprEnum {
  }
  class NamedTuple {
  }
  EventMSG --|> NamedTuple
  ExitStatus --|> IntEnum
  LogMessages --|> NamedTuple
  LoggingSetup --|> NamedTuple
  MyFile --|> NamedTuple
  IntEnum --|> ReprEnum
  ReprEnum --|> Enum
  EventMSG --* LogMessages : logfile
  EventMSG --* LogMessages : args
  EventMSG --* LogMessages : path
  EventMSG --* LogMessages : python
  EventMSG --* LogMessages : python_import
  EventMSG --* LogMessages : dump
  EventMSG --* LogMessages : result
  EventMSG --* LogMessages : num_of_files
  EventMSG --* LogMessages : elapse_time
  deque --* DetectDuplicate : __hash
  deque --* DetectDuplicate : __files


Objects:

ExitStatus()
LoggingSetup()
EventMSG()
LogMessages()
get_argparser()
MyFile()
MyFile.get_size()
MyFile.get_hash()
DetectDuplicate()
@Property DetectDuplicate.num_of_files()
DetectDuplicate.get_json()
check_python()
define_logfile()
check_arg()
dump_result()
main()

ExitStatus()

class ExitStatus(IntEnum):
Define Exit status.
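
As an illustration of how an IntEnum-based exit status can work, here is a minimal sketch. The member names `EX_OK` and `EX_KO` are hypothetical and are not taken from the package; only the "0 on success, >0 on error" convention comes from the help text.

```python
from enum import IntEnum


class ExitStatus(IntEnum):
    """Exit codes; member names are assumptions for this sketch."""

    EX_OK = 0  # success
    EX_KO = 1  # any error


def main() -> ExitStatus:
    """Stand-in for the real entry point."""
    return ExitStatus.EX_OK


status = main()
print(status.name, int(status))
```

Because IntEnum members are integers, the value returned by `main()` can be handed directly to `sys.exit()`.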

LoggingSetup()

class LoggingSetup(NamedTuple):
Define logging Parameters.

Examples:

>>> my_setup = LoggingSetup()
>>> my_setup.default_level
'INFO'
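
A NamedTuple with default values is enough to get the behaviour shown in the example. The sketch below assumes a subset of the fields: `default_level` and the `report.log` file name appear in this README, while the `encoding` and `default_format` values are guesses.

```python
from typing import NamedTuple


class LoggingSetup(NamedTuple):
    """Logging parameters held as an immutable record."""

    default_level: str = "INFO"          # from the doctest above
    log_file: str = "report.log"         # from the --logfile help text
    encoding: str = "utf-8"              # assumed value
    default_format: str = "%(message)s"  # assumed value


my_setup = LoggingSetup()
print(my_setup.default_level)
```

Since the tuple is immutable, a variant is built with `_replace`, for example `my_setup._replace(default_level="DEBUG")`.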

EventMSG()

class EventMSG(NamedTuple):
Define messages with different severity levels.

Attributes:
    info (str): message for info ("" by default)
    warning (str): message for warning ("" by default)
    error (str): message for error ("" by default)
    debug (str): message for debug ("" by default)

Examples:

    >>> logfile = EventMSG(info="Log file used: %s")
    >>> logfile.info
    'Log file used: %s'

LogMessages()

class LogMessages(NamedTuple):
Set standard logging messages.
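
One way EventMSG and LogMessages could fit together is sketched below. The `logfile` message comes from the EventMSG example; the `path` messages are invented wording for illustration only.

```python
import logging
from typing import NamedTuple


class EventMSG(NamedTuple):
    """One message template per severity, empty by default."""

    info: str = ""
    warning: str = ""
    error: str = ""
    debug: str = ""


class LogMessages(NamedTuple):
    """Catalogue of messages; each field is an EventMSG."""

    logfile: EventMSG = EventMSG(info="Log file used: %s")
    # Hypothetical wording, not taken from the package:
    path: EventMSG = EventMSG(info="Path: %s", error="Invalid path: %s")


messages = LogMessages()
# The logging module fills in the %s placeholder lazily.
logging.getLogger(__name__).info(messages.logfile.info, "report.log")
```

Keeping the templates in one place means call sites only pick a message and a severity.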

get_argparser()

def get_argparser() -> ArgumentParser:
Define the argument parser.

This function defines the argument parser and returns it.

Returns:
    ArgumentParser

Examples:

    >>> a = get_argparser()
    >>> type(a)
    <class 'argparse.ArgumentParser'>
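
For illustration, a parser matching the options in the usage text could look like the sketch below. It uses only the standard-library argparse (the package itself depends on rich_argparse for formatting), and the description string is paraphrased.

```python
from argparse import ArgumentParser


def get_argparser() -> ArgumentParser:
    """Build a parser mirroring the options listed in the usage text."""
    parser = ArgumentParser(
        prog="find-duplicate",
        description="Find duplicate files in a path.",
    )
    parser.add_argument("--version", action="version",
                        version="%(prog)s 0.1.1")
    parser.add_argument("--debug", action="store_true",
                        help="print debug messages to stderr")
    parser.add_argument("--logfile", action="store_true",
                        help='generate a logfile - "report.log"')
    parser.add_argument("--dump", action="store_true",
                        help='generate a summary - "summary_<id>.json"')
    parser.add_argument("-p", "--path", required=True,
                        help="define the /path/to/check")
    return parser


args = get_argparser().parse_args(["-p", "/tmp", "--dump"])
print(args.path, args.dump, args.debug)
```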

MyFile()

class MyFile(NamedTuple):
Describe a file with a NamedTuple.

@classmethod is used to initialize the objects correctly.

Notes:
    The objective is to describe a file with a single NamedTuple.
    The NamedTuple is created by the get_size function, which sets the
    path and size.
    Computing the hash consumes resources, so it is calculated later by
    the get_hash function, which creates a new NamedTuple.

Examples:

    >>> test = MyFile.get_size("test_fic/doc.txt")
    >>> test
    MyFile(path='test_fic/doc.txt', size=544)
    >>> test = test.get_hash()
    >>> test
    MyFile(path='test_fic/doc.txt', size=544, hash='a5cd732df22bfdbd')
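
The two-step construction described in the notes can be sketched like this. It is an approximation under stated assumptions: `hashlib.blake2b` stands in for xxhash so the block has no third-party dependency, and the default repr here differs from the one shown in the examples above.

```python
import hashlib
import os
from typing import NamedTuple, Optional


class MyFile(NamedTuple):
    """Immutable description of a file: path, size, optional hash."""

    path: str
    size: Optional[int] = None
    hash: Optional[str] = None

    @classmethod
    def get_size(cls, path: str) -> "MyFile":
        """Cheap step: record the path and size, leave hash unset."""
        return cls(path=path, size=os.path.getsize(path))

    def get_hash(self, blocksize: int = 65536) -> "MyFile":
        """Expensive step: read the file and return a new tuple."""
        hasher = hashlib.blake2b(digest_size=8)  # stand-in for xxhash
        with open(self.path, "rb") as handle:
            for block in iter(lambda: handle.read(blocksize), b""):
                hasher.update(block)
        return self._replace(hash=hasher.hexdigest())
```

Because `_replace` returns a new tuple, the original object keeps `hash=None`, which is why the examples reassign `test = test.get_hash()`.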

MyFile.get_size()

@classmethod
def MyFile.get_size(cls, path: str) -> MyFile:
Define a MyFile object from a path.

This function creates the MyFile object with the file's path;
the file's size is set at creation and hash is None by default.
The path is not tested here, because we use os.walk to get the file
list.

Args:
    path: The file's path.

Returns:
    MyFile

Examples:

    >>> test = MyFile.get_size('test_fic/doc.txt')
    >>> test
    MyFile(path='test_fic/doc.txt', size=544)

MyFile.get_hash()

def MyFile.get_hash(self, blocksize: int = 65536) -> MyFile:
Calculate the file's hash and generate a new object.

This function is called on an existing MyFile object and returns
a new MyFile object.

Args:
  blocksize:  blocksize used to read the file. (Default value = 65536)

Returns:
    MyFile

Examples:

    >>> test = MyFile.get_size('test_fic/doc.txt')
    >>> test = test.get_hash()
    >>> test
    MyFile(path='test_fic/doc.txt', size=544, hash='a5cd732df22bfdbd')

DetectDuplicate()

class DetectDuplicate():
Class to organize and find duplicate files.

@Property DetectDuplicate.num_of_files()

@property
def DetectDuplicate.num_of_files(self) -> int:
Get the number of files.

Returns:
    int: Number of files.

Examples:

    >>> a = DetectDuplicate('test_fic/')
    >>> a.num_of_files
    5

DetectDuplicate.get_json()

def DetectDuplicate.get_json(self) -> str:
Get the result (JSON format).

Returns:
    str: result (JSON).

Examples:

    >>> a = DetectDuplicate('test_fic/')
    >>> print(a.get_json())
    {
        "path": "test_fic/",
        ...
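
The output above is consistent with `json.dumps` called with a four-space indent. In the sketch below, only the `"path"` key is taken from the example; the other keys and the `to_json` helper are hypothetical.

```python
import json
from collections import deque


def to_json(path, groups):
    """Serialize duplicate groups; keys other than "path" are assumptions."""
    result = {
        "path": path,
        "num_of_groups": len(groups),
        "duplicates": [list(group) for group in groups],
    }
    return json.dumps(result, indent=4)


groups = deque([["test_fic/a.txt", "test_fic/b.txt"]])
print(to_json("test_fic/", groups))
```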

check_python()

def check_python() -> bool:
Check python version.

This function checks the Python version, logs the result and returns a
status True/False.

Returns:
    True if successful, False otherwise.

Examples:

    >>> check_python()
    True
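
A version check of this kind usually compares `sys.version_info` against a minimum. The sketch below assumes the 3.7 minimum stated under Compatibility; the `minimum` parameter and the log wording are additions for illustration.

```python
import logging
import sys

MIN_PYTHON = (3, 7)  # minimum version from the Compatibility section


def check_python(minimum=MIN_PYTHON) -> bool:
    """Return True when the interpreter is at least `minimum`."""
    current = sys.version_info[:2]
    if current >= minimum:
        logging.debug("Python %d.%d detected", *current)
        return True
    logging.error("Python %d.%d or newer is required", *minimum)
    return False
```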

define_logfile()

def define_logfile() -> None:
Define the logfile.

This function sets up logging to push log events to the report file.
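
With the standard logging module, routing events to a file amounts to attaching a FileHandler. The sketch below is an assumption about how this might be done: the real function takes no arguments, so the `log_file` and `encoding` parameters and the format string are hypothetical.

```python
import logging


def define_logfile(log_file: str = "report.log",
                   encoding: str = "utf-8") -> None:
    """Attach a file handler to the root logger."""
    handler = logging.FileHandler(log_file, encoding=encoding)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logging.getLogger().addHandler(handler)
```

Any message logged afterwards at or above the root logger's level is also written to the file.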

check_arg()

def check_arg(args: Namespace) -> bool:
Check user's arguments.

This function checks the user's arguments, logs info/error and returns a
status True/False.

Args:
  args: Namespace.

Returns:
    True if successful, False otherwise.

Examples:

    >>> myargs = Namespace(path='/etc/')
    >>> check_arg(myargs)
    True
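
The doctest suggests the check succeeds for an existing directory. A minimal sketch, assuming the validation is simply "the path is a directory" (the real function may check more), with invented log wording:

```python
import logging
import os
from argparse import Namespace


def check_arg(args: Namespace) -> bool:
    """Return True when args.path points to an existing directory."""
    if os.path.isdir(args.path):
        logging.info("Path: %s", args.path)
        return True
    logging.error("Path is not a directory: %s", args.path)
    return False
```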

dump_result()

def dump_result(data: str) -> bool:
Dump the result in a JSON file.

This function dumps the JSON, logs info/error and returns a status True/False.

Args:
  data: JSON str.

Returns:
    True if successful, False otherwise.
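
Writing the JSON string and reporting success as a boolean can be sketched as follows. The `filename` parameter is hypothetical, and using a timestamp for the `<id>` part of `summary_<id>.json` is an assumption; the README does not say how the id is built.

```python
import logging
import time


def dump_result(data: str, filename: str = "") -> bool:
    """Write the JSON string to a file and report success."""
    # Assumption: a Unix timestamp plays the role of <id>.
    target = filename or f"summary_{int(time.time())}.json"
    try:
        with open(target, "w", encoding="utf-8") as handle:
            handle.write(data)
    except OSError as err:
        logging.error("Cannot write %s: %s", target, err)
        return False
    logging.info("Summary written to %s", target)
    return True
```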

main()

def main() -> ExitStatus:
Define the main function.

Returns:
    ExitStatus: exit value (an IntEnum, so it is also an int)