Primary language: C#. License: MIT.

Copy Catcher

Overview

Copy Catcher is a NuGet package that finds and lists duplicate files within a specified directory. It detects files with identical content by combining buffered reading, early byte comparison, chunk hashing, and parallel processing.

Key Benefits & Features

  • Buffered Reading: Copy Catcher uses buffered reading to efficiently read large files in chunks, reducing memory usage and enhancing performance.

  • Asynchronous Operations: The package uses asynchronous, non-blocking I/O, which keeps the calling application responsive when scanning large directories or files.

  • Early Byte Exiting: Before hashing the entire file, Copy Catcher checks the initial bytes of files. If two files have different initial bytes, they are immediately identified as distinct, saving computational resources.

  • Chunk Hashing: Instead of hashing the entire file in one go, Copy Catcher hashes files in chunks. This approach is more memory-efficient and allows for faster identification of large duplicate files.

  • Parallelism: The package scans and hashes multiple files concurrently, using multiple processor cores to reduce the time needed to identify duplicates in large directories.
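The early byte check and chunk hashing described above can be illustrated with a minimal sketch. This is not Copy Catcher's actual implementation: the class name DuplicateCheckSketch, the method names, the 4 KB prefix length, the 80 KB chunk size, and the choice of SHA-256 are all assumptions made for illustration, using only standard .NET APIs.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical sketch; names and sizes are assumptions, not the package's API.
public static class DuplicateCheckSketch
{
    private const int PrefixLength = 4096;  // bytes compared before any hashing
    private const int ChunkSize = 81920;    // bytes read and hashed per iteration

    // Early exit: files whose leading bytes differ cannot be duplicates.
    public static bool PrefixesMatch(string pathA, string pathB)
    {
        byte[] a = ReadPrefix(pathA);
        byte[] b = ReadPrefix(pathB);
        return a.AsSpan().SequenceEqual(b);
    }

    // Reads up to PrefixLength bytes from the start of the file.
    private static byte[] ReadPrefix(string path)
    {
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, ChunkSize);
        var buffer = new byte[PrefixLength];
        int total = 0;
        int read;
        while (total < buffer.Length &&
               (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }
        return buffer.AsSpan(0, total).ToArray();
    }

    // Chunk hashing: feeds the file to the hash one buffer at a time,
    // so memory use stays at ChunkSize regardless of file size.
    public static string HashInChunks(string path)
    {
        using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, ChunkSize);
        var buffer = new byte[ChunkSize];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            hasher.AppendData(buffer, 0, read);
        }
        return Convert.ToHexString(hasher.GetHashAndReset());
    }
}
```

In a sketch like this, PrefixesMatch would run first, and HashInChunks would only be called for files that pass the prefix comparison.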

Getting Started

Prerequisites

  • .NET SDK installed on your machine.
  • A .NET project where you want to use Copy Catcher.

Installation

Install the Copy Catcher NuGet package using the NuGet Package Manager:

Install-Package CopyCatcher

Or using the .NET CLI:

dotnet add package CopyCatcher

Usage

Integration

In your .NET project, add the following using directive:

using CopyCatcher.Shared;

Create an instance of the DuplicateFinderService:

var service = new DuplicateFinderService("path/to/directory");

Call the FindDuplicates method:

var duplicates = service.FindDuplicates();

Output

The FindDuplicates method returns a dictionary whose keys are hash values and whose values are lists of the file paths that share that hash:

{
    "abc123def456": ["path/to/duplicate1.txt", "path/to/duplicate2.txt"],
    ...
}
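One way to consume a result of this shape is to keep one file per group and list the rest as redundant copies. The sketch below assumes the result can be treated as a dictionary from string to List<string>; the exact return type is not stated here, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; assumes hash -> list-of-paths groups as described above.
public static class DuplicateReportSketch
{
    // Returns every path except the first (in ordinal order) from each
    // group that contains two or more files.
    public static List<string> RedundantCopies(IReadOnlyDictionary<string, List<string>> groups)
    {
        return groups.Values
            .Where(paths => paths.Count > 1)
            .SelectMany(paths => paths.OrderBy(p => p, StringComparer.Ordinal).Skip(1))
            .ToList();
    }
}
```

Sorting before skipping makes the choice of which copy to keep deterministic. Groups with a single path are ignored, in case the result also contains unique files.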

Console App Example

A simple .NET Console app using Copy Catcher would look like this:

using CopyCatcher;

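Console.WriteLine("Enter the directory path:");
var directoryPath = Console.ReadLine();

// Console.ReadLine returns null at end of input, so validate before scanning
if (string.IsNullOrWhiteSpace(directoryPath))
{
    Console.WriteLine("No directory path was entered.");
    return;
}

// Initialize the service and find duplicates
var duplicateFinderService = new DuplicateFinderService(directoryPath);
var duplicates = duplicateFinderService.FindDuplicates();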

// Display results
foreach (var duplicate in duplicates)
{
    Console.WriteLine($"Hash: {duplicate.Key}");
    foreach (var filePath in duplicate.Value)
    {
        Console.WriteLine($" - {filePath}");
    }
}

How It Works

Components

  • FileReader: Reads files from the file system.
  • FileHasher: Computes a hash value for each file to determine duplicates.
  • DirectoryScanner: Scans the specified directory and retrieves a list of all files. It uses the DirectoryProvider to access the file system, ensuring better testability and separation of concerns.
  • DirectoryProvider: Provides direct access to the file system, used by DirectoryScanner.
  • DuplicateFinderService: The main service that ties all components together and provides an easy-to-use interface for finding duplicates.
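The testability benefit of routing file system access through a provider can be sketched as follows. Every type and member name below is hypothetical, and the recursive enumeration is an assumption; the package's real interfaces may look different.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical abstraction over the file system.
public interface IDirectoryProviderSketch
{
    IEnumerable<string> GetFiles(string directory);
}

// Implementation backed by System.IO (recursion is an assumption).
public sealed class PhysicalDirectoryProviderSketch : IDirectoryProviderSketch
{
    public IEnumerable<string> GetFiles(string directory) =>
        Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories);
}

// Test double that serves fixed paths without touching the disk.
public sealed class InMemoryDirectoryProviderSketch : IDirectoryProviderSketch
{
    private readonly Dictionary<string, string[]> _filesByDirectory;

    public InMemoryDirectoryProviderSketch(Dictionary<string, string[]> filesByDirectory) =>
        _filesByDirectory = filesByDirectory;

    public IEnumerable<string> GetFiles(string directory) =>
        _filesByDirectory.TryGetValue(directory, out var files)
            ? files
            : Enumerable.Empty<string>();
}

// The scanner depends only on the interface, so tests can inject the fake.
public sealed class DirectoryScannerSketch
{
    private readonly IDirectoryProviderSketch _provider;

    public DirectoryScannerSketch(IDirectoryProviderSketch provider) => _provider = provider;

    public List<string> Scan(string directory) => _provider.GetFiles(directory).ToList();
}
```

With this split, a unit test can construct DirectoryScannerSketch with the in-memory provider and check its output without creating any files.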

Workflow

  1. The user specifies a directory to be scanned.
  2. DirectoryScanner retrieves a list of all files in the directory.
  3. FileHasher computes a hash for each file.
  4. Duplicate files are identified based on their hash values and returned in a dictionary.
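The workflow above can be approximated with standard .NET APIs alone. The following is a rough sketch under stated assumptions, not the package's code: it scans recursively, hashes with SHA-256, uses PLINQ for concurrency, and drops groups that contain only one file. The class name WorkflowSketch and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical end-to-end sketch of the scan, hash, and group steps.
public static class WorkflowSketch
{
    public static Dictionary<string, List<string>> FindDuplicates(string directory)
    {
        return Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories) // step 2: list files
            .AsParallel()                                                            // hash files concurrently
            .Select(path => (Hash: HashFile(path), FilePath: path))                  // step 3: hash each file
            .GroupBy(entry => entry.Hash, entry => entry.FilePath)                   // step 4: group by hash
            .Where(sameHash => sameHash.Count() > 1)                                 // keep real duplicates only
            .ToDictionary(sameHash => sameHash.Key, sameHash => sameHash.ToList());
    }

    // Streams the file through SHA-256 and returns the digest as hex.
    private static string HashFile(string path)
    {
        using var sha256 = SHA256.Create();
        using var stream = File.OpenRead(path);
        return Convert.ToHexString(sha256.ComputeHash(stream));
    }
}
```

Calling WorkflowSketch.FindDuplicates with a directory path yields a dictionary in the same hash-to-paths shape as the Output section describes.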