Copy Catcher

Overview

Copy Catcher is a NuGet package that finds and lists duplicate files within a specified directory. It combines buffered reads, early byte comparison, chunked hashing, and parallel processing to detect files with identical content efficiently and accurately.

Key Benefits & Features

Buffered Reading: Copy Catcher reads large files in buffered chunks instead of loading them whole, which keeps memory usage low and improves performance.
Asynchronous Operations: The package is designed around asynchronous, non-blocking I/O, so the calling application stays responsive when scanning large directories or files.
Early Byte Exiting: Before hashing an entire file, Copy Catcher compares the initial bytes of the files. If two files differ in their initial bytes, they are identified as distinct immediately, which avoids the cost of hashing them in full.
Chunk Hashing: Instead of hashing an entire file in one go, Copy Catcher hashes it chunk by chunk. This is more memory-efficient and speeds up the identification of large duplicate files.
Parallelism: The package scans and hashes multiple files concurrently to make use of multi-core processors, reducing the time needed to identify duplicates in large directories.
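
To make the early byte check and chunk hashing concrete, here is a minimal sketch of both ideas. It is not the package's internal code: the class and method names (HashingSketch, InitialBytesMatch, HashInChunks), the 4096-byte prefix, the 81920-byte chunk size, and the choice of SHA-256 are all assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of the Copy Catcher API.
public static class HashingSketch
{
    // Returns true when the first `count` bytes of both files are identical.
    // A false result means the files cannot be duplicates.
    public static bool InitialBytesMatch(string pathA, string pathB, int count = 4096)
    {
        using var a = File.OpenRead(pathA);
        using var b = File.OpenRead(pathB);
        var bufferA = new byte[count];
        var bufferB = new byte[count];
        int readA = ReadUpTo(a, bufferA);
        int readB = ReadUpTo(b, bufferB);
        return readA == readB
            && bufferA.AsSpan(0, readA).SequenceEqual(bufferB.AsSpan(0, readB));
    }

    // Hashes a file chunk by chunk so only one buffer is held in memory.
    // SHA-256 is an assumed algorithm, chosen only for this sketch.
    public static string HashInChunks(string path, int chunkSize = 81920)
    {
        using var stream = File.OpenRead(path);
        using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        var buffer = new byte[chunkSize];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            hasher.AppendData(buffer, 0, read);
        }
        return Convert.ToHexString(hasher.GetHashAndReset());
    }

    // Stream.Read may return fewer bytes than requested, so loop until
    // the buffer is full or the end of the file is reached.
    private static int ReadUpTo(Stream stream, byte[] buffer)
    {
        int total = 0;
        int read;
        while (total < buffer.Length
            && (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += read;
        }
        return total;
    }
}
```

In a flow like this, the cheap prefix comparison would run first, and the full chunked hash would only be computed for files that pass it.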

Getting Started

Prerequisites

.NET SDK installed on your machine.
A .NET project where you want to use Copy Catcher.

Installation

Install the Copy Catcher NuGet package using the NuGet Package Manager Console:

Install-Package CopyCatcher

Or using the .NET CLI:

dotnet add package CopyCatcher

Usage

Integration

In your .NET project, add the following using directive:

using CopyCatcher.Shared;

Create an instance of the DuplicateFinderService:

var service = new DuplicateFinderService("path/to/directory");

Call the FindDuplicates method:

var duplicates = service.FindDuplicates();

Output

The FindDuplicates method returns a dictionary in which each key is a hash value and each value is the list of file paths whose content produced that hash:

{
    "abc123def456": ["path/to/duplicate1.txt", "path/to/duplicate2.txt"],
    ...
}

Console App Example

A simple .NET Console app using Copy Catcher would look like this:

using CopyCatcher.Shared;

Console.WriteLine("Enter the directory path:");
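var directoryPath = Console.ReadLine();

// Stop early if no usable directory was entered
if (string.IsNullOrWhiteSpace(directoryPath) || !Directory.Exists(directoryPath))
{
    Console.WriteLine("Directory not found.");
    return;
}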

// Initialize the service and find duplicates
var duplicateFinderService = new DuplicateFinderService(directoryPath);
var duplicates = duplicateFinderService.FindDuplicates();

// Display results
foreach (var duplicate in duplicates)
{
    Console.WriteLine($"Hash: {duplicate.Key}");
    foreach (var filePath in duplicate.Value)
    {
        Console.WriteLine($" - {filePath}");
    }
}

How It Works

Components

FileReader: Reads files from the file system.
FileHasher: Computes a hash value for each file to determine duplicates.
DirectoryScanner: Scans the specified directory and retrieves a list of all files. It uses the DirectoryProvider to access the file system, ensuring better testability and separation of concerns.
DirectoryProvider: Provides direct access to the file system, used by DirectoryScanner.
DuplicateFinderService: The main service that ties all components together and provides an easy-to-use interface for finding duplicates.
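
The separation between DirectoryScanner and DirectoryProvider is what makes the scanner testable without touching the disk. The sketch below shows one way such a split could look. The interface and class names (IDirectoryProvider, FileSystemDirectoryProvider, FakeDirectoryProvider, DirectoryScannerSketch) and the recursive search option are hypothetical, and the package's real types may differ.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical shapes for illustration; the package's real types may differ.
public interface IDirectoryProvider
{
    IEnumerable<string> EnumerateFiles(string directory);
}

// Production implementation: delegates to the real file system.
// Searching subdirectories is an assumption of this sketch.
public sealed class FileSystemDirectoryProvider : IDirectoryProvider
{
    public IEnumerable<string> EnumerateFiles(string directory) =>
        Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories);
}

// Test double: returns a fixed list of paths without touching the disk.
public sealed class FakeDirectoryProvider : IDirectoryProvider
{
    private readonly IReadOnlyList<string> _files;

    public FakeDirectoryProvider(params string[] files) => _files = files;

    public IEnumerable<string> EnumerateFiles(string directory) => _files;
}

// The scanner depends only on the interface, so either provider can be injected.
public sealed class DirectoryScannerSketch
{
    private readonly IDirectoryProvider _provider;

    public DirectoryScannerSketch(IDirectoryProvider provider) => _provider = provider;

    public List<string> GetFiles(string directory) =>
        _provider.EnumerateFiles(directory).ToList();
}
```

A unit test can then construct the scanner with FakeDirectoryProvider and assert on the returned paths, while production code passes FileSystemDirectoryProvider.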

Workflow

The user specifies a directory to be scanned.
DirectoryScanner retrieves a list of all files in the directory.
FileHasher computes a hash for each file.
Duplicate files are identified based on their hash values and returned in a dictionary.
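
The steps above can be sketched end to end as follows. This is an illustrative approximation under stated assumptions, not the package's implementation: WorkflowSketch is a hypothetical name, SHA-256 is an assumed hash algorithm, and the sketch searches subdirectories and hashes every file in parallel without the early byte check.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical end-to-end sketch; not the package's actual implementation.
public static class WorkflowSketch
{
    public static Dictionary<string, List<string>> FindDuplicates(string directory)
    {
        var byHash = new ConcurrentDictionary<string, ConcurrentBag<string>>();

        // Step 1: retrieve all files (recursion is an assumption of this sketch).
        var files = Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories);

        // Step 2: hash files concurrently and collect paths under their hash.
        Parallel.ForEach(files, path =>
        {
            using var stream = File.OpenRead(path);
            using var sha = SHA256.Create();
            var hash = Convert.ToHexString(sha.ComputeHash(stream));
            byHash.GetOrAdd(hash, _ => new ConcurrentBag<string>()).Add(path);
        });

        // Step 3: keep only hashes shared by two or more files.
        return byHash
            .Where(pair => pair.Value.Count > 1)
            .ToDictionary(
                pair => pair.Key,
                pair => pair.Value.OrderBy(p => p, StringComparer.Ordinal).ToList());
    }
}
```

The thread-safe collections are needed because several worker threads add paths at the same time. Sorting each group at the end only makes the output order predictable.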

Programazing/CopyCatcher