Levenshtein-Grouper is a tool for computing the Levenshtein distances between lines of text in files within a specified directory. Designed with performance and scalability in mind, it suits tasks that require text matching and comparison. It is especially useful in security contexts such as duplicate identification, pattern recognition, and anomaly detection.
- Multi-threaded Performance: Leverages the Rayon library for parallel computing.
- Progress Tracking: Offers real-time progress updates with Indicatif.
- JSON Export: Exports the results to a dynamically named JSON file.
- Fine-grained Control: Set your own limits for Levenshtein distance calculations.
- Color-coded Output: Easy-to-read, color-coded terminal output.
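
For readers unfamiliar with the metric: the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The following is a minimal standard-library sketch of that computation, not the tool's actual implementation; the function name `levenshtein` and the choice to count Unicode scalar values rather than bytes are assumptions made for illustration.

```rust
/// Levenshtein distance between two strings, counted in Unicode scalar values.
/// Illustrative sketch only; the name and signature are hypothetical.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] is the distance between the first i chars of `a`
    // and the first j chars of `b`; it starts with i = 0.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];

    for (i, &ca) in a.iter().enumerate() {
        // Turning i + 1 chars of `a` into an empty string takes i + 1 deletions.
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        // The row just filled becomes the previous row for the next char of `a`.
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3: two substitutions and one insertion. Keeping only two rows of the table holds memory use proportional to the length of the second string.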
To install Levenshtein-Grouper, clone the repository first:
git clone https://github.com/copyleftdev/levenshtein-grouper.git
Navigate into the project directory:
cd levenshtein-grouper
Compile the code:
cargo build --release
The compiled binary will be located under target/release.
Run the following command to compute Levenshtein distances among text strings within files in a specific directory:
levenshtein-grouper --path /path/to/directory
To set a maximum limit for the Levenshtein distance in the calculations:
levenshtein-grouper --path /path/to/directory --distance 5
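
To illustrate what a distance limit means in practice, here is a rough sketch of filtering line pairs by a maximum distance. It assumes the limit is inclusive and that results are reported as pairs of line indices; neither detail is confirmed by this README, and the names `similar_pairs` and `max_distance` are hypothetical. The sketch runs sequentially, whereas the tool itself parallelizes its work with Rayon.

```rust
/// Levenshtein distance over pre-collected chars (two-row dynamic programming).
fn levenshtein(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            curr[j + 1] = substitution.min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Returns (i, j, distance) for every pair of lines with i < j whose
/// distance is at most `max_distance` (assumed inclusive).
fn similar_pairs(lines: &[&str], max_distance: usize) -> Vec<(usize, usize, usize)> {
    // Collect chars once so each line is decoded a single time.
    let chars: Vec<Vec<char>> = lines.iter().map(|l| l.chars().collect()).collect();
    let mut pairs = Vec::new();
    for i in 0..chars.len() {
        for j in (i + 1)..chars.len() {
            // The distance is never smaller than the difference in length,
            // so pairs that differ too much in length can be skipped outright.
            if chars[i].len().abs_diff(chars[j].len()) > max_distance {
                continue;
            }
            let d = levenshtein(&chars[i], &chars[j]);
            if d <= max_distance {
                pairs.push((i, j, d));
            }
        }
    }
    pairs
}
```

With the lines `"admin login failed"`, `"admin login failde"`, and `"disk quota exceeded"` and a limit of 2, only the first two lines are paired, at distance 2. A lower limit lets the length check discard more pairs before any distance is computed, which matters because the number of pairs grows quadratically with the number of lines.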
To save the results as a dynamically named JSON file:
levenshtein-grouper --path /path/to/directory --json
- Identifying Similar Code Blocks for Malware Analysis: Detect segments of code that are almost identical across different malware families. This can help identify the techniques or algorithms commonly used by attackers.
- Phishing Email Detection: Compare the contents of incoming emails with known phishing templates to flag suspicious emails.
- Password Strength Analysis: Check if a new password is too similar to previously compromised passwords in a leaked database.
- Duplicate Content Detection: Identify almost identical blocks of text across multiple documents or web pages to avoid SEO penalties for duplicate content.
- Text Similarity: Measure the similarity between different versions of the same article or blog post.
- Plagiarism Check: Compare a document against a database of existing works to identify potential plagiarism.
- Chatbots: Improve the accuracy of chatbot responses by measuring the similarity between the user input and the pre-defined queries.
- Translation Memory Systems: Find similar sentences or paragraphs in a corpus of previously translated text to assist human translators.
- Language Learning Apps: Identify common mistakes or alternative answers in language learning exercises.
- Code Review: Highlight lines of code that are nearly identical and may be candidates for refactoring into a function.
- Code Reusability: Search for similar code blocks across projects to identify potential libraries or modules that could be created for reusability.
Contributions are welcome! Feel free to open a pull request.