
Language: Python · License: MIT

English-Persian Tokenizer

Overview

The English-Persian Tokenizer is a simple Python program that classifies input words as either English or Persian. It uses a deterministic finite automaton (DFA) to perform the classification, making it a handy tool for distinguishing English and Persian words within a text.
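The DFA approach can be sketched roughly as follows. The state names, transition table, and Unicode range below are illustrative assumptions for this sketch, not the repository's actual implementation:

```python
# Sketch of a DFA that classifies a whole word as English or Persian.
# The Unicode range and state layout are assumptions, not the repo's code.

PERSIAN_RANGE = range(0x0600, 0x0700)  # Arabic/Persian Unicode block (assumption)

def char_class(ch):
    """Map a character to one of the DFA's input symbols."""
    if ch.isascii() and ch.isalpha():
        return "en"
    if ord(ch) in PERSIAN_RANGE:
        return "fa"
    return "other"

# States: start, english, persian; any unlisted transition rejects.
TRANSITIONS = {
    ("start", "en"): "english",
    ("start", "fa"): "persian",
    ("english", "en"): "english",
    ("persian", "fa"): "persian",
}

def classify(word):
    """Run the DFA over a word; return 'english', 'persian', or 'reject'."""
    state = "start"
    for ch in word:
        state = TRANSITIONS.get((state, char_class(ch)), "reject")
        if state == "reject":
            break
    return state if state in ("english", "persian") else "reject"
```

Because every transition is determined by the current state and the next character's class, each word is classified in a single left-to-right pass; mixed-script words fall into the reject state.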

Features

  • Tokenizes input text into English and Persian words.
  • Uses a DFA for efficient classification.
  • Easily customizable for additional languages or character sets.
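As a rough illustration of the tokenization step above, the following sketch splits text into words and tags each one by script. The `tokenize` and `classify_language` helpers are hypothetical names, and the real code may handle punctuation and mixed-script words differently:

```python
import re

def classify_language(word):
    """Tag a word as 'persian' or 'english' by its first letter (assumption:
    Persian words start with a character in the U+0600-U+06FF block)."""
    return "persian" if "\u0600" <= word[0] <= "\u06ff" else "english"

def tokenize(text):
    """Split text into word tokens and pair each with its language tag."""
    words = re.findall(r"\w+", text)  # \w matches Unicode word characters
    return [(w, classify_language(w)) for w in words]
```

For example, `tokenize("hello سلام")` would yield one English and one Persian token. Customizing for another language amounts to adding its character range to the classifier.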

Usage

  1. Clone or download this repository to your local machine.

  2. Ensure you have Python installed (Python 3 is recommended).

  3. Open a terminal and navigate to the repository's directory.

  4. Run the tokenizer.py script, passing the text you want to classify as a command-line argument:

    python tokenizer.py "Your input text here."
    

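A hypothetical entry point matching the command above might look like the sketch below; the repository's actual tokenizer.py may be structured differently:

```python
# Illustrative command-line entry point (assumption, not the repo's code).
import sys

def is_persian(word):
    """Rough check: first character in the Arabic/Persian Unicode block."""
    return "\u0600" <= word[0] <= "\u06ff"

def main(argv):
    if len(argv) < 2:
        print('usage: python tokenizer.py "Your input text here."')
        return 1
    # Split the argument on whitespace and tag each word.
    for word in argv[1].split():
        print(f"{word}\t{'persian' if is_persian(word) else 'english'}")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Quoting the text ensures the shell passes it as a single argument, so `argv[1]` contains the whole input string.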
Thank you for using the English-Persian Tokenizer!