Understood! Here's the updated README reflecting that there is only one Sukhotin's Algorithm, which is the modified one:

Vowel Identification Algorithms

Overview

This repository contains implementations of algorithms designed to identify vowels in texts by analyzing character adjacency patterns and co-occurrence statistics. By leveraging the statistical properties of letters in natural languages, these algorithms can classify letters as vowels or consonants, making them valuable tools for linguists, cryptanalysts, and language enthusiasts.

The repository includes:

Sukhotin's Algorithm: An implementation based on the work of B.V. Sukhotin, which iteratively classifies letters as vowels or consonants based on adjacency sums. This script has been modified to process the same dataset as Algorithm 1 and outputs its results to a separate file. It includes detailed in-line comments explaining the logic of the algorithm, making it easier to understand and modify.
Spectral Decomposition Algorithm (Algorithm 1): An implementation of Algorithm 1 from the paper:

"Vowel and Consonant Classification through Spectral Decomposition"
Patricia Thaine and Gerald Penn
Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 82–91, Copenhagen, Denmark, September 7, 2017.

Both algorithms operate independently of any specific language, making them applicable to a wide range of texts and scripts.

Features

Language-Agnostic: Process texts in any language and identify vowel patterns, regardless of orthography.
Unsupervised Learning: No training data or labeled examples are required; the algorithms discover patterns inherent in the text data itself.
Efficient Computation: Utilize NumPy for matrix operations and efficient algorithms optimized for performance.
Detailed Explanations: Sukhotin's algorithm script includes comprehensive in-line comments, aiding in understanding the implementation.
File Output: Both algorithms output results to their own files for easy analysis and record-keeping.
High Accuracy: Demonstrated strong performance in correctly classifying vowels and consonants across multiple languages.
Alphabetical Output: Results are sorted alphabetically for easy interpretation.

Why the Algorithms Work

Sukhotin's Algorithm

Sukhotin's algorithm relies on fundamental statistical properties observed in natural languages:

Alternation of Vowels and Consonants: Vowels and consonants frequently alternate in words, leading to characteristic adjacency patterns. This behavior helps the algorithm detect letters that function as vowels.
Frequency of Vowels: Vowels tend to appear frequently and in multiple combinations, creating higher adjacency counts, which the algorithm leverages for classification.

Spectral Decomposition Algorithm (Algorithm 1)

The spectral decomposition algorithm is based on the observation that vowels and consonants exhibit different co-occurrence patterns in text:

Character Contexts (P-Frames): By examining the immediate preceding and succeeding characters around each letter (p-frames), the algorithm captures the local context in which letters appear.
Spectral Decomposition: Applying Singular Value Decomposition (SVD) to the matrix constructed from these contexts reveals underlying structures in the data. The second right singular vector, in particular, tends to separate vowels and consonants due to their distinct distributional properties.
Most Frequent Letter Assumption: The algorithm assumes that the most frequent letter in the text is a vowel—a reasonable assumption for many languages (e.g., 'e' in English).

Algorithm Workflows

Sukhotin's Algorithm Workflow

1. Reading the Dataset

The script reads from the dataset, ensuring consistency in data processing with Algorithm 1.
Language text files are read from the Test/data/ directory, named according to their language codes.

2. Preprocessing the Text

Convert text to lowercase and remove non-alphabetic characters, standardizing the input for analysis.

3. Creating the Frequency Matrix

Create a symmetric matrix where each cell (i, j) represents the number of times character i is adjacent to character j in the text.
Exclude self-adjacency (e.g., repeated characters like 'aa') to focus on interactions between different characters.

4. Calculating Adjacency Sums

Sum each character’s adjacency counts, highlighting characters with high connectivity to others, often vowels.

5. Iterative Classification of Vowels and Consonants

Iteratively classify characters with the highest remaining adjacency sums as vowels, adjusting the counts to minimize the influence of previously classified vowels.
Adjustments involve subtracting twice the adjacency counts of the classified character from the remaining sums.

6. Threshold-Based Reclassification

Use an adjustable threshold to ensure that vowels with slightly lower adjacency sums are not misclassified.
Reclassify consonants with adjusted sums above the threshold as vowels.

7. Sorting the Results

Sort classified vowels and consonants alphabetically for clarity.

8. Outputting Results to File

The script writes the classification results to its own output file (suxotin_algorithm_output.txt), including execution time and any errors encountered.

Spectral Decomposition Algorithm Workflow (Algorithm 1)

1. Preprocessing the Text

Convert text to lowercase and remove non-alphabetic characters.
Tokenize the text into words and letters for analysis.

2. Constructing the P-Frame Matrix

For each letter in the text, create a p-frame consisting of its immediate preceding and succeeding characters with the letter itself replaced by a placeholder (e.g., (prev_char, '*', next_char)).
Build a binary matrix A where each row represents a unique p-frame, and each column represents a letter. An entry A[i, j] = 1 indicates that letter j occurs in p-frame i.

3. Singular Value Decomposition (SVD)

Perform SVD on matrix A to obtain matrices U, Σ, and V^T.
Extract the right singular vectors in V, focusing on the second vector V[:, 1].

4. Classifying Letters

Split letters into two clusters based on the sign of their corresponding component in the second right singular vector.
Assign labels: The cluster containing the most frequent letter is labeled as vowels; the other as consonants.

5. Output Results

Present the classified vowels and consonants in alphabetical order.
Results are written to algorithm1_output.txt.

Performance and Evaluation

Sukhotin's Algorithm

The algorithm was tested across multiple languages with distinct vowel and consonant sets. Here are the results, including precision, recall, and F1 scores:

Language	True Positives (TP)	False Positives (FP)	False Negatives (FN)	Precision	Recall	F1 Score
German	12	1	0	0.923	1.000	0.960
French	17	1	1	0.944	0.944	0.944
Spanish	11	2	0	0.846	1.000	0.917
Italian	12	2	4	0.857	0.750	0.800
Dutch	12	0	1	1.000	0.923	0.960
Greek	20	3	2	0.870	0.909	0.889
English	9	0	0	1.000	1.000	1.000
Swedish	12	2	0	0.860	1.000	0.920
Portuguese	15	3	3	0.830	0.830	0.830
Finnish	10	2	0	0.830	1.000	0.910

Performance Summary

Strong Precision: The algorithm achieved good precision across most languages, with English showing perfect precision (1.0). Most other languages maintained precision above 0.84.
Varied Recall: Recall scores ranged from perfect (1.0) to lower values, demonstrating good but not perfect detection of vowel patterns, including accented variants.
F1 Scores: Strong overall performance, with English achieving a perfect score and other languages showing solid results.

Spectral Decomposition Algorithm (Algorithm 1)

The algorithm was tested using the same dataset as Sukhotin's algorithm and demonstrated high accuracy in classifying vowels and consonants across multiple languages.

Example Results

Processing language: English (eng)
Vowels in English: a, e, i, o, u
Consonants in English: b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z

Accuracy: The algorithm correctly identified all standard English vowels and classified consonants accurately.
Handling Ambiguous Letters: The letter 'y' was classified based on its predominant usage in the corpus.

Repository Contents

suxotin_algorithm.py: Python script implementing Sukhotin's algorithm, modified to read from the same dataset as Algorithm 1 and output results to its own file. Includes detailed in-line comments.
algorithm1.py: Python script implementing Algorithm 1 from the paper.
lang_code.csv: CSV file containing language codes and names used for processing.
Test/data/: Directory containing language text files named according to their language codes.
Documentation Files:
- Decipherment_Models_1973.pdf: Original version of Sukhotin's algorithm.
- guySuxotin.pdf: An alternative version of the algorithm.
- vowel-algorithm.txt: Text version of the algorithm for further exploration.
Vowel and Consonant Classification through Spectral Decomposition.pdf: Paper detailing Algorithm 1.
README.md: This file, summarizing the algorithms and instructions.

Installation

Prerequisites

Python 3.6 or higher
Pip (Python package installer)

Steps

Clone the Repository

Using HTTPS:

git clone https://github.com/your_username/your_repository.git

Or using GitHub CLI:

gh repo clone your_username/your_repository

Navigate to the Repository
```
cd your_repository
```

Create a Virtual Environment (Optional but Recommended)

python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate

Install the Required Dependencies
```
pip install -r requirements.txt
```
Note: If requirements.txt is not provided, install the necessary packages manually:
```
pip install numpy pandas
```
Prepare the Data
- Place your language text files in the Test/data/ directory.
- Ensure that each file is named exactly as the language code specified in lang_code.csv, without any file extension.
- Example:
```
Test/data/eng
Test/data/deu
Test/data/fra
```
Prepare the lang_code.csv File
- Ensure that lang_code.csv is in the same directory as the scripts.
- The CSV should have at least two columns: code and language.

Usage

Running Sukhotin's Algorithm

Execute the Script
```
python suxotin_algorithm.py
```
View the Results
- Results will be saved to suxotin_algorithm_output.txt.
- The file will contain the classified vowels and consonants for each language, including execution times and any errors.

Running the Spectral Decomposition Algorithm (Algorithm 1)

Execute the Script
```
python algorithm1.py
```
View the Results
- Results will be saved to algorithm1_output.txt.
- The file will contain the classified vowels and consonants for each language.

Understanding the Code

Sukhotin's Algorithm (`suxotin_algorithm.py`)

Reading the Dataset: The script reads language text files from the Test/data/ directory, ensuring it processes the same dataset as Algorithm 1.
Detailed In-Line Comments: The script includes comprehensive comments explaining each step, making it easier to understand and modify.
Functions:
- get_language_data: Reads text data for a specified language code.
- preprocess_text: Cleans the text by converting to lowercase and removing non-alphabetic characters.
- create_frequency_matrix: Constructs a symmetric adjacency matrix of character pairs.
- classify_vowels: Implements the core of Sukhotin's algorithm, classifying characters based on adjacency sums.
- run_analysis_for_language: Orchestrates the processing for each language and collects results.
Output: Results are written to suxotin_algorithm_output.txt, including execution times for performance insights.

Spectral Decomposition Algorithm (`algorithm1.py`)

Functions:
- vowel_consonant_classification: Classifies letters based on the signs of the components in the second right singular vector obtained from SVD.
- algorithm1: Prepares the corpus, constructs the p-frame matrix, performs SVD, and calls the classification function.
- run_algorithm_for_language: Processes each language and writes results to an output file.
Key Libraries:
- NumPy: Used for numerical computations and performing SVD.
- Pandas: For handling CSV data and data manipulation.
File Output: Results are saved to algorithm1_output.txt.

Extending the Algorithms

Different Corpora: Modify the scripts to use other corpora or your own text data by adjusting the test_folder variable and ensuring your text files are properly formatted.
Language Adaptation: Apply the algorithms to texts in other languages. Validate assumptions (e.g., most frequent letter being a vowel) for each language.
Threshold Adjustment (Sukhotin's Algorithm): Experiment with different threshold values in the classify_vowels function to optimize performance for specific languages or datasets.
Detailed Analysis: Use the execution times and logging to analyze performance bottlenecks or to compare algorithm efficiency across languages.

References

Sukhotin's Algorithm:
- Sukhotin, B.V. (1962). Eksperimental’noe vydelenie klassov bukv s pomoshch’ju elektronnoj vychislitel’noj mashiny.
Spectral Decomposition Algorithm:
- Thaine, P., & Penn, G. (2017). Vowel and Consonant Classification through Spectral Decomposition. In Proceedings of the First Workshop on Subword and Character Level Models in NLP (pp. 82–91). Copenhagen, Denmark: Association for Computational Linguistics. Link to Paper

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.

Acknowledgments

B.V. Sukhotin: For the foundational vowel identification algorithm.
Patricia Thaine and Gerald Penn: For their work on vowel and consonant classification through spectral decomposition.
NLTK Team: For providing a comprehensive toolkit for natural language processing in Python.
OpenAI: For offering guidance on best practices in AI and code generation.

This README was updated to include a modified implementation of Sukhotin's algorithm that reads from the same dataset as Algorithm 1 and outputs its results to a separate file. The script includes detailed in-line comments explaining the logic of the algorithm, enhancing the project's educational value and usability.

jhnwnstd/suxotin

Vowel Identification Algorithms

Overview

Features

Why the Algorithms Work

Sukhotin's Algorithm

Spectral Decomposition Algorithm (Algorithm 1)

Algorithm Workflows

Sukhotin's Algorithm Workflow

1. Reading the Dataset

2. Preprocessing the Text

3. Creating the Frequency Matrix

4. Calculating Adjacency Sums

5. Iterative Classification of Vowels and Consonants

6. Threshold-Based Reclassification

7. Sorting the Results

8. Outputting Results to File

Spectral Decomposition Algorithm Workflow (Algorithm 1)

1. Preprocessing the Text

2. Constructing the P-Frame Matrix

3. Singular Value Decomposition (SVD)

4. Classifying Letters

5. Output Results

Performance and Evaluation

Sukhotin's Algorithm

Performance Summary

Spectral Decomposition Algorithm (Algorithm 1)

Example Results

Repository Contents

Installation

Prerequisites

Steps

Usage

Running Sukhotin's Algorithm

Running the Spectral Decomposition Algorithm (Algorithm 1)

Understanding the Code

Sukhotin's Algorithm (suxotin_algorithm.py)

Spectral Decomposition Algorithm (algorithm1.py)

Extending the Algorithms

References

License

Acknowledgments

Sukhotin's Algorithm (`suxotin_algorithm.py`)

Spectral Decomposition Algorithm (`algorithm1.py`)