- Get the latest Tamil Wikipedia dump from here
- The downloaded file is named tawiki-latest-pages-articles.xml.bz2
- Extract it using the following command:
bunzip2 tawiki-latest-pages-articles.xml.bz2
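If bunzip2 is not available, the same extraction can likely be done with Python's standard bz2 module. The sketch below is only an illustration: the function name decompress_bz2 is hypothetical and not part of this project.

```python
import bz2
import shutil


def decompress_bz2(src_path, dest_path):
    """Stream-decompress a .bz2 file to dest_path without loading it fully into memory."""
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


# Example call (file names taken from the download step above):
# decompress_bz2("tawiki-latest-pages-articles.xml.bz2",
#                "tawiki-latest-pages-articles.xml")
```

Unlike bunzip2, which removes the compressed file by default, this sketch leaves the .bz2 file in place.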
This project consists of a Bash script (line.sh) and a Python script (tamil-word-sorter.py) that should be run one after another to process the XML file, extract Tamil words, and analyze their frequency.
- line.sh: Bash script for initial text processing and word extraction
- tamil-word-sorter.py: Python script for Tamil word identification, sorting, and frequency analysis

- Input XML file (user-provided)
- Bash shell
- Python 3.x
- The following Python modules (all part of the standard library):
  - collections
  - csv
- Run the Bash script to process the input file and create a list of words:
bash line.sh
You will be prompted to enter the filename of your input file. Enter the name of the XML file you downloaded and extracted from the Tamil Wikipedia dump.
- After the Bash script completes, run the Python script:
python3 tamil-word-sorter.py
You will be prompted to enter the filename of your input XML file. Enter the name of the file you downloaded and extracted from the Tamil Wikipedia dump.
The Python script generates several output files:
- tamil-words.txt: Contains all Tamil words extracted from the input file
- only_uniq_tamil_words.txt: Contains unique Tamil words
- only_tamil_uniq_sorted_words.txt: Contains unique Tamil words sorted alphabetically
- tamil_words_by_frequency.csv: A CSV file with Tamil words sorted by frequency (descending order)
The Bash script (line.sh) does the following:
- Prompts the user for an input filename
- Checks if the file exists
- Processes the file by:
  - Removing English characters
  - Replacing various punctuation and special characters with newlines
  - Removing leading/trailing whitespace
  - Removing empty lines
- Outputs the processed words to a file named words
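The processing steps above can be sketched in Python to show the logic in one place. This is not the content of line.sh. The function name extract_words is hypothetical, "English characters" is assumed to mean the ASCII letters, and the separator set (ASCII punctuation plus whitespace) is an assumption, since the exact characters the script replaces are not listed here.

```python
import re
import string

# Assumption: "English characters" means the ASCII letters A-Z and a-z.
ENGLISH = re.compile(r"[A-Za-z]")

# Assumption: ASCII punctuation and whitespace act as word separators.
SEPARATORS = re.compile("[" + re.escape(string.punctuation) + r"\s]+")


def extract_words(text):
    """Return the candidate words in text, one list entry per word."""
    text = ENGLISH.sub("", text)            # remove English characters
    text = SEPARATORS.sub("\n", text)       # separators become newlines
    lines = (line.strip() for line in text.split("\n"))  # trim whitespace
    return [line for line in lines if line]              # drop empty lines


# Writing one word per line would reproduce the shape of the words file:
# with open("words", "w", encoding="utf-8") as f:
#     f.write("\n".join(extract_words(raw_text)) + "\n")
```

Digits and other non-Tamil tokens survive this step; filtering them out is left to the Python script.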
The Python script (tamil-word-sorter.py) performs the following operations:
- Defines a function is_tamil() to identify Tamil words based on Unicode character ranges
- Reads the words file created by line.sh
- Filters out non-Tamil words
- Generates various output files as described in the Output section
- Counts word frequencies and sorts words by frequency
- Provides console output about the number of unique words and the files created
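The frequency step can be sketched with the two standard-library modules listed in the requirements. This is an illustration rather than the script's actual code: the function name write_frequency_csv and the header row are assumptions.

```python
import csv
from collections import Counter


def write_frequency_csv(words, csv_path):
    """Count words and write (word, count) rows sorted by count, descending."""
    counts = Counter(words)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "frequency"])  # assumed header row
        # most_common() yields (word, count) pairs, highest count first
        writer.writerows(counts.most_common())
    return counts
```

A call such as write_frequency_csv(tamil_words, "tamil_words_by_frequency.csv") would produce a file of the kind described above, and len(counts) gives the number of unique words.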
- The scripts assume UTF-8 encoding for input and output files
- Error handling is implemented for file operations
- The Tamil word identification is based on the Unicode code point range 2944-3071 (decimal), which is U+0B80 to U+0BFF, the Tamil block
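A minimal sketch of a range check consistent with the note on the Unicode range. The actual is_tamil() in tamil-word-sorter.py may differ; in particular, requiring every character of the word to fall in the range is an assumption made here.

```python
TAMIL_START = 2944  # U+0B80, first code point of the Tamil block
TAMIL_END = 3071    # U+0BFF, last code point of the Tamil block


def is_tamil(word):
    """Return True if word is non-empty and every character is in the Tamil block."""
    # Assumption: a single character outside the range disqualifies the word.
    return bool(word) and all(TAMIL_START <= ord(ch) <= TAMIL_END for ch in word)
```

Under this assumption, a mixed token such as a Tamil word with a trailing ASCII digit is rejected, which keeps the output lists free of partially Tamil strings.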
Feel free to fork this project and submit pull requests with any enhancements.
The code is licensed under GPL v3.