The Amharic Corpus Analysis project is designed to create a comprehensive Amharic language dataset by scraping and aggregating text data from various online sources, filtering out non-Amharic content, and analyzing the word frequency distribution to gain insights into the Amharic language.
This project is all about understanding and working with the Amharic language, which is spoken widely in Ethiopia and is one of Africa's most commonly used languages. We're collecting a lot of written Amharic text from different places on the internet, like news articles and websites, to create a valuable resource.
- Collecting Amharic Text: We're developing a tool that automatically finds and gathers Amharic text from the internet. This way, we'll have a diverse and up-to-date collection of Amharic language examples.
- Separating Amharic Text: Not all the text we find will be in Amharic. We're building a system that identifies which parts are in Amharic and removes any text in other languages.
- Analyzing Word Frequency: Once we have our collection of Amharic text, we'll study how often different words are used. This will help us understand which words are most common in the Amharic language.
- Checking Zipf's Law: In many languages, a word's frequency is roughly inversely proportional to its rank in the frequency table, a pattern known as Zipf's law. We'll investigate whether Amharic follows this pattern too.
- Sharing the Dataset: Finally, we'll make the curated Amharic collection and our analysis available to the public. This will allow researchers, linguists, and developers to use this valuable dataset for their own work.
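The filtering, counting, and Zipf steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the names `amharic_tokens` and `zipf_slope` are hypothetical, and treating every run of Ethiopic-script characters as an Amharic word is a simplifying assumption, since other languages such as Tigrinya use the same script.

```python
import math
import re
from collections import Counter

# Assumption: a "word" is a run of Ethiopic syllables (U+1200 to U+135A).
# This range leaves out Ethiopic punctuation and numerals, and it cannot
# tell Amharic apart from other languages written in the same script.
ETHIOPIC_WORD = re.compile(r"[\u1200-\u135A]+")


def amharic_tokens(text):
    """Return the Ethiopic-script words in text, dropping everything else."""
    return ETHIOPIC_WORD.findall(text)


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope close to -1 is what Zipf's law predicts.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    sample = "ሰላም ለዓለም። hello world ሰላም"
    counts = Counter(amharic_tokens(sample))
    for word, count in counts.most_common():
        print(word, count)
```

On a real corpus, `zipf_slope(counts)` gives a single number to compare against the -1 predicted by Zipf's law. A log-log plot of rank against frequency is still the better way to see where the fit breaks down.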
Why is this project important?
- Understanding Amharic: By studying and analyzing the Amharic language, we can learn more about how it works and how people use it.
- Creating Useful Tools: The information we gather can be used to develop helpful language tools, like translation services or AI assistants.
- Preserving Culture: Having a collection of Amharic text helps preserve the language and supports educational initiatives to keep Amharic alive and thriving.
- Empowering the Community: By sharing the Amharic dataset with everyone, we want to help researchers, developers, and Amharic speakers make the most of this valuable language resource.
If you're interested in learning more about this project or want to contribute, please visit our GitHub repository. We welcome collaborators from different backgrounds, including linguists, developers, and anyone interested in Amharic.
Let's explore and celebrate the richness of the Amharic language together through data-driven analysis and shared resources!
To get a local copy up and running, follow these steps.
To build and run this project, you'll need the following installed:
- CMake
- Gnuplot
- Python3
- Clone this repo:
git clone "http://github.com/sss"
- Create a build directory:
mkdir build
cd build
- Run CMake to generate the build files:
cmake ../
- Build the project:
cmake --build . --config Release
- Run the built executable:
./Release/ACAT
- Example run with custom parameters:
./Release/ACAT 15 ../tool/data/frequency_data.txt 1
To use the tools in the /tool directory, install their dependencies with:
pip install -r requirements.txt
- Make sure you have the necessary dependencies installed.
- Run the script from the command line:
python analyze_word_frequencies.py
- Enter the path to the input directory containing the text files.
- The script will create the frequency_data_<timestamp>.txt file in the same directory as the script, containing the sorted word frequencies.
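For orientation, here is a rough sketch of what a script like this might do. It is not the code of analyze_word_frequencies.py. The function names, the tab-separated `word<TAB>count` layout, the timestamp format, and the restriction to `.txt` input files are all assumptions made for illustration.

```python
import re
import time
from collections import Counter
from pathlib import Path

# Assumption: words are runs of Ethiopic syllables (U+1200 to U+135A).
WORD_RE = re.compile(r"[\u1200-\u135A]+")


def count_words_in_dir(input_dir):
    """Count Ethiopic-script words across all .txt files under input_dir."""
    counts = Counter()
    for path in sorted(Path(input_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(WORD_RE.findall(text))
    return counts


def write_frequency_file(counts, out_dir="."):
    """Write 'word<TAB>count' lines, most frequent first, and return the path."""
    stamp = time.strftime("%Y%m%d_%H%M%S")  # hypothetical timestamp format
    out_path = Path(out_dir) / f"frequency_data_{stamp}.txt"
    with out_path.open("w", encoding="utf-8") as fh:
        for word, count in counts.most_common():
            fh.write(f"{word}\t{count}\n")
    return out_path


if __name__ == "__main__":
    input_dir = input("Path to the input directory: ").strip()
    print("Wrote", write_frequency_file(count_words_in_dir(input_dir)))
```

If the output directory sits inside the input directory, a later run would pick up earlier frequency files as input, so keeping the two separate is the safer choice in this sketch.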
- ✅ Visualize word frequency in Amharic language
- ✅ Implement Zipf's distribution
- ◻️ Add more data and analyze how the results converge
I warmly welcome contributions to this project! Whether you have a bug fix, a feature enhancement, or new ideas, I would love to see them. Feel free to fork the repository, make your changes, and submit a pull request.
Don't forget to give the project a star!
Distributed under the GNU General Public License. See LICENSE.txt for more information.
Eual Uchiha - Telegram - Uchihaeual11@gmail.com
Project Link: https://github.com/github_username/repo_name