- Aditya Mehta - 22110017
- Daksh Jain - 22110066
- Hrriday Ruparel - 22110099
- Kishan Ved - 22110122
- Sumeet Sawale - 22110234
This project involves scraping article data, downloading large datasets, cleaning Nepali text files, and generating hashes for deduplication. The pipeline follows these steps: Scraping, Downloading, Cleaning, Hash Generation, and Deduplication.
Google Sheet containing information about datasets scraped and downloaded.
Hugging Face repository containing scraped datasets that were collected and cleaned on local machines.
- Scraped a webpage containing only links to various articles.
- Collected all links and stored them in a CSV file.
- Iterated over the CSV file and opened each link (webpage) using `webdriver`/`BeautifulSoup`.
- Scraped each webpage and extracted Nepali data present in `<p>` tags inside specific `<div>` tags (see the sketch after this list).
- Created a `.txt` file for every webpage.
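A minimal sketch of this loop, using Requests + BeautifulSoup (the report also mentions `webdriver`; a Selenium version would follow the same structure). The CSV column name, the `<div>` selector, and the output file naming are assumptions, since the real selectors differ per site.

```python
# Hedged sketch of the link-list scraping loop. Assumptions (not from the report):
# the links CSV has a "url" column, the article body sits in <p> tags inside a
# <div class="description">, and output files are named article_<i>.txt.
import csv

import requests
from bs4 import BeautifulSoup

def scrape_article(url: str) -> str:
    """Fetch one article page and return the Nepali text found in its <p> tags."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    container = soup.find("div", class_="description")  # placeholder selector
    if container is None:
        return ""
    return "\n".join(p.get_text(strip=True) for p in container.find_all("p"))

with open("article_links.csv", newline="", encoding="utf-8") as links_file:
    for i, row in enumerate(csv.DictReader(links_file)):
        text = scrape_article(row["url"])
        if text:
            with open(f"article_{i}.txt", "w", encoding="utf-8") as out:
                out.write(text)
```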
- Downloaded datasets, most of which were in Parquet format.
- Used a Python script (`pandas`) to convert Parquet files into text files (see the sketch after this list).
- Separated the text files into individual article texts using unique identifiers or line breaks.
- Cleaned the files by removing bad words and non-Devanagari characters.
- Used `pandas` text-processing techniques to convert CSV articles into text files.
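A short sketch of the Parquet-to-text conversion step with `pandas`. The file name and the `text` column are assumptions; reading Parquet also requires an engine such as `pyarrow`.

```python
# Hedged sketch: convert one Parquet file into per-article .txt files.
# Assumes a "text" column; real column names vary per dataset.
import pandas as pd

df = pd.read_parquet("dataset.parquet")  # needs pyarrow or fastparquet installed

for i, article in enumerate(df["text"].astype(str)):
    with open(f"doc_{i}.txt", "w", encoding="utf-8") as out:
        out.write(article)
```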
- Collected a list of bad words from various online sources.
- Translated English bad words into Nepali using translation libraries.
- Cross-verified the bad word list with a native Nepali speaker, Sidharth (+91 93457 66974), for accuracy.
- Saved the final list in the `Nepali_Bad_Words.txt` file.
- Iterated over all `.txt` files in the given folder (a cleaning sketch follows this list):
  - If bad word(s) were present (as per the `Nepali_Bad_Words.txt` file), the file was moved to the `bad_texts` folder.
  - If no bad words were found, all non-Devanagari characters, except for whitespace and some punctuation marks, were removed.
  - The cleaned text was written to a new `.txt` file in the `good_texts` folder.
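A minimal sketch of this cleaning pass, assuming the folder names above and a hypothetical `raw_texts` input folder; the exact punctuation kept is an assumption, since the report only says "some punctuation marks".

```python
# Hedged cleaning sketch: quarantine files containing bad words, otherwise strip
# non-Devanagari characters (whitespace and basic punctuation are kept).
import os
import re
import shutil

SOURCE_DIR = "raw_texts"                 # hypothetical input folder
GOOD_DIR, BAD_DIR = "good_texts", "bad_texts"

with open("Nepali_Bad_Words.txt", encoding="utf-8") as f:
    bad_words = {line.strip() for line in f if line.strip()}

# Keep the Devanagari block (which already includes the danda and double danda),
# whitespace, and a small punctuation whitelist; drop everything else.
non_devanagari = re.compile(r"[^\u0900-\u097F\s.,!?]")

os.makedirs(GOOD_DIR, exist_ok=True)
os.makedirs(BAD_DIR, exist_ok=True)

for name in os.listdir(SOURCE_DIR):
    if not name.endswith(".txt"):
        continue
    path = os.path.join(SOURCE_DIR, name)
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if any(word in text for word in bad_words):
        shutil.move(path, os.path.join(BAD_DIR, name))   # file goes to bad_texts
    else:
        with open(os.path.join(GOOD_DIR, name), "w", encoding="utf-8") as out:
            out.write(non_devanagari.sub("", text))      # cleaned copy in good_texts
```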
For each
.txt
file scraped or downloaded, two hash values (MinHash and SimHash) were generated to assist with deduplication tasks.- MinHash: Uses 128 hash functions to generate 128 hash values and compute the Jaccard distance between two documents.
- SimHash: A 128-bit hash for each
.txt
file.
-
For very large datasets containing only a single document, hash generation was skipped as they couldn't be meaningfully separated into smaller documents.
-
Distance Metrics:
- Jaccard distance (MinHash): Lower the value, higher the similarity.
- Hamming distance (SimHash): Lower the value, higher the similarity.
-
LSH (Locality-Sensitive Hashing) was used to compare hash values for deduplication.
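The report does not name the hashing library; the sketch below uses the `datasketch` package as one possible way to build 128-permutation MinHash signatures and query them through an LSH index. The documents and the 0.8 similarity threshold are placeholders.

```python
# Hedged MinHash + LSH deduplication sketch using datasketch.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a 128-permutation MinHash signature from whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

docs = {                                  # placeholder documents
    "doc1.txt": "नेपाली पाठको पहिलो उदाहरण",
    "doc2.txt": "नेपाली पाठको दोस्रो उदाहरण",
}

# Index signatures in an LSH structure; pairs whose estimated Jaccard similarity
# exceeds the threshold come back as near-duplicate candidates.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {name: minhash_of(text) for name, text in docs.items()}
for name, sig in signatures.items():
    lsh.insert(name, sig)

for name, sig in signatures.items():
    candidates = [c for c in lsh.query(sig) if c != name]
    print(name, "near-duplicate candidates:", candidates)
```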
- Scraped:
- Feature articles from Ekantipur.
- Opinion articles from Ekantipur.
- Books from Internet Archive: Bhagavad Gita, Mahabharata, Ramayan, 3 Vedas, and 7 parts of Bhagwat Mahapuran.
- Articles from Online Khabar.
- Developed code for:
- Scraping Ekantipur and Online Khabar articles.
- Generating SimHash for all text files (see the sketch after this list).
- Parallelizing scraping on the server for Ekantipur and Online Khabar.
- Cleaning data by removing bad words and non-Nepali (Devanagari) symbols from all text files.
- Created a Hugging Face repository and uploaded the scraped files.
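Since the SimHash code itself is not shown in the report, here is a hedged, self-contained sketch of one way to compute a 128-bit SimHash (MD5 per token with a frequency-weighted bit vote) together with the Hamming distance used to compare two fingerprints.

```python
# Hedged 128-bit SimHash sketch: hash each token with MD5 (128 bits), take a
# frequency-weighted vote per bit position, and set bits with a positive total.
import hashlib
from collections import Counter

def simhash128(text: str) -> int:
    votes = [0] * 128
    for token, count in Counter(text.split()).items():
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for bit in range(128):
            votes[bit] += count if (digest >> bit) & 1 else -count
    return sum(1 << bit for bit in range(128) if votes[bit] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; lower means more similar documents."""
    return bin(a ^ b).count("1")

print(hamming_distance(simhash128("नेपाली पाठ एक"), simhash128("नेपाली पाठ दुई")))
```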
- Scraped:
- Blogs: Ghumante, Mysansar.
- Literature articles: Nai Academy, Shabd Sopan, Samakalin Sahitya.
- News articles: Annapurna Post, Desh Sanchar, Dainik Nepal, DcNepal, Farakdhar, Makalu Khabar, Sagarmatha TV, Sahitya Post.
- Developed a BeautifulSoup + Requests implementation for faster dataset curation compared to Chromedriver.
- Added ThreadPoolExecutor functionality to all scripts for enhanced speed (see the sketch below).
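A sketch of the Requests + BeautifulSoup approach parallelised with `ThreadPoolExecutor`; the URLs, the plain `<p>`-tag selection, and the worker count are placeholders rather than the actual per-site settings.

```python
# Hedged sketch: fetch several pages concurrently, since scraping is I/O-bound.
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

URLS = [
    "https://example.com/article-1",   # placeholder URLs
    "https://example.com/article-2",
]

def fetch_text(url: str) -> str:
    """Download one page and return the combined text of its <p> tags."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

with ThreadPoolExecutor(max_workers=8) as pool:   # worker count is arbitrary
    for url, text in zip(URLS, pool.map(fetch_text, URLS)):
        print(url, len(text), "characters scraped")
```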
- Scraped:
- Articles from Online Khabar and HimalKhabar.
- Downloaded:
- CulturaX dataset.
- English to Nepali translation dataset.
- Developed code for:
- Converting Parquet files to CSV.
- Generating MinHash for deduplication.
- Cleaning and generating hashes for the downloaded datasets.
- Scraped:
- Poems from a website.
- Examiner Data.
- Collected and verified bad words for the Nepali language with the help of a native speaker.
- Downloaded and cleaned 4 datasets (50GB combined) from Hugging Face.
- Separated 2 of these datasets and generated hashes for them.
- Contact: Sidharth (+91 93457 66974).
- Scraped:
- Articles from News24Nepal.
- Data from Nepal Wiki.
- Cleaned and hashed both datasets.
- Downloaded:
- 3 datasets (AllenAI, Statmt, Oscar).
- A large-scale Nepali text corpus from IEEE; split it into individual files while preserving semantics, then cleaned and hashed them.
- Developed code for parallelizing scraping using ThreadPoolExecutor for the listed tasks.
- Sidharth: For verifying the collected bad words.
- Rakesh Thakur: For providing valuable information about Nepali datasets.