nepal-LLM

Aims to be an LLM for the Nepali language. A team initiative in the NLP course at IIT Gandhinagar, Autumn 2024.


IITGN CS 613: Natural Language Processing - Assignment 1

Instructor: Prof. Mayank Singh

Team 3: Nepali NLP

Members:

  • Aditya Mehta - 22110017
  • Daksh Jain - 22110066
  • Hrriday Ruparel - 22110099
  • Kishan Ved - 22110122
  • Sumeet Sawale - 22110234

Overview

This project involves scraping article data, downloading large datasets, cleaning Nepali text files, and generating hashes for deduplication. The pipeline follows these steps: Scraping, Downloading, Cleaning, Hash Generation, and Deduplication.

Dataset information

Google Sheet containing information about the datasets scraped and downloaded

Hugging Face repository containing the scraped datasets that were collected and cleaned on local machines.

Pipeline

1. Scraping

  • Scraped a webpage containing only links to various articles.
  • Collected all links and stored them in a CSV file.
  • Iterated over the CSV file and opened each link (webpage) using a Selenium WebDriver or BeautifulSoup (a minimal sketch follows this list).
  • Scraped each webpage and extracted Nepali data present in <p> tags inside specific <div> tags.
  • Created a .txt file for every webpage.
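
The snippet below is a minimal sketch of this scraping step using requests + BeautifulSoup. The CSV column name (`url`), the `div` class (`article-content`), and the output file naming are placeholders, since the real scripts use site-specific selectors.

```python
# Sketch only: iterate over collected links, pull Nepali text from <p> tags
# inside article <div>s, and write one .txt file per webpage.
import csv

import requests
from bs4 import BeautifulSoup

def scrape_links(csv_path: str, out_dir: str) -> None:
    with open(csv_path, newline="", encoding="utf-8") as f:
        urls = [row["url"] for row in csv.DictReader(f)]

    for i, url in enumerate(urls):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        # Collect text from <p> tags inside the assumed article-body <div>.
        paragraphs = [
            p.get_text(strip=True)
            for div in soup.find_all("div", class_="article-content")
            for p in div.find_all("p")
        ]

        # One .txt file per webpage.
        with open(f"{out_dir}/article_{i}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(paragraphs))
```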

2. Downloading

  • Downloaded datasets, most of which were in parquet format.
  • Used a Python script (pandas) to convert Parquet files into text files (see the sketch after this list).
  • Separated the text files into individual article texts using unique identifiers or line breaks.
  • Cleaned the files by removing bad words and non-Devanagari characters.
  • Used pandas text-processing utilities to convert CSV articles into text files.
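
A minimal sketch of the Parquet-to-text conversion is shown below. The column name `text` and the one-file-per-row layout are assumptions; the actual datasets may use different schemas and separators between articles.

```python
# Sketch only: read a Parquet file with pandas and write each article row
# out as its own .txt file.
import pandas as pd

def parquet_to_texts(parquet_path: str, out_dir: str) -> None:
    df = pd.read_parquet(parquet_path)  # requires pyarrow or fastparquet
    for i, article in enumerate(df["text"].dropna()):
        with open(f"{out_dir}/doc_{i}.txt", "w", encoding="utf-8") as f:
            f.write(str(article))
```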

3. Cleaning

  • Collected a list of bad words from various online sources.
  • Translated English bad words into Nepali using translation libraries.
  • Cross-verified the bad word list with a native Nepali speaker, Sidharth (+91 93457 66974), for accuracy.
  • Saved the final list in the Nepali_Bad_Words.txt file.
  • Iterated over all .txt files in the given folder:
    • If bad word(s) were present (as per the Nepali_Bad_Words.txt file), the file was moved to the bad_texts folder.
    • If no bad words were found, all non-Devanagari characters, except for whitespaces and some punctuation marks, were removed.
    • The cleaned text was written to a new .txt file in the good_texts folder (a minimal sketch of this cleaning step follows).
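
The sketch below illustrates the cleaning step described above. The exact punctuation retained and the character-filtering regex are assumptions about the real script; only the file and folder names follow the README.

```python
# Sketch only: move files containing bad words to bad_texts/, strip
# non-Devanagari characters from the rest, and save them to good_texts/.
import os
import re
import shutil

# Keep Devanagari (U+0900-U+097F), whitespace, and some basic punctuation.
NON_DEVANAGARI_RE = re.compile(r"[^\u0900-\u097F\s,.!?]")

def clean_folder(src_dir: str, good_dir: str = "good_texts", bad_dir: str = "bad_texts") -> None:
    os.makedirs(good_dir, exist_ok=True)
    os.makedirs(bad_dir, exist_ok=True)
    with open("Nepali_Bad_Words.txt", encoding="utf-8") as f:
        bad_words = {w.strip() for w in f if w.strip()}

    for name in os.listdir(src_dir):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(src_dir, name)
        with open(path, encoding="utf-8") as f:
            text = f.read()

        if any(word in text for word in bad_words):
            # File contains at least one bad word: set it aside.
            shutil.move(path, os.path.join(bad_dir, name))
        else:
            # Strip non-Devanagari characters and save the cleaned copy.
            cleaned = NON_DEVANAGARI_RE.sub("", text)
            with open(os.path.join(good_dir, name), "w", encoding="utf-8") as out:
                out.write(cleaned)
```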

4. Hash Generation

  • For each .txt file scraped or downloaded, two hash values (MinHash and SimHash) were generated to assist with deduplication tasks.

    • MinHash: Uses 128 hash functions to produce a 128-value signature per document, from which the Jaccard similarity between two documents is estimated.
    • SimHash: A 128-bit fingerprint for each .txt file; near-duplicate files yield fingerprints with a small Hamming distance. A minimal sketch of both follows this list.
  • For very large datasets consisting of a single monolithic document, hash generation was skipped, as the content could not be meaningfully separated into smaller documents.
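
The sketch below shows one way to generate both hashes. MinHash uses the datasketch library with 128 permutations; the SimHash shown is a small hand-rolled 128-bit variant built on MD5 and is not necessarily the project's exact implementation. Whitespace tokenisation is also an assumption.

```python
# Sketch only: per-document MinHash signature (datasketch) and a 128-bit
# SimHash fingerprint for near-duplicate detection.
import hashlib

from datasketch import MinHash

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def simhash_128(text: str) -> int:
    # Accumulate a signed weight per bit position over all token hashes.
    weights = [0] * 128
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for bit in range(128):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # Bits with positive weight form the final 128-bit fingerprint.
    return sum(1 << bit for bit, w in enumerate(weights) if w > 0)

def hamming_distance(a: int, b: int) -> int:
    # Number of differing bits between two SimHash fingerprints.
    return bin(a ^ b).count("1")
```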

5. Deduplication

  • Distance Metrics:

    • Jaccard distance (MinHash): the lower the value, the higher the similarity.
    • Hamming distance (SimHash): the lower the value, the higher the similarity.
  • LSH (Locality-Sensitive Hashing) was used to compare hash signatures and flag near-duplicate documents (see the sketch below).
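
A minimal sketch of MinHash-based LSH deduplication with datasketch is shown below. The 0.8 Jaccard threshold and the in-memory index are assumptions; the real pipeline may use a different threshold or batch the comparisons differently.

```python
# Sketch only: index MinHash signatures in an LSH structure and report
# candidate duplicate pairs above a Jaccard similarity threshold.
from datasketch import MinHash, MinHashLSH

def find_duplicates(signatures: dict[str, MinHash], threshold: float = 0.8):
    """signatures maps a file name to its 128-permutation MinHash."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    duplicates = []
    for name, sig in signatures.items():
        # Any previously inserted document above the threshold is a candidate duplicate.
        for match in lsh.query(sig):
            duplicates.append((match, name))
        lsh.insert(name, sig)
    return duplicates
```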

Individual Contributions

Kishan

  • Scraped:
    • Feature articles from Ekantipur.
    • Opinion articles from Ekantipur.
    • Books from Internet Archive: Bhagavad Gita, Mahabharata, Ramayan, 3 Vedas, and 7 parts of Bhagwat Mahapuran.
    • Articles from Online Khabar.
  • Developed code for:
    • Scraping Ekantipur and Online Khabar articles.
    • Generating SimHash for all text files.
    • Parallelizing scraping on the server for Ekantipur and Online Khabar.
    • Cleaning data by removing bad words and non-Nepali (Devanagari) symbols from all text files.
  • Created a Hugging Face repository and uploaded the scraped files.

Hrriday

  • Scraped:
    • Blogs: Ghumante, Mysansar.
    • Literature articles: Nai Academy, Shabd Sopan, Samakalin Sahitya.
    • News articles: Annapurna Post, Desh Sanchar, Dainik Nepal, DcNepal, Farakdhar, Makalu Khabar, Sagarmatha TV, Sahitya Post.
  • Developed a BeautifulSoup + Requests implementation for faster dataset curation compared to Chromedriver.
  • Added ThreadPoolExecutor-based parallelism to all scripts for enhanced speed (see the sketch below).
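
The sketch below illustrates the ThreadPoolExecutor pattern used to parallelize the scraping scripts. `scrape_one` is a hypothetical per-URL scraping function and the worker count is an assumption.

```python
# Sketch only: scrape many URLs concurrently, continuing past failures.
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls: list[str], max_workers: int = 16) -> None:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                # Log and continue so one bad page doesn't stop the batch.
                print(f"failed: {futures[future]} ({exc})")
```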

Sumeet

  • Scraped:
    • Articles from Online Khabar and HimalKhabar.
  • Downloaded:
    • CulturaX dataset.
    • English to Nepali translation dataset.
  • Developed code for:
    • Converting Parquet files to CSV.
    • Generating MinHash for deduplication.
    • Cleaning and generating hashes for the downloaded datasets.

Aditya

  • Scraped:
    • Poems from a website.
    • Examiner Data.
  • Collected and verified bad words for the Nepali language with the help of a native speaker.
  • Downloaded and cleaned 4 datasets (50GB combined) from Hugging Face.
  • Separated 2 of these datasets and generated hashes for them.
  • Contact: Sidharth (+91 93457 66974).

Daksh

  • Scraped:
    • Articles from News24Nepal.
    • Data from Nepal Wiki.
  • Cleaned and hashed both datasets.
  • Downloaded:
    • 3 datasets (AllenAI, Statmt, Oscar).
    • A large-scale Nepali text corpus from IEEE; split it into individual files while preserving semantics, then cleaned and hashed them.
  • Developed code for parallelizing scraping using ThreadPoolExecutor for the listed tasks.

Acknowledgements

Siddharth: For verifying the collected bad words.

Rakesh Thakur: For providing valuable information about Nepali datasets.