
The project scrapes articles from a Malayalam newspaper website to create a corpus. A set of queries is created and the corresponding ground-truth answers are retrieved. The result can serve as a dataset for evaluating future tools such as a Malayalam stemmer, stopword remover, or lemmatizer.


Malayalam-Newspaper-Article-Dataset

This project scrapes articles from the website of a Malayalam newspaper (Janmabhumi) to create a corpus of news articles. A set of queries is also created, and the corresponding ground-truth answers are retrieved using a combination of the BM25 and TF-IDF methods. The dataset can be useful for building tools such as stemmers, stopword removers, and lemmatizers.
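The ground-truth retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the README does not say how the BM25 and TF-IDF scores are combined, so averaging the max-normalized scores is an assumption, as are the function names, the parameter values `k1` and `b`, and the English placeholder corpus.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization only; no stemming or stopword removal.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def tfidf_scores(query, docs):
    """Sum of (relative term frequency * idf) over the query terms."""
    n = len(docs)
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append(
            sum((tf[t] / len(d)) * math.log(n / df[t]) for t in query if t in tf)
        )
    return scores


def normalize(scores):
    # Scale scores to [0, 1] so the two methods are comparable.
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]


def rank(query_text, corpus):
    """Return document indices ordered by the averaged score (assumed combination)."""
    docs = [tokenize(t) for t in corpus]
    query = tokenize(query_text)
    bm25 = normalize(bm25_scores(query, docs))
    tfidf = normalize(tfidf_scores(query, docs))
    combined = [(x + y) / 2 for x, y in zip(bm25, tfidf)]
    return sorted(range(len(corpus)), key=lambda i: combined[i], reverse=True)


# Placeholder corpus; the real dataset contains Malayalam news articles.
corpus = [
    "kerala rain flood news",
    "cricket match result today",
    "kerala election result news",
]
print(rank("kerala flood", corpus))
```

The top-ranked indices for each query would then be treated as that query's ground-truth answers.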

Note

This repo is obsolete, and scraping no longer works on the site mentioned above. Rework is required.

DATASET

Download the dataset directly from Dropbox

OR

Execution

Open a terminal (Ctrl+Alt+T) and execute the following commands:

git clone https://github.com/ABHISHEKVALSAN/Malayalam-Newspaper-Article-Dataset
cd Malayalam-Newspaper-Article-Dataset
mkdir DataSet
pip install -r requirements.txt
python3 MalayalamScraping.py

PS

  1. After running the last command, you'll see files being created in the DataSet directory.
  2. Many URLs will have missing files. This is expected.
  3. The scraping is website-specific and hence does not work for other newspaper sites.
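To illustrate why the scraping is tied to one site, here is a small sketch using only Python's standard library. It is not the code in `MalayalamScraping.py`: the class name `article-body`, the helper names, and both sample pages are hypothetical. The extractor only finds text when the page uses the exact markup it was written for.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects text inside <div class="article-body"> (a hypothetical selector)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the target div; 0 means outside
        self.chunks = []  # text fragments found inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("class") == "article-body":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_article(html):
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A page that matches the assumed layout.
sample = (
    '<html><body><div class="menu">Home</div>'
    '<div class="article-body"><p>First paragraph.</p>'
    "<p>Second paragraph.</p></div></body></html>"
)
# A page from a site with different markup.
other = "<html><body><article><p>Different markup.</p></article></body></html>"

print(extract_article(sample))       # First paragraph. Second paragraph.
print(repr(extract_article(other)))  # ''
```

An empty result for a page like `other` is also one plausible reason a URL might produce no file; adapting the scraper to another newspaper would mean rewriting the selector logic for that site's markup.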

Project Requirements

  1. Python 3
  2. pip
  3. Internet connection

Contact me at the email address given below for assistance, or raise an issue.

Email : abhiavk@iitk.ac.in

Related Works

A similar repo with a Telugu dataset can be found here.