Malayalam-Newspaper-Article-Dataset

This project scrapes articles from the website of a Malayalam newspaper (Janmabhumi) to create a corpus of news articles. A set of queries is also created, and the corresponding ground-truth answers are retrieved using a combination of the BM25 and TF-IDF methods. The dataset can be useful for building tools such as stemmers, stopword removers, and lemmatizers.
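The way BM25 and TF-IDF are combined is not described here, so the following is only a minimal sketch of one plausible approach: score every document with both methods, min-max normalise each score list, and rank by the average. The function names, the whitespace tokenizer, the `k1` and `b` defaults, and the averaging step are all assumptions for illustration, not the code used to build the dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization, with no stemming or stopword removal.
    return text.split()


def document_frequencies(tokenized_docs):
    # Number of documents in which each term appears at least once.
    return Counter(term for doc in tokenized_docs for term in set(doc))


def tfidf_scores(query, docs):
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    df = document_frequencies(tokenized)
    n = len(docs)
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n / df[term]) + 1.0
                score += (tf[term] / len(doc)) * idf
        scores.append(score)
    return scores


def bm25_scores(query, docs, k1=1.5, b=0.75):
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    df = document_frequencies(tokenized)
    n = len(docs)
    avgdl = sum(len(doc) for doc in tokenized) / n
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # The "+ 1.0" inside the log keeps the IDF non-negative for common terms.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


def normalize(scores):
    # Min-max scaling to [0, 1]; all-equal scores map to 0.
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank(query, docs):
    # Assumption: the two methods are combined by averaging normalised scores.
    tfidf = normalize(tfidf_scores(query, docs))
    bm25 = normalize(bm25_scores(query, docs))
    combined = [(t + m) / 2 for t, m in zip(tfidf, bm25)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
```

Calling `rank(query, docs)` returns document indices from best to worst match; the top few could then be taken as ground-truth answers for that query.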

## Note

This repo is obsolete, and scraping no longer works on the site mentioned above. Rework is required.

DATASET

Download the dataset directly from Dropbox

OR

Execution

Open a terminal (Ctrl+Alt+T) and run the following commands:

git clone https://github.com/ABHISHEKVALSAN/Malayalam-Newspaper-Article-Dataset
cd Malayalam-Newspaper-Article-Dataset
mkdir DataSet
pip install -r requirements.txt
python3 MalayalamScraping.py

PS

After running the last command, you'll see files being created in the DataSet directory.
Many URLs have missing files; this is expected.
The scraping is website-specific and hence does not work for other newspaper sites.
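The notes above can be illustrated with a minimal sketch of a download loop that skips URLs that fail to load instead of stopping. This is not the code in MalayalamScraping.py: `fetch`, `save_articles`, and the numbered file naming are hypothetical, and the sketch stores the raw page text rather than doing any site-specific article extraction.

```python
import os
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    # Download one page and decode it as UTF-8 (undecodable bytes are replaced).
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def save_articles(urls, out_dir="DataSet", fetcher=fetch):
    # Hypothetical helper: writes one file per URL that loads successfully.
    os.makedirs(out_dir, exist_ok=True)
    saved, skipped = [], []
    for i, url in enumerate(urls):
        try:
            text = fetcher(url)
        except (urllib.error.URLError, TimeoutError):
            # Missing or unreachable pages are expected; record them and move on.
            skipped.append(url)
            continue
        path = os.path.join(out_dir, f"{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        saved.append(path)
    return saved, skipped
```

Passing the fetch function as a parameter keeps the loop testable without network access, and the returned `skipped` list shows which URLs produced no file.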

Project Requirements

Python
pip
Internet connection

Contact me at the email given below for assistance, or raise an issue.

Email : abhiavk@iitk.ac.in

Related Works

A similar repo with a Telugu dataset can be found here.

veenasnair18/Malayalam-Newspaper-Article-Dataset
