This project scrapes articles from the website of a Malayalam newspaper (Janmabhumi) to build a corpus of news articles. It also includes a set of queries, with the corresponding ground-truth answers retrieved using a combination of BM25 and TF-IDF. The dataset can be useful for building tools such as stemmers, stopword removers, and lemmatizers.
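To illustrate how ground-truth answers might be retrieved by combining BM25 and TF-IDF, here is a minimal sketch in plain Python. It is not the code used in this repo: the whitespace tokenizer, the parameter values (`k1`, `b`, `weight`), the max-normalization, and the weighted-sum combination are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; real Malayalam text
    # would likely need a proper tokenizer.
    return text.lower().split()


def document_frequencies(docs):
    # Number of documents in which each term appears at least once.
    return Counter(term for doc in docs for term in set(doc))


def tfidf_scores(query, docs):
    n = len(docs)
    df = document_frequencies(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term in tf:
                idf = math.log((1 + n) / (1 + df[term])) + 1  # smoothed IDF
                score += (tf[term] / len(doc)) * idf
        scores.append(score)
    return scores


def bm25_scores(query, docs, k1=1.5, b=0.75):
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n
    df = document_frequencies(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            f = tf.get(term, 0)
            if f == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


def normalize(scores):
    # Scale scores to [0, 1] so the two methods are comparable.
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]


def combined_ranking(query_text, corpus, weight=0.5):
    # Assumption: the two scores are combined as a weighted sum.
    if not corpus:
        return []
    docs = [tokenize(d) for d in corpus]
    query = tokenize(query_text)
    bm25 = normalize(bm25_scores(query, docs))
    tfidf = normalize(tfidf_scores(query, docs))
    combined = [weight * x + (1 - weight) * y for x, y in zip(bm25, tfidf)]
    # Return document indices, best match first.
    return sorted(range(len(corpus)), key=lambda i: combined[i], reverse=True)


corpus = [
    "kerala flood relief news",
    "cricket match result",
    "kerala election news update",
]
print(combined_ranking("kerala flood", corpus))
```

The top-ranked indices for each query could then be stored as that query's ground-truth answers; how many to keep per query is a separate choice not covered here.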
## Note
This repo is obsolete: scraping no longer works on the site mentioned above, and a rework is required.
Download the dataset directly from Dropbox.
Open a terminal (Ctrl+Alt+T) and run the following commands:
git clone https://github.com/ABHISHEKVALSAN/Malayalam-Newspaper-Article-Dataset
cd Malayalam-Newspaper-Article-Dataset
mkdir DataSet
pip install -r requirements.txt
python3 MalayalamScraping.py
- After running the last command, you will see files being created in the DataSet directory.
- Many URLs produce no file. This is expected.
- The scraping is website-specific and hence does not work for other newspaper sites.
- Python 3
- pip installed
- Internet connection
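
The notes above mention that many URLs yield no file. As a rough sketch of how a scraping loop can tolerate that, the following skips any URL that fails to load instead of stopping. It is not the logic of `MalayalamScraping.py`: the function names, the `article_<index>.txt` file naming, and the injectable `fetcher` argument are assumptions made for illustration.

```python
import urllib.request
from pathlib import Path


def fetch(url, timeout=10):
    """Return the page body as text, or None if it cannot be retrieved."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs.
        return None


def save_articles(urls, out_dir="DataSet", fetcher=fetch):
    """Save each retrievable page to out_dir and report the skipped URLs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved, skipped = 0, []
    for i, url in enumerate(urls):
        text = fetcher(url)
        if not text:
            # Missing or empty page: record it and move on.
            skipped.append(url)
            continue
        # Hypothetical naming scheme: one text file per URL index.
        (out / f"article_{i}.txt").write_text(text, encoding="utf-8")
        saved += 1
    return saved, skipped
```

Passing the fetcher as an argument makes it possible to try the loop with a stub function before pointing it at a live site.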
Contact me at the email address given below for assistance, or raise an issue.
Email : abhiavk@iitk.ac.in
A similar repo with a Telugu dataset can be found here.