Note: This is a markdown file and, for easier reading, must be opened with an appropriate viewer. One way to do so is here

RMHS Assignment 2 submission

Sarthak Agrawal - 2019115003

Articles archive

Methodology

Scrapy was used because it is fast, and because its CSS selectors allow the required elements to be targeted precisely.

Loksatta

The loksatta_articles.csv file has no header row and uses ~ as its delimiter. Its columns are title, description, date, content (in this order). The file all_loksatta_articles.csv has these headers as its first line and uses | as its delimiter, with the URL of each article added as the first column.

Maharashtra Times

maharashtra_times_articles.csv has no header row and uses | as its delimiter. Its columns are title, author, date, highlights, content (in this order). The file all_mt_times_articles.csv has these headers as its first line, with the URL of each article added as the first column.
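
The file layouts described above can be read with Python's standard csv module. The following is a minimal sketch, not part of the submission: the function name `read_articles`, the header name `url`, and the sample rows are assumptions made for illustration, and the real files may need extra care if article text contains the delimiter or line breaks.

```python
import csv
import io

# Column orders as described above; header-less files need them supplied.
LOKSATTA_COLUMNS = ["title", "description", "date", "content"]
MT_COLUMNS = ["title", "author", "date", "highlights", "content"]


def read_articles(handle, delimiter, columns=None):
    """Return a list of dicts, one per article.

    columns=None means the first row is a header (the all_* files);
    otherwise the given names are applied to a header-less file.
    """
    reader = csv.DictReader(handle, fieldnames=columns, delimiter=delimiter)
    return list(reader)


# Tiny in-memory stand-ins for the real files (made-up rows, assumed "url" header).
headerless = io.StringIO("Some title~Some description~2021-03-01~Body text\n")
with_header = io.StringIO(
    "url|title|description|date|content\n"
    "https://example.com/a|Some title|Some description|2021-03-01|Body text\n"
)

rows = read_articles(headerless, "~", LOKSATTA_COLUMNS)
rows_all = read_articles(with_header, "|")
print(rows[0]["title"], rows_all[0]["url"])
```

For the files on disk, the handle would come from something like `open("loksatta_articles.csv", newline="", encoding="utf-8")`, and `MT_COLUMNS` with `"|"` would be passed for maharashtra_times_articles.csv.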

Instructions to run/reproduce:

Scrapy and Python are required.
There are two spiders, defined in the files scrape_url and scrape_art. The first scrapes the URLs of the articles and stores them in the file 'urls.txt'. The second scrapes each article and stores the result in 'articles.csv'.

Run (while in current directory)

scrapy crawl urls_spi
python filter_urls.py
scrapy crawl each_article
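
The three commands above can also be chained from a small Python script so that a failing step stops the run. This is only a sketch: `STEPS` and `run_pipeline` are hypothetical names, and using `sys.executable` instead of a bare `python` is an assumption made so the filter step runs under the same interpreter.

```python
import subprocess
import sys

# The three steps listed above, in the same order.
STEPS = [
    ["scrapy", "crawl", "urls_spi"],
    [sys.executable, "filter_urls.py"],
    ["scrapy", "crawl", "each_article"],
]


def run_pipeline(steps=STEPS, run=subprocess.run):
    """Run each step in order; check=True raises on the first non-zero exit."""
    for command in steps:
        run(command, check=True)
```

Calling `run_pipeline()` from the same directory as the commands above would then run the URL spider, the filter script, and the article spider in turn.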

Wordcloud

Stopwords were obtained from Kaggle and CLTK; the latter needed editing. More were added from this repo and by hand. I was excited to see LTRC recommended on the first page of results, but the file contains only garbage :(
The font was obtained from lipikaar; another font didn't work.
Issues referred to:
- amueller/word_cloud#70
- amueller/word_cloud#367
- amueller/word_cloud#272
- amueller/word_cloud#562 (finally solved here!)
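
As an illustration of how the stopword lists and the font fit together, here is a hedged sketch. It does not reproduce the fix from the issues listed above; it assumes one possible approach, namely passing a Devanagari-capable font through `font_path` and a Devanagari-only `regexp` to `WordCloud`. The names `merge_stopwords`, `tokens`, `build_wordcloud`, and `DEVANAGARI_WORD` are hypothetical.

```python
import re

# Runs of characters from the Unicode Devanagari block, leaving out the
# danda and double danda (U+0964, U+0965) so punctuation is not part of a word.
# Vowel signs and the virama stay attached to their consonants.
DEVANAGARI_WORD = r"[\u0900-\u0963\u0966-\u097F]+"


def merge_stopwords(*word_lists):
    """Union several stopword lists, dropping blanks and surrounding whitespace."""
    merged = set()
    for words in word_lists:
        merged.update(w.strip() for w in words if w.strip())
    return merged


def tokens(text, stopwords):
    """Split text into Devanagari words and drop the stopwords."""
    return [w for w in re.findall(DEVANAGARI_WORD, text) if w not in stopwords]


def build_wordcloud(text, stopwords, font_path):
    """Render a cloud; needs the third-party wordcloud package and a Devanagari font."""
    from wordcloud import WordCloud  # imported lazily so the helpers above work without it

    cloud = WordCloud(font_path=font_path, stopwords=stopwords, regexp=DEVANAGARI_WORD)
    return cloud.generate(text)


# Made-up stopword lists standing in for the Kaggle, CLTK and manual sources.
stop = merge_stopwords(["आणि", " आहे "], ["आहे", ""])
print(tokens("महाराष्ट्र आणि मुंबई आहे", stop))
```

The object returned by `build_wordcloud` can be saved with its `to_file` method; the font path passed in has to point at a font file that actually contains Devanagari glyphs.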

Topic Modelling

Tried [Contextualized Topic Modelling](https://github.com/MilaNLProc/contextualized-topic-models). Zero-shot modelling gave garbage output for the data, and all of it in English. CombinedTM gave relevant output, but again entirely in English; it is stored in CTMtopics.txt. Both were run on only 20 articles, which are stored in the temp_files directory.
