Note: This is a markdown file and, for better readability, must be opened with an appropriate viewer. One way to do so is here
Sarthak Agrawal - 2019115003
Methodology
Scrapy is used because it is fast and because its CSS selectors let us target exactly the elements we need on each page.
- Loksatta

  The `loksatta_articles.csv` file does not have a header and has `~` as delimiter. It has the columns title, description, date, content (in this order). The file `all_loksatta_articles.csv` has these headers as the first line and `|` as delimiter, with the addition of URLs for each article as the first column.
- Maharashtra Times

  `maharashtra_times_articles.csv` does not have a header and has the columns title, author, date, highlights, content (in this order) and `|` as delimiter. The file `all_mt_times_articles.csv` has these headers as the first line, with the addition of URLs for each article as the first column.
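The file layouts described above can be read with Python's standard `csv` module. The following is only a minimal sketch: the helper names, the sample rows and the `url` header name are assumptions made for illustration, and the real files may need extra care (for example, raising `csv.field_size_limit` for very long article bodies).

```python
import csv
import io

# Column orders as described above; the constant names are chosen here.
LOKSATTA_COLUMNS = ["title", "description", "date", "content"]
MT_COLUMNS = ["title", "author", "date", "highlights", "content"]


def read_headerless(stream, columns, delimiter):
    """Yield one dict per row of a headerless, delimiter-separated file."""
    reader = csv.reader(stream, delimiter=delimiter)
    for row in reader:
        if not row:  # skip blank lines
            continue
        yield dict(zip(columns, row))


def read_with_header(stream, delimiter="|"):
    """Yield one dict per row, taking field names from the first line."""
    yield from csv.DictReader(stream, delimiter=delimiter)


# Tiny in-memory stand-ins for the real files (made-up rows, not scraped data).
sample_loksatta = io.StringIO("Title A~Desc A~2021-01-01~Body A\n")
sample_all = io.StringIO(
    "url|title|description|date|content\n"  # "url" header name is assumed
    "http://example.com/a|Title A|Desc A|2021-01-01|Body A\n"
)

loksatta_rows = list(read_headerless(sample_loksatta, LOKSATTA_COLUMNS, "~"))
all_rows = list(read_with_header(sample_all))
```

With the real files, the `io.StringIO` objects would be replaced by something like `open("loksatta_articles.csv", newline="", encoding="utf-8")`, and `MT_COLUMNS` with delimiter `|` would be passed for the Maharashtra Times file.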
Instructions to run/reproduce:
- Scrapy and Python are necessary.
- There are two spiders, coded in the files `scrape_url` and `scrape_art`. The former scrapes the URLs of the articles and stores them in the file `urls.txt`. The latter scrapes each article and stores it in `articles.csv`.
- Run the following commands in order (while in the current directory): `scrapy crawl urls_spi`, then `python filter_urls.py`, then `scrapy crawl each_article`.
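If it is more convenient to run the three steps from one script, they could be chained roughly as follows. This is a sketch only: `PIPELINE` and `run_pipeline` are names made up here, and it assumes `scrapy` and `python` are on the `PATH` and that the script is started from the project directory.

```python
import subprocess

# The three steps listed above, in the order they must run.
PIPELINE = [
    ["scrapy", "crawl", "urls_spi"],
    ["python", "filter_urls.py"],
    ["scrapy", "crawl", "each_article"],
]


def run_pipeline(commands=PIPELINE, runner=subprocess.run):
    """Run each command in turn; check=True aborts on the first failure."""
    completed = []
    for command in commands:
        runner(command, check=True)
        completed.append(command)
    return completed
```

Calling `run_pipeline()` runs the whole sequence. Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit code, the article spider is never started if the URL spider or the filter step fails.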
- Stopwords were obtained from Kaggle and CLTK; the latter needed editing. More were added from this repo and manually. LTRC was recommended in a first-page search result, which was exciting, but the file contains only garbage :(
- Font obtained from Lipikaar; another font didn't work.
- Issues referred:
- amueller/word_cloud#70
- amueller/word_cloud#367
- amueller/word_cloud#272
- amueller/word_cloud#562 (finally solved here!)
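As an illustration of the stopword handling described above, the sketch below merges several stopword lists and counts the remaining words. The function names and the sample words are assumptions for this example, not the project's actual code; tokenisation is plain whitespace splitting with punctuation (including the Devanagari danda) stripped from token edges.

```python
from collections import Counter

# Punctuation to strip from token edges, including the Devanagari danda.
PUNCTUATION = ".,!?;:\"'()[]-।॥"


def merge_stopwords(*sources):
    """Union several stopword lists, dropping blanks and stray whitespace."""
    merged = set()
    for source in sources:
        merged.update(word.strip() for word in source if word.strip())
    return merged


def word_frequencies(text, stopwords):
    """Count whitespace-separated tokens that are not stopwords."""
    tokens = (token.strip(PUNCTUATION) for token in text.split())
    return Counter(t for t in tokens if t and t not in stopwords)


# Made-up example lists standing in for the Kaggle, CLTK and manual sources.
stopwords = merge_stopwords(["आणि", "आहे "], ["आहे", "", "या"])
frequencies = word_frequencies("पुणे आणि मुंबई. पुणे आहे।", stopwords)
```

A mapping like `frequencies` can then be handed to the word_cloud library, for example via `WordCloud(font_path=...).generate_from_frequencies(frequencies)`, with `font_path` pointing at a font that covers Devanagari.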
Tried [Contextualized Topic Modelling](https://github.com/MilaNLProc/contextualized-topic-models). Zero-shot modelling gave garbage output for the data, and all of it in English at that. CombinedTM gave relevant output, but again all in English; it is stored in `CTMtopics.txt`. Both were run on only 20 articles, which are stored in the `temp_files` directory.