Note: This is a markdown file and, for better readability, must be opened with an appropriate viewer. One way to do so is here

RMHS Assignment 2 submission

Sarthak Agrawal - 2019115003

Articles archive

Methodology

The technology used is Scrapy, because it is extremely fast and allows both precise and accurate extraction, since CSS selectors can be used to target the desired elements.

  • Loksatta

    The loksatta_articles.csv file has no header row and uses ~ as the delimiter. Its columns are title, description, date, and content (in that order). The file all_loksatta_articles.csv has these headers as its first line and uses | as the delimiter, with the URL of each article added as the first column.

  • Maharashtra times

    maharashtra_times_articles.csv has no header row, uses | as the delimiter, and has the columns title, author, date, highlights, and content (in that order). The file all_mt_times_articles.csv has these headers as its first line, with the URL of each article added as the first column.
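
The headerless formats above can be read with Python's csv module, for example (a minimal sketch; the column lists are taken from the descriptions above, and the sample row is made up):

```python
import csv
import io

# Column layouts described above (files without header rows)
LOKSATTA_COLS = ["title", "description", "date", "content"]
MT_COLS = ["title", "author", "date", "highlights", "content"]


def read_rows(text, delimiter, columns):
    """Parse delimiter-separated rows without a header into dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(columns, row)) for row in reader]


# Example with a made-up Loksatta-style row (~-delimited, no header)
sample = "Some title~A short description~2021-01-01~Full article text"
rows = read_rows(sample, "~", LOKSATTA_COLS)
print(rows[0]["date"])  # → 2021-01-01
```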

Instructions to run/reproduce:

  • Scrapy and Python are required

  • There are two spiders, defined in the files scrape_url and scrape_art. The former scrapes the URLs of the articles and stores them in 'urls.txt'; the latter scrapes each article and stores it in 'articles.csv'

  • Run (while in current directory)

    scrapy crawl urls_spi
    python filter_urls.py
    scrapy crawl each_article
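
The actual logic of filter_urls.py is not shown here; a hypothetical sketch of what such an intermediate step might do is to deduplicate the scraped URLs and keep only article links (the "/article/" pattern below is an assumption, not the script's real filter):

```python
# Hypothetical sketch of a filter_urls.py-style step. The real script's
# filtering logic is not shown in this document, so the "/article/"
# substring check is an illustrative assumption.
def filter_urls(urls):
    """Deduplicate URLs (preserving order) and keep only article links."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if "/article/" in url and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


if __name__ == "__main__":
    # Rewrite urls.txt in place with the filtered list
    with open("urls.txt") as f:
        urls = filter_urls(f)
    with open("urls.txt", "w") as f:
        f.write("\n".join(urls) + "\n")
```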
    

Wordcloud

Topic Modelling

Tried [Contextualized Topic Modelling](https://github.com/MilaNLProc/contextualized-topic-models). Zero-shot modelling produced garbage output for the data (and entirely in English, at that), while CombinedTM produced relevant output, though again all in English; it is stored in CTMtopics.txt. Both were run on only 20 articles, which are stored in the temp_files directory.