CovidNewsWebScrapper

A project for web-scraping COVID-19 news and case counts from Worldometer and Wikipedia.

Primary Language: Python

Requirements

Project Structure

ProjectFolder
    ply/*
    crawler.py
    wiki_crawler.py
    data_extractor_1.py
    data_extractor_2.py
    data_extractor_3.py
    data_extractor_timeline.py
    data_extractor_responses.py
    data_extractor_country.py
    data_loader.py
    model.py
    main.py

For ease of use and evaluation, the project is zipped with the structure shown above.

Description of source files.

  1. crawler.py : Parses the country list file and crawls https://www.worldometers.info/coronavirus/ (saved to home.html) and the respective country pages (saved to countries/.html)
  2. wiki_crawler.py : Parses the COVID country list file and crawls https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic (saved to wiki_home.html) and the respective country pages (saved to the data/countries folder)
  3. data_extractor_1.py & data_extractor_3.py : Extract information for Task 2 (Point 1, a to j)
  4. data_extractor_2.py : Extracts information for Task 2 (Point 3)
  5. data_extractor_3.py : Extracts Worldwide Timeseries information
  6. data_loader.py : Loads the extracted data into database
  7. data_extractor_timeline.py : Extracts Wikipedia Covid News
  8. data_extractor_responses.py : Extracts Wikipedia Covid Responses
  9. data_extractor_country.py : Extracts Wikipedia Covid Country News
  10. model.py : Contains class definitions for Database, TimeSeries, etc.
  11. main.py : Contains menu, index creation, and synchronization

Running the program

Installation

Install Wordcloud & NLTK

pip install wordcloud NLTK

Just run main.py

<python3_executable> main.py

Entering commands

There are three menus: the Yesterday Data Menu, the Time Range Menu and the Wikipedia Menu. Users can switch between menus by entering the menu <menu_id> command. In any menu, the help command lists the available options together with an example. To make sure files are cleaned up properly, the user must type exit to leave the program. The menu IDs and an example of switching menus are shown below.

  • Worldometers Yesterday Data -> Menu 1
  • Worldometers Time Series Data -> Menu 2
  • Wikipedia Menu -> Menu 3

Time Range Menu |  Enter menu <menu id> to go menu | Enter help for usage
>> menu 1

Yesterday Data Menu |  Enter menu <menu id> to go menu | Enter help for usage

Command parameters must be separated by '|'. Special keywords identify the query to run. The following short forms have been adopted for the queries.

Commands for Region Wise (Yesterday Data Menu)

  1. tc : Total Cases
  2. ac : Active Cases
  3. td : Total Deaths
  4. tr : Total Recovered
  5. tt : Total Tests
  6. dpm : Deaths per 1 million of population
  7. tpm : Tests per 1 million of population
  8. nc : New Cases
  9. nd : New Deaths
  10. nr : New Recoveries

Command format

<regn_name> | <command>

Command for Time Series

  1. dc : Daily New Cases
  2. ac : Active Cases
  3. dd : Daily Deaths
  4. dr : Daily Recoveries

Command format

<date1> | <date2> | <regn_name> | <command>
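
As an illustration of the '|'-separated format, the time-series command could be split and checked roughly as follows. This is a minimal sketch, not the code in main.py: the function names and the TIME_SERIES_QUERIES set are hypothetical, and the short forms are copied from the list above.

```python
# Hypothetical set of time-series short forms, copied from the list above.
TIME_SERIES_QUERIES = {"dc", "ac", "dd", "dr"}


def split_command(line):
    """Split a '|'-separated command into fields with whitespace trimmed."""
    parts = [part.strip() for part in line.split("|")]
    if any(not part for part in parts):
        raise ValueError(f"empty field in command: {line!r}")
    return parts


def parse_time_series_command(line):
    """Return (date1, date2, region, query) for a time-series command."""
    parts = split_command(line)
    if len(parts) != 4:
        raise ValueError("expected four fields: two dates, a region and a query")
    date1, date2, region, query = parts
    query = query.lower()
    if query not in TIME_SERIES_QUERIES:
        raise ValueError(f"unknown time-series query: {query!r}")
    return date1, date2, region, query


if __name__ == "__main__":
    print(parse_time_series_command("Mar 01, 2020 | Jul 02, 2021 | India | dd"))
```

The comma inside each date is harmless here because only '|' is treated as a separator.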

Command for Wikipedia Series

  1. Q1: Shows worldwide COVID news and responses between two given dates and plots word clouds for both

  2. Q2: Given two non-overlapping time ranges, does the following: plots two word clouds, one for all the common words (ignoring stopwords) and one for only the COVID-related common words; prints the percentage of COVID-related words among the common words (ignoring stopwords); prints the top 20 common words (ignoring stopwords) and COVID-related words.

  3. Q3: Given a country, displays date-range for which news is available.

  4. Q4 : Given a country, and date-range displays news in this date range, and plots word-cloud for the same.

  5. Q5 : Given a country, and a date-range, displays the top 3 closest countries based on the Jaccard Similarity of extracted news.

  6. Q6 : Same as Q5, but considering only covid related words
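
The Jaccard-based ranking used by Q5 and Q6 can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer, the tiny STOPWORDS set, the function names and the sample texts are all hypothetical, and the project may well use NLTK's stopword list and its own COVID vocabulary instead.

```python
import re

# Hypothetical, deliberately tiny stopword list (a stand-in for a real one).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def tokenize(text, vocabulary=None):
    """Return the set of lowercase words in text, minus stopwords.

    If vocabulary is given (e.g. COVID-related words, as in Q6),
    only words from that vocabulary are kept.
    """
    words = {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}
    return words if vocabulary is None else words & set(vocabulary)


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_countries(target, news_by_country, k=3, vocabulary=None):
    """Rank the other countries by Jaccard similarity to the target's news."""
    target_words = tokenize(news_by_country[target], vocabulary)
    scores = [
        (jaccard(target_words, tokenize(text, vocabulary)), name)
        for name, text in news_by_country.items()
        if name != target
    ]
    # Highest similarity first; ties are broken alphabetically by country name.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scores[:k]]


if __name__ == "__main__":
    # Made-up placeholder texts, keyed by placeholder country names.
    news = {
        "A": "lockdown vaccine cases",
        "B": "lockdown vaccine",
        "C": "football match",
        "D": "vaccine cases deaths",
        "E": "cases",
    }
    print(closest_countries("A", news))
```

Passing a vocabulary restricts the comparison to those words only, which corresponds to the Q6 variant.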

Command format

To view the command format, type help in the Wikipedia Menu. Note that in this menu dates must be entered in DD-MM-YYYY format only.

Date Format (Yesterday Data & Time Series Menu)

All dates must be entered in this format only; all other formats are ignored. The first letter of the month must be capitalized. This is the format used on the Worldometers website too.

<First three letters of Month Name> <date in double digits>, <year in 4 digits>

Examples:
Jul 02, 2021 
Mar 01, 2020
Feb 15, 2020
...
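
A strict check for this format can be sketched with the standard library. This is a minimal sketch, not the project's own parser: the function name is hypothetical, and it assumes an English (C) locale so that %b yields names such as Jul.

```python
from datetime import date, datetime

# Format string for dates such as "Jul 02, 2021".
WORLDOMETERS_FORMAT = "%b %d, %Y"


def parse_worldometers_date(text):
    """Return a date for strictly formatted input, or None if it should be ignored."""
    text = text.strip()
    try:
        parsed = datetime.strptime(text, WORLDOMETERS_FORMAT).date()
    except ValueError:
        return None
    # strptime tolerates lowercase month names and single-digit days,
    # so round-trip the value to enforce the exact format.
    if parsed.strftime(WORLDOMETERS_FORMAT) != text:
        return None
    return parsed


if __name__ == "__main__":
    print(parse_worldometers_date("Jul 02, 2021"))
    print(parse_worldometers_date("jul 2, 2021"))
```

The round trip through strftime is what rejects inputs like jul 2, 2021, which strptime alone would accept.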

Date Format (Wikipedia Menu)

Date format in this menu is DD-MM-YYYY only. Example 15-03-2020.

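Tip for remembering the command short forms: each one is built from the first characters of the words in the query name. For example, dd stands for Daily Deaths (first character taken from the first word, second character taken from the second word). If you forget, the help command can be used in any menu.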

Special Merit

Data is preprocessed to build an index on dates, so that time-range-based queries (Task 2 Point 3) are served faster. Index:

Date1 -> [(region, param1, param2, param3...),  (region, param1, param2, param3...) .... ]
Date2 -> [(region, param1, param2, param3...),  (region, param1, param2, param3...) .... ]
...
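
A date-keyed index of this shape could be built and queried as sketched below. This is an illustration only: the function names, the row layout and the sample numbers are hypothetical and do not come from model.py or main.py.

```python
from collections import defaultdict
from datetime import date, datetime

DATE_FORMAT = "%b %d, %Y"  # same format as the Time Series menu dates


def build_date_index(rows):
    """Group rows of (date_string, region, param1, ...) by date.

    Returns {date: [(region, param1, ...), ...]}.
    """
    index = defaultdict(list)
    for date_string, region, *params in rows:
        day = datetime.strptime(date_string, DATE_FORMAT).date()
        index[day].append((region, *params))
    return index


def query_range(index, start, end, region):
    """Return (date, entry) pairs for one region between start and end, inclusive."""
    start_day = datetime.strptime(start, DATE_FORMAT).date()
    end_day = datetime.strptime(end, DATE_FORMAT).date()
    results = []
    for day in sorted(index):
        if start_day <= day <= end_day:
            results.extend((day, entry) for entry in index[day] if entry[0] == region)
    return results


if __name__ == "__main__":
    # Made-up sample rows: (date, region, new_cases, new_deaths).
    rows = [
        ("Mar 01, 2020", "India", 10, 1),
        ("Mar 01, 2020", "Italy", 20, 2),
        ("Mar 02, 2020", "India", 12, 2),
        ("Apr 01, 2020", "India", 30, 3),
    ]
    index = build_date_index(rows)
    print(query_range(index, "Mar 01, 2020", "Mar 31, 2020", "India"))
```

Because the rows are grouped by date once, a range query only touches the dates inside the range rather than rescanning every extracted record.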