Project Structure
ProjectFolder
ply/*
crawler.py
wiki_crawler.py
data_extractor_1.py
data_extractor_2.py
data_extractor_3.py
data_extractor_timeline.py
data_extractor_responses.py
data_extractor_country.py
data_loader.py
model.py
main.py
For ease of use and evaluation the project is zipped as shown above.
- crawler.py : Parses the country list file and crawls https://www.worldometers.info/coronavirus/ (saved to home.html) and the respective country pages (saved to countries/.html)
- wiki_crawler.py : Parses the covid country list file and crawls https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic (saved to wiki_home.html) and the respective country pages (saved to the data/countries folder)
- data_extractor_1.py & data_extractor_3.py : Extract information for Task 2 (Point 1, a to j)
- data_extractor_2.py : Extracts information for Task 2 (Point 3)
- data_extractor_3.py : Extracts Worldwide Timeseries information
- data_loader.py : Loads the extracted data into database
- data_extractor_timeline.py : Extracts Wikipedia Covid News
- data_extractor_responses.py : Extracts Wikipedia Covid Responses
- data_extractor_country.py : Extracts Wikipedia Covid Country News
- model.py : Contains class definitions for Database, TimeSeries etc.
- main.py : Contains menu, index creation, and synchronization
Install Wordcloud & NLTK
pip install wordcloud NLTK
Just run main.py
<python3_executable> main.py
There are three menus: Yesterday Data Menu, Time Range Menu and Wikipedia Menu. Users can switch between menus by entering the menu <menu_id> command. In any menu, the help command shows the available options along with an example. For proper cleanup of files, the user must type exit to leave the program. An example of switching menus is shown below.
- Worldometers Yesterday Data -> Menu 1
- Worldometers TimeSeries Data -> Menu 2
- Wikipedia Menu -> Menu 3
Time Range Menu | Enter menu <menu id> to go menu | Enter help for usage
>> menu 1
Yesterday Data Menu | Enter menu <menu id> to go menu | Enter help for usage
Command parameters must be separated by '|'. Short keywords identify the query to run. The following short forms are used:
- tc : Total Cases
- ac : Active Cases
- td : Total Deaths
- tr : Total Recovered
- tt : Total Tests
- dpm : Deaths per 1 million of population
- tpm : Tests per 1 million of population
- nc : New Cases
- nd : New Deaths
- nr : New Recoveries
<regn_name> | <command>
- dc : Daily New Cases
- ac : Active Cases
- dd : Daily Deaths
- dr : Daily Recoveries
<date1> | <date2> | <regn_name> | <command>
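The two command formats above could be parsed along the following lines. This is a minimal sketch, not the project's actual parser: the function name `parse_command` and the two dictionaries are hypothetical, and only the short forms and the '|' separator are taken from this README.

```python
# Short forms copied from the lists above; the dictionary names are hypothetical.
YESTERDAY_QUERIES = {
    "tc": "Total Cases", "ac": "Active Cases", "td": "Total Deaths",
    "tr": "Total Recovered", "tt": "Total Tests",
    "dpm": "Deaths per 1 million of population",
    "tpm": "Tests per 1 million of population",
    "nc": "New Cases", "nd": "New Deaths", "nr": "New Recoveries",
}
TIME_RANGE_QUERIES = {
    "dc": "Daily New Cases", "ac": "Active Cases",
    "dd": "Daily Deaths", "dr": "Daily Recoveries",
}


def parse_command(line, expected_fields, queries):
    """Split a '|'-separated command and validate its trailing keyword.

    Returns (params, keyword), where params are the fields before the keyword.
    """
    fields = [part.strip() for part in line.split("|")]
    if len(fields) != expected_fields:
        raise ValueError(f"expected {expected_fields} fields, got {len(fields)}")
    *params, keyword = fields
    if keyword not in queries:
        raise ValueError(f"unknown query keyword: {keyword!r}")
    return params, keyword


# Yesterday Data Menu: <regn_name> | <command>
print(parse_command("India | tc", 2, YESTERDAY_QUERIES))
# Time Range Menu: <date1> | <date2> | region | <command>
print(parse_command("Mar 01, 2020 | Jul 02, 2021 | India | dd", 4, TIME_RANGE_QUERIES))
```

Stripping whitespace around each field means the command works whether or not the user puts spaces around '|'.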
- Q1: Shows worldwide Covid News & Responses between 2 given dates and plots word-clouds for both
- Q2: Given two non-overlapping time ranges, do the following: Plots two different word clouds for all the common words (ignoring stopwords) and only covid related common words. Prints the percentage of covid related words in common words (ignore stopwords). Prints the top-20 common words (ignore stopwords) and covid related words.
- Q3: Given a country, displays date-range for which news is available.
- Q4: Given a country, and date-range displays news in this date range, and plots word-cloud for the same.
- Q5: Given a country, and a date-range, displays the top 3 closest countries based on the Jaccard Similarity of extracted news.
- Q6: Same as Q5, but considering only covid related words
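For Q5 and Q6, the Jaccard similarity of two word sets A and B is |A ∩ B| / |A ∪ B|. The sketch below shows one way the top 3 closest countries could be ranked. It is an illustration under assumptions, not the project's code: the names `tokenize`, `jaccard` and `closest_countries`, the simple letters-only tokenizer, and the sample data are all hypothetical. Passing a `vocabulary` restricts the comparison to a given word set, which is how a Q6-style "covid related words only" variant might work.

```python
import re


def tokenize(text, vocabulary=None):
    """Lower-case the text and return its set of words.

    If vocabulary is given, keep only words that appear in it.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words if vocabulary is None else words & set(vocabulary)


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_countries(target, news_by_country, top_n=3, vocabulary=None):
    """Rank the other countries by Jaccard similarity to the target's news."""
    target_words = tokenize(news_by_country[target], vocabulary)
    scores = [
        (name, jaccard(target_words, tokenize(text, vocabulary)))
        for name, text in news_by_country.items()
        if name != target
    ]
    # Highest similarity first; ties are broken alphabetically by name.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_n]


# Made-up sample data, for illustration only.
sample = {
    "A": "lockdown vaccine cases",
    "B": "lockdown vaccine travel",
    "C": "football",
    "D": "Lockdown cases vaccine",
}
print(closest_countries("A", sample))
```

In the real menu the news text would first be limited to the requested date range before the word sets are built.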
To view the command format, type help in the Wikipedia Menu. Note that dates in this menu must be entered in DD-MM-YYYY format only.
All dates must be entered in this format only; all other formats are ignored. The first letter of the month must be capitalized. This is the format used on the Worldometers website too.
<First three letters of Month Name> <date in double digits>, <year in 4 digits>
Examples:
Jul 02, 2021
Mar 01, 2020
Feb 15, 2020
...
Date format in this menu is DD-MM-YYYY only. Example 15-03-2020.
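Both date formats described above can be checked with Python's `datetime.strptime`. The sketch below is an assumption about how strict validation might be done, not the project's implementation: `parse_strict` and the two constants are hypothetical names. Because `strptime` also accepts lower-case month names and single-digit days, the parsed value is formatted back and compared with the input, so only the exact documented form passes. Month abbreviations depend on the locale; an English (or default C) locale is assumed.

```python
from datetime import datetime

WORLDOMETERS_FORMAT = "%b %d, %Y"  # e.g. Jul 02, 2021
WIKIPEDIA_FORMAT = "%d-%m-%Y"      # e.g. 15-03-2020


def parse_strict(text, fmt):
    """Return a date if text matches fmt exactly, otherwise None."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return None
    # Round-trip check: rejects variants such as 'jul 2, 2021' that
    # strptime tolerates but the menus do not accept.
    if parsed.strftime(fmt) != text:
        return None
    return parsed.date()


print(parse_strict("Jul 02, 2021", WORLDOMETERS_FORMAT))
print(parse_strict("jul 2, 2021", WORLDOMETERS_FORMAT))
print(parse_strict("15-03-2020", WIKIPEDIA_FORMAT))
```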
Tip for remembering the short forms: for example, in dd (Daily Deaths) the first character is taken from the first word and the second character from the second word. If you forget, the help command can be used in any menu.
Data is preprocessed to build an index on dates for faster serving of time-range-based queries (Task 2 Point 3). Index:
Date1 -> [(region, param1, param2, param3...), (region, param1, param2, param3...) .... ]
Date2 -> [(region, param1, param2, param3...), (region, param1, param2, param3...) .... ]
...
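An index of this shape can be held in a dictionary keyed by date. The sketch below is a minimal illustration, assuming rows arrive as `(date, region, param1, param2, ...)` tuples; `build_index`, `query_range` and the sample rows are hypothetical and not taken from the project.

```python
from collections import defaultdict
from datetime import date


def build_index(rows):
    """Group rows of (day, region, *params) into day -> [(region, *params), ...]."""
    index = defaultdict(list)
    for day, region, *params in rows:
        index[day].append((region, *params))
    return index


def query_range(index, start, end, region):
    """Return (day, params) pairs for one region with start <= day <= end."""
    results = []
    for day in sorted(index):
        if start <= day <= end:
            for entry in index[day]:
                if entry[0] == region:
                    results.append((day, entry[1:]))
    return results


# Made-up rows, for illustration only: (day, region, new cases, new deaths).
rows = [
    (date(2020, 3, 1), "India", 3, 0),
    (date(2020, 3, 1), "Italy", 566, 5),
    (date(2020, 3, 2), "India", 2, 0),
    (date(2020, 3, 5), "India", 1, 0),
]
index = build_index(rows)
print(query_range(index, date(2020, 3, 1), date(2020, 3, 2), "India"))
```

With the data grouped by date, a time range query only touches the dates inside the range instead of scanning every row.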