A scraping utility that retrieves high-resolution historical and/or forecast weather data for major cities around the world.
Historical weather data is retrieved from Time and Date, enabling collection of two weeks' worth of hourly weather data.
Forecast weather data is retrieved from BBC Weather, likewise enabling collection of up to two weeks' worth of hourly data.
- Python 3 or higher
- Selenium Web Driver **
- Pandas
- Requests
- Beautiful Soup
- MySQL Connector
**NOTE**
This scraping utility uses a headless configuration of Firefox via Selenium, which requires a compatible webdriver to interface with the chosen browser, Firefox. Mozilla geckodriver must be installed before the examples below can be run, and it must be on your PATH (e.g., place it in /usr/bin or /usr/local/bin).
Failure to observe this step produces the error: `selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH`
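Not part of the utility itself, but a quick way to verify this requirement before launching Selenium is a `shutil.which` pre-flight check (a sketch; `geckodriver_on_path` is a hypothetical helper name, not a function from this repository):

```python
import shutil

def geckodriver_on_path() -> bool:
    """Return True if the geckodriver executable is findable on PATH.

    Selenium raises WebDriverException at driver startup otherwise,
    so a pre-flight check like this gives a clearer error message.
    """
    return shutil.which("geckodriver") is not None

# Example pre-flight check before launching Selenium:
#     if not geckodriver_on_path():
#         raise SystemExit("geckodriver not found on PATH; see the NOTE above")
ready = geckodriver_on_path()
```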
```
git clone https://github.com/tiguere/ITC-Scrape.git
virtualenv ITC-Scrape
source ITC-Scrape/bin/activate
cd ITC-Scrape
pip install -r requirements.txt
```
This scraping utility can be controlled via the following command-line arguments:
- `--days`, with a default of `14`
- `--search_type`, with a default of both `'forecast'` and `'historical'`
- `--filename`, with a default of `ITC-Scrape/Cities/city_list.xlsx`
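The actual parser lives in the project's source; a minimal `argparse` sketch consistent with the defaults above might look like the following (the `"both"` default value and the exact option choices are assumptions, not taken from the repository):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Defaults mirror those documented above; the real parser in the
    # repository may differ in details.
    parser = argparse.ArgumentParser(description="ITC-Scrape weather scraper")
    parser.add_argument("--days", type=int, default=14,
                        help="number of days of data to collect")
    parser.add_argument("--search_type", default="both",
                        choices=["forecast", "historical", "both"],
                        help="which kind of weather data to scrape")
    parser.add_argument("--filename", default="ITC-Scrape/Cities/city_list.xlsx",
                        help="Excel file listing cities to scrape")
    return parser

# Parse an example command line (equivalent to the first usage example below).
args = build_parser().parse_args(["--days", "2", "--search_type", "historical"])
```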
- For historical weather data spanning two days:
```
python3 main.py --days=2 --search_type="historical"
```
- For both historical and forecast weather data spanning 14 days, from a file of your choice:
```
python3 main.py --filename="path/to/file"
```
**NOTE**
The file passed via the `--filename` argument must:
- be stored in the `Cities` directory
- contain a `city` column and a `country` column as headers in the first row
All collected weather data is output into the database, in the table corresponding to the arguments passed in via the CLI.
**NOTE**
The file path `ITC-Scrape/Cities/city_list.xlsx` is set in the `cfg.FILENAME` variable in `config.py`.
The installation of this scraping utility provides a relational database which includes four tables:
- Locations
- Historical
- Forecasts
- Pollution
Locations:
Id: int NOT NULL AUTO_INCREMENT, PRIMARY KEY
Name: varchar(255), Location name
BBC_Id: varchar(10) NOT NULL, Location number in the url
Historical:
Id: int, NOT NULL, AUTO_INCREMENT, PRIMARY KEY
Scrape_Date: date, Date of the scrape
Date: date, The day the scraped data refers to
Hour: int, The hour within that day
Temperature_C: int, Temperature in Celsius
Weather: varchar(30), General description of the day's conditions
Wind_Speed_Kph: int, Wind speed in kilometers per hour
Percent_Humidity: int, Relative humidity in percent
Pressure_Mb: int, Air pressure in millibars
Visibility_Km: int, Visibility in kilometers
Location_Id: int, FOREIGN KEY, REFERENCES (Locations.Id)
Forecasts:
Id: int, NOT NULL, AUTO_INCREMENT, PRIMARY KEY
Scrape_Date: date, Date of the scrape
Date: date, The day the forecast refers to
Hour: int, The hour within that day
Temperature_C: int, Temperature in Celsius
Chance_of_Rain: int, Chance of precipitation in percent
Wind_Speed_Kph: int, Wind speed in kilometers per hour
Percent_Humidity: int, Relative humidity in percent
Pressure_Mb: int, Air pressure in millibars
Feels_Like_C: int, The "feels like" temperature in Celsius
Location_Id: int, FOREIGN KEY, REFERENCES (Locations.Id)
Pollution:
Location_Id: int, NOT NULL, FOREIGN KEY REFERENCES (Locations.Id)
Date: date, Date of the scrape
Time: int, The hour within that day
CO: float, Concentration of CO (carbon monoxide), μg/m3
NO: float, Concentration of NO (nitrogen monoxide), μg/m3
NO2: float, Concentration of NO2 (nitrogen dioxide), μg/m3
O3: float, Concentration of O3 (ozone), μg/m3
SO2: float, Concentration of SO2 (sulphur dioxide), μg/m3
NH3: float, Concentration of NH3 (ammonia), μg/m3
PM2_5: float, Concentration of PM2.5 (fine particulate matter), μg/m3
PM10: float, Concentration of PM10 (coarse particulate matter), μg/m3
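The schema above targets MySQL (note the `AUTO_INCREMENT` columns and the MySQL Connector requirement). As an illustration only, a subset of it can be sketched with Python's built-in `sqlite3`, where an `INTEGER PRIMARY KEY` auto-increments via the rowid; the inserted `BBC_Id` value is just an example, not a real mapping from this project:

```python
import sqlite3

# In-memory database for illustration; the real project uses MySQL via
# mysql-connector, where Id columns are INT NOT NULL AUTO_INCREMENT.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Locations (
    Id      INTEGER PRIMARY KEY,   -- auto-incrementing rowid in SQLite
    Name    VARCHAR(255),          -- location name
    BBC_Id  VARCHAR(10) NOT NULL   -- location number in the BBC URL
);
CREATE TABLE Historical (
    Id               INTEGER PRIMARY KEY,
    Scrape_Date      DATE,
    Date             DATE,
    Hour             INTEGER,
    Temperature_C    INTEGER,
    Weather          VARCHAR(30),
    Wind_Speed_Kph   INTEGER,
    Percent_Humidity INTEGER,
    Pressure_Mb      INTEGER,
    Visibility_Km    INTEGER,
    Location_Id      INTEGER REFERENCES Locations(Id)
);
""")
conn.execute("INSERT INTO Locations (Name, BBC_Id) VALUES (?, ?)",
             ("London", "1234567"))  # example BBC_Id, not a real one
row = conn.execute("SELECT Id, Name FROM Locations").fetchone()
```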
ERD Diagram: