Create some sort of web scraper and host it on AWS. For the purposes of this, the target of the scraping will be some sort of simple API.

The first task is to scrape something. For simplicity, we will scrape weather data at the University of Waterloo. We will record the data every day at 00:00, 06:00, 12:00 and 18:00 to be able to view the history of the weather's evolution. `scrape.py` does this with the help of `dailyScript.sh`. You can substitute this with anything you like, of course.
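The actual `scrape.py` and `dailyScript.sh` live in the repository below; as a rough sketch of the shape of such a wrapper script (the `data/` directory, the file naming, and the assumption that `scrape.py` writes its result to standard output are placeholders, not the repository's exact layout), it could look like:

```
#!/bin/bash
# dailyScript.sh (sketch): run the scraper and append its output to a dated file.
# Assumes scrape.py prints one line of weather data when run.
cd "$(dirname "$0")"              # run from the repository root regardless of caller
timestamp=$(date +%Y_%m_%d_%H)    # e.g. 2019_03_14_06
mkdir -p data
python scrape.py >> "data/${timestamp}.txt"
```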
We will launch this script on an AWS EC2 instance.

- Go to the Amazon AWS console.
- Go to Launch Instance.
- Select Amazon Linux and choose `t2.micro` (Free tier eligible), then Launch!
- Create a new key pair and Launch Instance.
- Connect to the instance following the appropriate instructions.
Once you are on your EC2 instance, run (you would have your own repository):

```
sudo yum upgrade
sudo yum install git
git clone https://github.com/mannyray/automaticWebScraper.git
cd automaticWebScraper
```
You now want to set up a cron job (a time-based job scheduler) to scrape data on a regular basis (four times daily for us). Run

```
crontab -e
```

and add:

```
CRON_TZ=America/New_York
0 0,6,12,18 * * * cd /home/ec2-user/automaticWebScraper; ./dailyScript.sh
```

This will record the weather every 6 hours, every day, in Waterloo's time zone (Eastern, hence `America/New_York`). Don't forget to make `dailyScript.sh` executable with `chmod u+x dailyScript.sh`.
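To sanity-check the schedule you can list the installed crontab and watch cron's log (the log path below is the usual Amazon Linux location; adjust it if yours differs):

```
crontab -l                  # confirm the entry was saved
sudo tail -f /var/log/cron  # watch the job fire at 00:00, 06:00, 12:00 and 18:00
```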
Since this scrape procedure runs unattended on AWS, the user needs to know if something goes wrong. One way this can be done is by setting up a Google Apps Script:
- Open up your Google Drive.
- Create a new Google Apps Script called `AlertMessage`.
- Add the following code to `code.gs`, where `your@gmail.com` is the Google email account on which you made the Apps Script:

```
function doGet(request) {
  MailApp.sendEmail("your@gmail.com", request.parameter.subject, request.parameter.message);
  var result = { sent: 0 == 0 };
  return ContentService.createTextOutput(JSON.stringify(result));
}
```

- Publish > Deploy as web app...
- Who has access to the app: Anyone, even anonymous.
- Press Deploy and give the appropriate permissions.
- Copy the full current web app URL: `https://script.google.com/macros/s/.../exec` (the `...` is the Google-generated portion).
- On your computer's terminal you can now send notifications via:

```
curl -L "https://script.google.com/macros/s/.../exec?subject=TITLE&message=UNDERSCORE_FOR_SPACES_INSIDE_MESSAGE"
```

which will cause your Gmail inbox to receive a new message.
We will add the `curl` line to the `script.sh` script.
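For example, the script could fire the alert only when the scrape fails. This is just a sketch; the `ALERT_URL` value and the output file are placeholders for whatever your script actually uses:

```
#!/bin/bash
# Sketch: email yourself through the Apps Script endpoint whenever the scrape fails.
ALERT_URL="https://script.google.com/macros/s/.../exec"   # your deployed web app URL
if ! python scrape.py >> data/latest.txt; then
    curl -L "${ALERT_URL}?subject=SCRAPE_FAILED&message=scrape.py_exited_with_an_error"
fi
```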
In addition to issuing warnings to yourself, backup is another important task. Since the data collected is fairly small, backing it up on GitHub is fine. The following will let you push to your GitHub repository without entering your password, which allows automatic backup.
- Create SSH keys (just press Enter all the way through):

```
cd ~
ssh-keygen -t rsa
```

- Go to github.com > Settings and copy the contents of your `~/.ssh/id_rsa.pub` into the field labeled 'Key'. Then point your repository's remote at the SSH URL:

```
git remote set-url origin git+ssh://git@github.com/username/reponame.git
```
The backup code is located within `dailyScript.sh`.
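The exact commands are in `dailyScript.sh` in the repository, but the backup step is presumably something along these lines (a sketch; the `data/` path and commit message are placeholders):

```
# Sketch: commit and push the newly scraped files so GitHub keeps a copy of the data.
git add data/
git commit -m "automatic backup $(date +%Y_%m_%d_%H)"
git push origin master   # no password prompt once the SSH key is registered on GitHub
```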
We want to be able to check on our data once in a while, view it visually, or run different queries against it. One way to do this is to set up a server that will only be accessible to us privately. This part is not necessary and not urgent to install, since the only thing we care about at the end of the day is the data itself; this portion can be done later.
Inside your server:

```
sudo yum update -y
sudo yum install -y httpd24 php56 mysql56-server php56-mysqlnd
sudo service httpd start
```
Once you have done this, you can view the default web page via

```
ssh -L 3000:localhost:80 -i "amazon_scrape.pem" ec2-user@(...)something.compute.amazonaws.com
```

In your local browser you can then go to http://localhost:3000/ to view the web page.
Run the following command to have the server start at every boot:

```
sudo chkconfig httpd on
```
Set up editing permissions:

```
sudo groupadd www
sudo usermod -a -G www ec2-user
sudo chown -R root:www /var/www
sudo chmod 2775 /var/www
find /var/www -type d -exec sudo chmod 2775 {} +
find /var/www -type f -exec sudo chmod 0664 {} +
```
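To confirm Apache and PHP are working, you can drop a test page into the web root, mirroring the AWS LAMP guide linked below (log out and back in first so the `www` group membership takes effect; the file name here is just an example):

```
# Create a simple test page in the Apache document root.
echo "<?php phpinfo(); ?>" > /var/www/html/phpinfo.php
```

Through the SSH tunnel above it will then be visible at http://localhost:3000/phpinfo.php. Delete the file once you have checked it, since it exposes server details.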
Database creation (following http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-LAMP.html):

```
sudo service mysqld start
sudo mysql_secure_installation
```

The default is no password, so press Enter, then press Y to set a new password (and enter it), as well as Y for all other options.
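As with the web server, you will probably want MySQL to come back after a reboot (same `chkconfig` mechanism as above):

```
sudo chkconfig mysqld on   # start the MySQL server automatically at every boot
```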
Now we are going to work on populating the server (see https://stackoverflow.com/questions/21033459/extracting-data-from-txt-files-to-import-in-a-mysql-database-with-php):

- First define your schema. For this case we will store `(YEAR, MONTH, DAY, HOUR, TEMPERATURE, ID)`, where `ID` will be the date in string format `YYYY_MM_DD_HH`. Running `database_setup/create_table.sql` will create the table with that schema via `mysql -u root -p < database_setup/create_table.sql`, where your password would be the one defined earlier (a rough sketch of such a file is shown after this list).
- Loading preexisting data. In the case that you end up setting up your database after saving some data in text files, you should run `database_setup/loadData.sh` in the root directory of the repository.
- Loading fresh data into the database:
  - ssh to your server and run `sudo vi /etc/ssh/sshd_config`. Find the line `#GatewayPorts no` and change the no to yes. Save and exit.
  - Restart the daemon: `sudo /etc/init.d/sshd restart`
  - To connect to your server, assuming you have a key: `ssh -L 3000:localhost:80 -i "yourkey.pem" ec2-user@ec2-(...)something.compute.amazonaws.com`
Additional details can be found at http://szonov.com/programming/2018/12/06/simple-scraper/
You don't want to overuse your data transfer, or else you will be charged extra money. Also review your instance's security group inbound/outbound rules so that only the traffic you actually need (for example, SSH) is allowed.