Table of Contents
- Exercise 1: Find Baby Names
- Exercise 2: Find Email Addresses
- Exercise 3: Find Twitter Usernames
- Exercise 4: Parse Access Logs
Exercise 1: Find Baby Names
Assemble 134 years of baby names pulled from Social Security Card Applications-National Level Data. The data is split by year into 134 different text files and should be added to MongoDB in the following format:
{
"_id" : "1948",
"info" : [
{
"name" : "Linda",
"num_occurrence" : "96210",
"sex" : "F"
},
{
"name" : "Zell",
"num_occurrence" : "5",
"sex" : "M"
}
]
}
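A minimal sketch of how one year's file could be turned into a document of this shape. It assumes each line in a yearly file looks like `Name,Sex,Count` (for example `Linda,F,96210`); the names `LINE_RE` and `build_year_document` are hypothetical and not taken from add_to_mongo.py.

```python
import re

# Assumed line layout in each yearly file: Name,Sex,Count (e.g. "Linda,F,96210").
LINE_RE = re.compile(r"^(?P<name>[A-Za-z]+),(?P<sex>[MF]),(?P<count>\d+)$")


def build_year_document(year, lines):
    """Return a dict shaped like the target MongoDB document for one year."""
    info = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        info.append({
            "name": match.group("name"),
            # Kept as a string to mirror the target format shown above.
            "num_occurrence": match.group("count"),
            "sex": match.group("sex"),
        })
    return {"_id": str(year), "info": info}


sample = ["Linda,F,96210", "Zell,M,5", "", "not a record"]
doc = build_year_document(1948, sample)
print(doc)
```

The resulting dict could then be stored with a call such as pymongo's `insert_one` on a collection of your choice; whether the actual script does it this way is not shown here.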
Navigate to regex_grounds and execute the following:
python exercise_1/add_to_mongo.py
- Make sure MongoDB is running on localhost
- Run pip install -r requirements.txt to install relevant modules
The /names directory is pulled from:
https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data
Exercise 2: Find Email Addresses
Parse https://www.data.gov/contact and return a list of all email addresses. Do not return any duplicates.
Navigate to regex_grounds and execute the following:
python exercise_2/find_addresses.py "url"
Multiple URLs are accepted and can be passed like this:
python exercise_2/find_addresses.py "url" "url2" "url3"
- Run pip install -r requirements.txt to install relevant modules
https://www.data.gov/contact should only return one email address.
http://www.fightthescams.com/2014/12/04/fake-job-postings-on-craigslist/ should return 4498 email addresses.
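As a rough illustration of the task, the sketch below extracts email addresses from a string of page text and drops duplicates while keeping first-seen order. The pattern is a deliberately simple approximation, not a full address validator, and the names `EMAIL_RE` and `find_emails` are hypothetical rather than taken from find_addresses.py. Treating addresses that differ only in letter case as duplicates is an assumption of this sketch.

```python
import re

# Simplified pattern: local part, "@", one or more domain labels, alphabetic TLD.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def find_emails(text):
    """Return unique email addresses in order of first appearance."""
    seen = set()
    unique = []
    for address in EMAIL_RE.findall(text):
        key = address.lower()  # assumption: compare case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append(address)
    return unique


html = (
    '<a href="mailto:help@example.com">help@example.com</a>, '
    "Bob@Example.org and bob@example.org."
)
emails = find_emails(html)
print(emails)
```

Fetching the page itself (for each URL passed on the command line) is left out so the example stays self-contained.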
Exercise 3: Find Twitter Usernames
Take a look at the Twitter username creation guidelines and return a list of all Twitter usernames on the page.
https://support.twitter.com/articles/101299-why-can-t-i-register-certain-usernames
As in Exercise 2, do not return any duplicates.
Navigate to regex_grounds and execute the following:
python exercise_3/find_usernames.py "url"
Multiple URLs are accepted and can be passed like this:
python exercise_3/find_usernames.py "url" "url2" "url3"
- Run pip install -r requirements.txt to install relevant modules
https://support.twitter.com/articles/101299-why-can-t-i-register-certain-usernames should return 4 usernames.
https://www.data.gov/contact should return 5 usernames.
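One possible way to approach the matching is sketched below. It assumes a username is an `@` followed by 1 to 15 letters, digits, or underscores, and it uses a lookbehind so that the domain part of an email address is not mistaken for a username. The names `USERNAME_RE` and `find_usernames` are hypothetical and not taken from find_usernames.py, and case-insensitive de-duplication is an assumption of this sketch.

```python
import re

# "@" must not follow a word character, "@" or "." (this rules out emails);
# the name is 1-15 characters and must not continue with more word characters.
USERNAME_RE = re.compile(
    r"(?<![A-Za-z0-9_@.])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])"
)


def find_usernames(text):
    """Return unique @usernames in order of first appearance."""
    seen = set()
    unique = []
    for name in USERNAME_RE.findall(text):
        key = name.lower()  # assumption: usernames compare case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append("@" + name)
    return unique


text = "Follow @jack and @Support, or mail me@example.com. Thanks again, @jack!"
usernames = find_usernames(text)
print(usernames)
```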
Exercise 4: Parse Access Logs
Split the following Nginx access log entry into its components:
192.168.1.12 - - [23/Jun/2015:11:10:57 +0000] "GET /entry/how-create-configure-free-ssl-certificate-using-django-and-pythonanywhere HTTP/1.1" 302 5 "http://www.reddit.com/r/Python/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.18 Safari/537.36" "192.168.1.12"
The components are:
$http_x_real_ip - [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for"
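The component list above maps naturally onto a regular expression with one named group per field. The following is a sketch under stated assumptions, not the exercise's reference solution: `LOG_RE` and `parse_log_line` are hypothetical names, and because the sample line carries an extra `-` field between the IP and the timestamp that the component list does not name, the pattern skips it as an optional token.

```python
import re

LOG_RE = re.compile(
    r'^(?P<http_x_real_ip>\S+) - (?:\S+ )?'   # IP, then an optional unnamed field
    r'\[(?P<time_local>[^\]]+)\] '            # [23/Jun/2015:11:10:57 +0000]
    r'"(?P<request>[^"]*)" '                  # "GET /path HTTP/1.1"
    r'(?P<status>\d{3}) '                     # 302
    r'(?P<body_bytes_sent>\d+) '              # 5
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)" '
    r'"(?P<http_x_forwarded_for>[^"]*)"$'
)


def parse_log_line(line):
    """Return a dict of the named components, or None if the line does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    '192.168.1.12 - - [23/Jun/2015:11:10:57 +0000] '
    '"GET /entry/how-create-configure-free-ssl-certificate-using-django-and-pythonanywhere HTTP/1.1" '
    '302 5 "http://www.reddit.com/r/Python/" '
    '"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/44.0.2403.18 Safari/537.36" "192.168.1.12"'
)
fields = parse_log_line(sample)
for key, value in fields.items():
    print(key, "=", value)
```

All values come back as strings; converting `status` and `body_bytes_sent` to integers, or `time_local` to a datetime, would be a separate step.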