Project Title: Wrangle and Analyze Data from @WeRateDogs
May 17, 2022
May 27, 2022
- Main Project File: Every data analysis step is documented in this file, from gathering data to analyzing and visualizing data
- Act Report: This report contains a brief exploratory data analysis, based on three research questions and their respective visualizations
- Wrangle Report: In this document, there is a detailed walkthrough of my data wrangling efforts, which was the focus of this endeavor
data/
: This folder contains the datasets used in the project, and a few screenshots from selected tweets I used in the Act Report.
This project was undertaken as part of Udacity's Data Analyst Nanodegree.
For this project, the goal was to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning was required for "Wow!"-worthy analyses and visualizations.
For this project, the gathering data phase involved three datasets:
-
archive
table: contains the information WeRateDogs sent Udacity and I downloaded directly. -
fav_rt
table: contains the information I gathered from Tweepy. Namely, tweet ID, favorite count and retweet count. -
image_prediction
table: WeRateDog's tweets were processed through a neural network that can identify dog breed. This dataset was downloaded programmatically.
After gathering data from multiple methods, I moved on to assessing and cleaning data.
I ended up with the following list of issues:
source
column includes the HTML tag- Erroneus datatypes (
timestamp
,in_reply_to_status_id
,in_reply_to_user_id
,retweeted_status_user_id
,retweeted_status_id
,retweeted_status_timestamp
) name
column has 109 rows with invalid, lower case wordsname
column has 705 rows in which a dog name wasn't listed but instead of a null value, there is a None string as an entry instead- In the
text
column, because of the string quotation marks, the url link breaks if one tries to click on it doggo
,pupper
,puppo
,floofer
: columns have rows in which a dog stage wasn't listed but instead of a null value, there is a None string as an entry insteaddoggo
,pupper
,puppo
,floofer
: erroneous data type, string instead of bool- Retweets should be dropped
p1
,p2
,p3
: dog breeds are separated by '_' instead of spaces ' 'p1
,p2
,p3
: some dog breed are not capitalized- The neural network apparently identified that some images are not from dogs
- The dataset could be tidily represented with the columns
favorite_count
andretweet_count
in thearchive
table - Four variables in one column in the
archive
table (dog_stage
, i.e., doggo, floofer, pupper, and puppo) - The dataset could be tidily represented joining the
image_prediction
table into thearchive
table to create one master dataset
For detailed information on how each issue was cleaned, please check the main project file or the brief report about my wrangling efforts.
Once the master dataset was cleaned, I moved on to a brief exploratory data analysis.
With the information available, I came up with three research questions:
The top 5 most favorited breeds are: Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Samoyed.
This insight was only possible after piercing together and cleaning the master dataset. I had to group by dog breed and then get the sum of favorited tweets. Both of those columns were originally in separate datasets.
The Golden Retriever sits comfortably in the Top 1. Further analysis could entail figuring out whether it was the most favorited dog breed simply due to the fact a disproportionate amount of tweets picture a dog from this breed.
For this visualization, I grouped by dog breed and found the amount of times tweets with a specific dog breed was favorited.
2.
Does the dog stage influence the number of interactions a tweet receive (favorites and retweets)?
According to WeRateDog's dogtionary, the four dog stages are: doggo, pupper, puppo and floofer.
One might think that puppers (small dogs, usually younger) get the most interactions, but the data suggests that the number of interactions puppers received are almost the same as doggos (a big pupper, usually older).
In a distant third, comes the puppos (a transitional phase between puppo and doggos). Lastly, the floofers (any dogs with seemingly excess fur).
As a limitation, it's worth mentioning that only a small number of tweets from the sample included the dog stage. If we had dog stage information for the whole sample, the results might have been different.
This analysis was only possible after tidying the dog stage columns and turning them into one.
For this visualization, I grouped by dog breed and got the sum of what I called interactions. For the purpose of this analysis, interactions are the sum of the amount of times a tweet was favorited and the amount of times a tweet was retweeted.
Although we only have tweet information from 2016 to 2017, the data seem to suggest the average rating for dogs is increasing through time.
Without converting the timestamp column to datetime, this analysis wouldn't be possible.
For this visualization, I grouped by trimester and found the average rating for tweets with only one dog. Then, I plotted a line chart.
For detailed information on how each analysis was performed and their respective visualizations, please check the main project file or the brief report about the exploratory data analysis.