Real-world data rarely comes clean. Using Python and its libraries, I will gather data from three sources and in a variety of formats, assess its quality and tidiness, then clean it. This process is called data wrangling. I document my wrangling efforts in this Jupyter Notebook and showcase the results through analyses and visualizations built with Python and its libraries.
I gathered data from three different sources:
- The WeRateDogs Twitter archive, downloaded manually from Udacity. After downloading the file ("twitter_archive_enhanced.csv"), I read it into my notebook with pandas' "read_csv" function and stored it as "t_archive". I chose "t_archive" because a short name is easier to work with.
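The read step above can be sketched as follows. The real notebook reads the file downloaded from Udacity; here a tiny stand-in CSV with assumed column names is created first so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the downloaded archive (column names are assumptions
# illustrating the shape of the real file).
sample = (
    "tweet_id,timestamp,rating_numerator,rating_denominator\n"
    "123,2017-01-01 00:00:00,12,10\n"
)
with open("twitter_archive_enhanced.csv", "w") as f:
    f.write(sample)

# Read the archive into a DataFrame with a short, easy-to-type name.
t_archive = pd.read_csv("twitter_archive_enhanced.csv")
print(t_archive.shape)  # (1, 4)
```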
- The tweet image predictions file, downloaded programmatically with the Requests library from a URL provided by Udacity. I used the library's "get" function to retrieve the file ("image_predictions.tsv") from that URL, read its contents into a pandas DataFrame, and stored it as "pred", again a short name.
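A minimal sketch of that programmatic download, wrapped in a helper function. The URL is a placeholder; the real one comes from the Udacity project instructions.

```python
import requests

def download_file(url, path):
    """Fetch a file over HTTP with requests and save its raw bytes to disk."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a bad status code
    with open(path, "wb") as f:
        f.write(response.content)

# In the notebook (udacity_url is the provided link):
# download_file(udacity_url, "image_predictions.tsv")
# pred = pd.read_csv("image_predictions.tsv", sep="\t")
```

Note the tab separator when reading: the predictions file is a TSV, so `read_csv` needs `sep="\t"`.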
- Additional data retrieved by querying Twitter's API with the Tweepy library. Here comes the tricky part. After obtaining the required credentials from Twitter, I queried each tweet ID stored in "t_archive" for its favorite count and retweet count using a for loop, storing the result in "tweets_id_data" when the tweet ID was found and in "notweets_id_data" when it was not. I saved the final result to a file called "tweets_json.txt" and loaded it into my notebook as "tweet_data".
- The process took 3,047 seconds to complete
- 2,296 tweet IDs were found
- 60 tweet IDs were not found
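The query loop can be sketched like this. Calling the real API needs Tweepy and credentials, so `fetch_tweet` below is a stand-in that mimics "found" vs "not found" tweets; in the notebook the call would be Tweepy's status lookup for each ID, wrapped in a try/except on the Tweepy error class.

```python
import json

def fetch_tweet(tweet_id):
    # Stand-in for the real API call (hypothetical data, not Tweepy).
    known = {123: {"favorite_count": 10, "retweet_count": 4}}
    if tweet_id not in known:
        raise ValueError("tweet not found")
    return known[tweet_id]

tweets_id_data, notweets_id_data = [], []
with open("tweets_json.txt", "w") as f:
    for tweet_id in [123, 456]:          # in the notebook: t_archive.tweet_id
        try:
            status = fetch_tweet(tweet_id)
        except ValueError:
            notweets_id_data.append(tweet_id)  # deleted/protected tweets
            continue
        tweets_id_data.append(tweet_id)
        # One JSON object per line makes the file easy to read back later.
        f.write(json.dumps({"id": tweet_id, **status}) + "\n")
```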
- t_archive data - Timestamp column is stored as an object (string) instead of datetime
- t_archive data - Lots of dog rating numerators greater than 15
- t_archive data - Lots of dog rating denominators greater than 10
- t_archive data - The source column needs cleaning
- t_archive data - Columns with missing values hold the string "None", which pandas does not treat as Null/NaN (a "null" object, not an actual null)
- pred data - The names in the dog breed predictions are not uniform: some are lowercase while others are mixed case
- t_archive data - Some dogs have invalid names
- pred data - Three dog breed predictions are available
- all_merged - Retweets and replies are present
- all_merged - Unnecessary columns
- all_merged - Tweets without images
- all_merged - Wrongly assigned datatypes
- t_archive - Dog stage is scattered across 4 columns
- General - The three datasets have tweet id in common
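A few of the cleaning steps above can be sketched on toy data (column names such as "doggo", "floofer", "pupper", "puppo", and "p1" are assumptions based on the project's files):

```python
import numpy as np
import pandas as pd

# Toy stand-in for t_archive with the issues noted above.
t_archive = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-02 00:00:00"],
    "doggo": ["None", "doggo"],
    "floofer": ["None", "None"],
    "pupper": ["pupper", "None"],
    "puppo": ["None", "None"],
})

# Replace the string "None" with a real missing value.
t_archive = t_archive.replace("None", np.nan)

# Fix the datatype: timestamp string -> datetime.
t_archive["timestamp"] = pd.to_datetime(t_archive["timestamp"])

# Collapse the four dog-stage columns into one tidy column.
stages = ["doggo", "floofer", "pupper", "puppo"]
t_archive["dog_stage"] = (
    t_archive[stages]
    .apply(lambda row: ", ".join(row.dropna()), axis=1)
    .replace("", np.nan)
)
t_archive = t_archive.drop(columns=stages)

# Merge the datasets on the shared tweet_id key (pred sketched inline).
pred = pd.DataFrame({"tweet_id": [1, 2], "p1": ["golden_retriever", "pug"]})
all_merged = t_archive.merge(pred, on="tweet_id", how="inner")
```

An inner merge on "tweet_id" also drops tweets missing from either side, which is one way to remove tweets without an image prediction.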
- How is the dog stage distributed in the data?
- What are the top 10 dog breeds?
- How are tweets distributed by source?
- Visualize the high correlation between favorite and retweet counts
- What is the distribution of dog ratings?
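The distribution questions above reduce to value counts on the merged frame. A minimal sketch on hypothetical data (the "dog_stage" and "p1" column names are assumptions):

```python
import pandas as pd

# Hypothetical cleaned frame standing in for all_merged.
all_merged = pd.DataFrame({
    "dog_stage": ["pupper", "pupper", "doggo"],
    "p1": ["golden_retriever", "pug", "golden_retriever"],
})

# Share of each dog stage, as a percentage.
stage_share = all_merged["dog_stage"].value_counts(normalize=True) * 100

# Top 10 predicted breeds by count (ready to pass to a bar plot).
top_breeds = all_merged["p1"].value_counts().nlargest(10)
```

Each of these Series can be plotted directly with `.plot(kind="bar")` for the visualizations.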
- "Pupper" was the most common dog stage, accounting for a whopping 66.6% of all dog stages.
- The most popular dog breed was the "Golden Retriever".
- iPhones dominated as the source of tweets.
- The higher the favorite count, the more retweets a post gets.
- 12/10 is the most common rating given to dogs rated by WeRateDogs.