Real-world data rarely comes clean. Using Python and its libraries, this project aims to gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it.
The dataset that we will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10.
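As a rough illustration of the rating format described above, the sketch below pulls a numerator and denominator out of a piece of tweet text with a regular expression. The pattern, the function name `extract_rating`, and the sample text are assumptions made for this example and are not taken from the project notebook.

```python
import re

# Assumed pattern: an integer or decimal numerator, a slash, an integer denominator.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) for the first rating-like fraction, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Made-up tweet text, used only to demonstrate the function.
sample = "This is Bella. She is a very good girl. 13/10"
print(extract_rating(sample))
```

Taking only the first match is a simplification: text that contains another fraction before the rating (a date, for example) would be misread, which is one reason the extracted ratings are worth assessing before analysis.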
The project consists of the following files:
wrangle_act.ipynb: code for gathering, assessing, cleaning, analyzing, and visualizing data.
wrangle_report.pdf: documentation for data wrangling steps: gather, assess, and clean.
act_report.pdf: documentation of analysis and insights into final data.
twitter_archive_enhanced.csv: file containing the original data before wrangling, as given.
image_predictions.tsv: file downloaded programmatically.
tweet_json.txt: file constructed via Twitter API.
twitter_archive_master.csv: combined and cleaned Tweets information data.
image_predictions_clean.csv: combined and cleaned image prediction data.
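
The input files above come in three formats: comma-separated, tab-separated, and JSON text. The following is a minimal sketch of reading each format with the Python standard library. It assumes that tweet_json.txt holds one JSON object per line, and the column and field names in the demo data (`tweet_id`, `jpg_url`, `id`, `retweet_count`) are placeholders; neither assumption is confirmed by this README, and the notebook may well use a different approach.

```python
import csv
import io
import json


def read_delimited(handle, delimiter=","):
    """Read a delimited text file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def read_json_lines(handle):
    """Read one JSON object per non-empty line (assumed layout of tweet_json.txt)."""
    return [json.loads(line) for line in handle if line.strip()]


# In-memory stand-ins for the real files; the contents are invented for the demo.
tsv_demo = io.StringIO("tweet_id\tjpg_url\n1\thttps://example.com/a.jpg\n")
json_demo = io.StringIO('{"id": 1, "retweet_count": 5}\n\n{"id": 2, "retweet_count": 7}\n')

predictions = read_delimited(tsv_demo, delimiter="\t")
tweets = read_json_lines(json_demo)
print(predictions[0]["tweet_id"], len(tweets))
```

With the real files, the same functions could be called on handles opened with, for example, `open("image_predictions.tsv", newline="", encoding="utf-8")`.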