Data was gathered from 3 different sources:
- The WeRateDogs Twitter archive, provided by Udacity as a CSV file.
- The image prediction file, in TSV format, downloaded programmatically using the Requests library from the URL provided by Udacity.
- Additional data retrieved by querying Twitter's API using the Tweepy library.
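The programmatic download of the TSV file can be sketched as below. The URL in the comment is a placeholder, not the actual one Udacity provides, and the helper names are illustrative:

```python
import requests

def filename_from_url(url):
    """Derive a local file name from the last path segment of a URL."""
    return url.split("/")[-1]

def download_file(url):
    """Download a file with Requests and save it under its own name."""
    response = requests.get(url)
    response.raise_for_status()  # stop on HTTP errors instead of saving bad content
    path = filename_from_url(url)
    with open(path, "wb") as f:
        f.write(response.content)
    return path

# Placeholder URL for illustration only:
# download_file("https://example.com/image-predictions.tsv")
```

The Tweepy step follows the same pattern: authenticate with the API keys, then query each tweet ID and store the JSON responses before loading them into a DataFrame.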
After gathering the data and storing it in DataFrames, the next step was assessing the data for quality and tidiness issues.
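Programmatic assessment typically relies on a few pandas calls; a minimal sketch, using a toy DataFrame with illustrative values rather than the real archive:

```python
import pandas as pd

# Toy DataFrame standing in for the gathered data (illustrative values only).
df = pd.DataFrame({
    "tweet_id": [1, 1, 2],                   # duplicate row: a quality issue
    "rating_numerator": ["13", "13", "12"],  # stored as strings: wrong dtype
})

df.info()                                    # dtypes and missing values at a glance
duplicate_rows = df.duplicated().sum()       # count fully duplicated rows
print(duplicate_rows)
```

Findings from calls like these are what get written up as the quality and tidiness issues to fix.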
Cleaning is the process of fixing and resolving the issues identified during assessment. The (define, code, and test) steps were used for each issue. First, copies of the DataFrames were created so the originals stayed intact; the cleaning steps were then applied iteratively to all issues.
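One define-code-test iteration might look like the following sketch; the column names and values are invented for illustration, not taken from the actual archive:

```python
import pandas as pd

archive = pd.DataFrame({"tweet_id": [1, 2], "rating": ["13/10", "12/10"]})

# Define: the rating column mixes numerator and denominator in one string
# (a tidiness issue); split it into two numeric columns.

# Code: copy first so the original DataFrame is preserved, then clean the copy.
archive_clean = archive.copy()
parts = archive_clean["rating"].str.split("/", expand=True)
archive_clean["rating_numerator"] = parts[0].astype(int)
archive_clean["rating_denominator"] = parts[1].astype(int)
archive_clean = archive_clean.drop(columns="rating")

# Test: verify the fix took effect before moving to the next issue.
assert list(archive_clean.columns) == [
    "tweet_id", "rating_numerator", "rating_denominator"
]
```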
The final DataFrame, called 'twitter_archive_clean', holds the data with the correct data types. The dataset was then stored in a CSV file called 'twitter_archive_master.csv'. At this point, the data was successfully wrangled and therefore ready for analysis and visualization.
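The storage step reduces to a single pandas call; the toy DataFrame below stands in for the real cleaned archive, and the round-trip read is just a sanity check:

```python
import pandas as pd

# Stand-in for the cleaned DataFrame (illustrative values only).
twitter_archive_clean = pd.DataFrame(
    {"tweet_id": [1, 2], "rating_numerator": [13, 12]}
)

# index=False keeps the pandas index out of the CSV file.
twitter_archive_clean.to_csv("twitter_archive_master.csv", index=False)

# Reload to confirm the file round-trips cleanly.
reloaded = pd.read_csv("twitter_archive_master.csv")
```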
These steps are not part of the data wrangling process; however, analysis cannot yield correct and accurate insights without performing data wrangling first. Visualizations and insights are provided in 'act_report.pdf'.