
Udacity-DAND-Data-Wrangling-Project

Udacity Data Analyst Nanodegree Project 4

This project is about wrangling data from the Twitter account WeRateDogs. Wrangling is a process that consists of three stages: 1. Gathering, 2. Assessing, 3. Cleaning.

Gathering was done from three sources. The first, the Twitter archive of the account, was easy to gather because it had already been collected by Udacity. This CSV file contains basic information (tweet ID, timestamp, tweet text, etc.) for 5000+ tweets up to August 1, 2017. The second file contains breed predictions for the dogs in the tweet images, produced by a convolutional neural network that analyses each dog image and predicts its breed. This file was also provided by Udacity, so it too was easy to gather, but it needed to be saved programmatically, which was done using the os and requests libraries (a sketch of this step appears below). The third source was the hardest to gather: the Twitter API had to be queried for each tweet_id in the archive to retrieve extra information about the tweets, the most useful pieces being the retweet count and favorite count. This was the most challenging part because I had no prior experience with the Twitter API, but something course instructor David Venturi emphasized turned out to be vital: good searching skills. After spending a few hours searching, I was able to find the documents I needed to gather the data through the API, which I have mentioned in the Jupyter Notebook. For me, gathering was the most difficult step, especially finding relevant materials for dealing with the Twitter API.

After gathering the data, it was time to assess and clean it. I assessed the data both visually and programmatically, applying every method I knew to be useful to all three dataframes, and then documented the quality and tidiness issues of each. Assessing was the comparatively easier step, as I just revisited the assessing chapter of the course and followed its steps. Most of the quality issues were datatype issues, and most of the tidiness issues involved merging several columns into one or splitting one column into many. One notable quality issue was that whenever a rating was given as a decimal, only the digits after the decimal point were captured; this was corrected afterwards using regular expressions (see the sketch below). In cleaning, methods like replace(), drop(), strftime(), notnull(), reindex(), pd.merge(), etc. were used.

In summary, I enjoyed doing this project; gathering JSON data through the Twitter API was the most challenging part.
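The programmatic download of the image-predictions file is a simple combination of requests and os. A minimal sketch, with a placeholder URL standing in for the Udacity-hosted address given in the course materials:

```python
import os
import requests

# Placeholder: substitute the image-predictions URL from the course materials
url = 'https://example.com/image-predictions.tsv'

folder = 'data'
os.makedirs(folder, exist_ok=True)  # create the target folder if it is missing

response = requests.get(url)
response.raise_for_status()  # fail loudly on a bad download

# save the file under its original name, taken from the end of the URL
with open(os.path.join(folder, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
```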
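For the API step, each tweet_id is queried and the returned JSON is stored for later parsing. A minimal sketch, assuming the tweepy library (v3.x); the credentials and the tweet ID shown are placeholders:

```python
import json
import tweepy

# Placeholder credentials from a Twitter developer account
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

tweet_ids = [123456]  # in practice, every tweet_id from the archive CSV

# query each tweet and store its full JSON, one object per line
with open('tweet_json.txt', 'w') as file:
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(status._json, file)
            file.write('\n')
        except tweepy.TweepError:  # deleted or protected tweets
            print(f'Failed to fetch {tweet_id}')
```

Each line of tweet_json.txt can then be parsed with json.loads and reduced to tweet_id, retweet_count, and favorite_count to build the third dataframe.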
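The decimal-rating fix re-extracts the rating from the tweet text with a pattern that allows an optional decimal part. A minimal sketch on a hypothetical one-row dataframe reproducing the issue:

```python
import pandas as pd

# hypothetical row reproducing the issue: 13.5/10 was captured as 5/10
df = pd.DataFrame({'text': ['This is Bella. 13.5/10 would pet'],
                   'rating_numerator': [5]})

# re-extract numerator and denominator, allowing an optional decimal part
ratings = df['text'].str.extract(r'((?:\d+\.)?\d+)/(\d+)')
df['rating_numerator'] = ratings[0].astype(float)
df['rating_denominator'] = ratings[1].astype(int)
```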
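The cleaning methods listed above culminate in joining the three cleaned dataframes on tweet_id. A minimal sketch with hypothetical column subsets (the real dataframes carry many more fields):

```python
import pandas as pd

# hypothetical column subsets of the three cleaned dataframes
archive = pd.DataFrame({'tweet_id': [1, 2], 'rating_numerator': [13.5, 12.0]})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['labrador', 'pug']})
api_data = pd.DataFrame({'tweet_id': [1, 2],
                         'retweet_count': [532, 210],
                         'favorite_count': [2410, 980]})

# inner joins on tweet_id keep only tweets present in all three sources
master = archive.merge(predictions, on='tweet_id') \
                .merge(api_data, on='tweet_id')
```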