Data_Wrangle

My wrangle report

From my point of view so far, data wrangling is the backbone of data analysis. Without it, the insights and explorations drawn from the data are likely to be misleading. Most datasets are both dirty and untidy, and the two are not the same: dirtiness is about quality issues in the content, while untidiness is about structure. This data wrangling project aims to test my knowledge of gathering, assessing, cleaning and (optionally) storing data. A dataset was given, and the objective of the project is to look into that dataset, gather more data from other sources, assess and clean them, and then combine them into a neat, high-quality dataset.

The first step was gathering. I imported the necessary packages and loaded the 'twitter-archive-enhanced.csv' dataset. I also programmatically downloaded the Twitter image predictions file. To complete the gathering phase I was meant to query the Twitter API directly, but since I couldn't get access, I went with the alternative provided, which is reading the tweet data from a JSON text file.
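As a rough illustration of that last gathering step, here is a minimal sketch of reading tweet data from a JSON text file. It assumes the file holds one JSON object per line and that the fields of interest are `id`, `retweet_count` and `favorite_count`; those field names, the `read_tweet_json` helper and the sample data are all hypothetical, and the notebook itself may do this differently.

```python
import io
import json


def read_tweet_json(lines):
    """Parse one JSON object per line and keep a few fields (assumed names)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            # .get() returns None when a count is absent instead of failing
            "retweet_count": tweet.get("retweet_count"),
            "favorite_count": tweet.get("favorite_count"),
        })
    return records


# In-memory stand-in for the real file (made-up values).
# With an actual file, pass an open file handle instead of StringIO.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweets = read_tweet_json(sample)
```

Reading line by line keeps memory use low and makes it easy to skip or log a malformed record without losing the rest of the file.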

The second step was assessing the data gathered so far. I was expected to check through the datasets visually and programmatically, find any quality and tidiness issues, and document them. I detected 8 quality issues, mostly in the twitter archive enhanced dataset, and 2 tidiness issues: one in the twitter archive, and the other being the need to join the datasets. One has to be patient, calm and observant while assessing datasets.
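To give an idea of what a programmatic check can look like, the sketch below counts missing values per column and lists duplicated IDs using only the Python standard library. The column names, the sample rows and the choice to treat the string "None" as missing are assumptions made for illustration; they are not the issues documented in the notebook.

```python
import csv
import io
from collections import Counter


def assess(rows, id_column):
    """Return (missing-value counts per column, sorted list of duplicated IDs)."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat empty strings and the placeholder "None" as missing (assumption)
            if value is None or value.strip() in ("", "None"):
                missing[column] += 1
    id_counts = Counter(row[id_column] for row in rows)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    return dict(missing), duplicates


# Made-up sample standing in for a real CSV file
sample = io.StringIO(
    "tweet_id,name,rating_denominator\n"
    "1,Charlie,10\n"
    "2,None,10\n"
    "2,None,10\n"
    "3,,10\n"
)
rows = list(csv.DictReader(sample))
missing, duplicates = assess(rows, "tweet_id")
```

Checks like these complement visual inspection: they surface problems that are easy to miss when scrolling through rows.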

The third and final wrangling step was cleaning the issues documented during assessment. I followed the expected cleaning workflow, which is "Define" ---> "Code" ---> "Test". Defining means writing down how I will go about the cleaning, coding is the actual cleaning, and testing checks whether the code worked as expected. I didn't clean the issues in the order I listed them in the documentation, because I realized some cleaning steps need to be done before others.
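The "Define" ---> "Code" ---> "Test" cycle can be sketched on a single made-up issue, a timestamp stored as text. The rows, the column name and the timestamp format below are assumptions for illustration only, not one of the issues from the notebook.

```python
from datetime import datetime

# Made-up rows standing in for the dirty dataset
rows = [
    {"tweet_id": "1", "timestamp": "2017-08-01 16:23:56 +0000"},
    {"tweet_id": "2", "timestamp": "2017-07-31 00:18:03 +0000"},
]

# Define: `timestamp` is stored as a string; convert it to a datetime.

# Code: clean a copy so the original data stays untouched.
clean = [dict(row) for row in rows]
for row in clean:
    row["timestamp"] = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S %z")

# Test: every cleaned value is a datetime, and the original is unchanged.
assert all(isinstance(row["timestamp"], datetime) for row in clean)
assert isinstance(rows[0]["timestamp"], str)
```

Working on a copy makes it possible to rerun a cleaning step from the original data when a test fails, which also helps when one fix has to be applied before another.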