TH-ET-Internship-Tasks

This is a repository where I'll be uploading tasks/projects assigned to me during my Data Analytics Internship at TechnoHacks EduTech.

  1. Task 1 - Data Cleaning - For this task, I had to download the iconic Titanic dataset from Kaggle and clean it. This dataset is not only historically significant but also a rich source of insights waiting to be uncovered. I tried two approaches. The first was to remove missing values by deletion: I dropped the column where more than 75% of the values were missing and then used the dropna() method to remove the remaining null values, which reduced the size of the dataset. The second approach was to impute the missing values using fillna(): I filled the missing values in the Cabin column with 'Unknown', the Age column with the mean age, and the Embarked column with the most common value. This kept the dataset's size intact, with the number of rows and columns unchanged. (A sketch of both approaches appears after this list.)

  2. Task 2 - Summary Statistics - For this task, I had to download the iconic Titanic dataset from Kaggle and calculate its summary statistics. I first did the basic operations: loading the CSV file into a DataFrame and using the dtypes attribute and the nunique() method to get an idea of the datatypes and the number of unique values in each column. Then I divided the task into two parts. The first was to calculate the summary statistics of the numeric columns using the default describe() method; since describe() doesn't report the mode, I used mode(numeric_only=True).iloc[0] to calculate it separately. In the second part, I calculated the summary statistics of the non-numeric columns using describe(include=['O']). Thus, I completed my second task of this Data Analytics internship. (See the second sketch after this list.)

  3. Task 3 - Removing Duplicates - For this task, I had to remove duplicate values from an iris dataset. But as that dataset didn't have any duplicate values, I downloaded the reserved sample dataset (an employee dataset) instead. I then used essential pandas functions like head(), info(), describe(), etc. to gain an understanding of the data. I converted the date columns to datetime format using pd.to_datetime(), converted the salary column to int by removing '$' and ',', and changed the datatype of the bonus_% column to int. While some job titles might be repeated, they cannot be considered duplicates, since many people can hold the same job; instead, we have to identify a column that has a unique value for each employee, which in this case is the employee id (eeid). I used the duplicated() method with eeid as the value of its subset argument to get a series of Boolean values flagging duplicate rows, counted them with df.duplicated(subset=['eeid'], keep='first').sum(), and then dropped them using df.drop_duplicates(subset=['eeid'], keep='first'). Thus, I completed the third task of the Data Analytics Internship. (See the third sketch after this list.)
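
A minimal sketch of the two Task 1 approaches, assuming the standard Kaggle `train.csv` with its Cabin, Age, and Embarked columns (the file name and 75% threshold reflect the description above, not a prescribed setup):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # standard Kaggle Titanic training file

# Approach 1: deletion. Drop any column that is more than 75% missing,
# then drop the remaining rows that still contain nulls.
dropped = df.loc[:, df.isna().mean() <= 0.75].dropna()
print(dropped.shape)  # fewer rows (and one fewer column) than the original

# Approach 2: imputation. Fill missing values instead of removing them,
# so the shape of the DataFrame stays intact.
imputed = df.copy()
imputed["Cabin"] = imputed["Cabin"].fillna("Unknown")
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])
print(imputed.shape)  # same shape as the original
```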
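
A sketch of the Task 2 workflow, under the same assumption about the input file:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Basic orientation: datatypes and unique-value counts per column.
print(df.dtypes)
print(df.nunique())

# Part 1: summary statistics for the numeric columns.
print(df.describe())

# describe() omits the mode, so compute it separately; .iloc[0] keeps
# only the first mode when a column has several equally common values.
print(df.mode(numeric_only=True).iloc[0])

# Part 2: summary statistics for the non-numeric (object) columns.
print(df.describe(include=["O"]))
```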
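
And a sketch of the Task 3 deduplication. The file name `employees.csv` and the `hire_date` column are hypothetical stand-ins; only `salary`, `bonus_%`, and `eeid` are named in the task description above:

```python
import pandas as pd

df = pd.read_csv("employees.csv")  # hypothetical file name

# Type fixes: parse the date column, strip '$' and ',' from salary,
# and cast bonus_% to int (assuming it already holds numeric values).
df["hire_date"] = pd.to_datetime(df["hire_date"])
df["salary"] = (df["salary"].str.replace("$", "", regex=False)
                            .str.replace(",", "", regex=False)
                            .astype(int))
df["bonus_%"] = df["bonus_%"].astype(int)

# Job titles may legitimately repeat, so deduplicate on the unique
# employee id (eeid) instead.
dupes = df.duplicated(subset=["eeid"], keep="first")  # Boolean series
print(dupes.sum())  # number of duplicate rows

df = df.drop_duplicates(subset=["eeid"], keep="first")
```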