Exno:1 Data Cleaning Process using Python

AIM

To read the given data and perform data cleaning and save the cleaned data to a file.

Explanation

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. It is not simply about erasing data, but about maximizing a dataset's accuracy without necessarily deleting information.
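As a minimal sketch of these categories, the snippet below repairs a small table instead of discarding rows wholesale. The column names and values are hypothetical and are not taken from SAMPLEDS.csv.

```python
import pandas as pd

# Hypothetical records: inconsistent formatting, a duplicate, and a missing mark
raw = pd.DataFrame({
    "name": [" asha", "Ravi ", "ravi", "Meena"],
    "mark": ["78", "91", "91", None],
})

clean = raw.copy()
# Improperly formatted: trim whitespace and normalise capitalisation
clean["name"] = clean["name"].str.strip().str.title()
# Incorrect type: convert marks from text to numbers (unparseable or missing -> NaN)
clean["mark"] = pd.to_numeric(clean["mark"], errors="coerce")
# Duplicated: after normalising, the two "Ravi" rows are identical
clean = clean.drop_duplicates().reset_index(drop=True)
# Incomplete: fill the missing mark with the column mean instead of deleting the row
clean["mark"] = clean["mark"].fillna(clean["mark"].mean())

print(clean)
```

Only the duplicate row is removed here; the row with the missing mark is kept and repaired, which is the "without necessarily deleting" idea in practice.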

Algorithm

STEP 1: Read the given data

STEP 2: Get information about the data

STEP 3: Remove the null values from the data

STEP 4: Save the cleaned data to a file

STEP 5: Remove outliers using the IQR

STEP 6: Use the Z score to detect and remove outliers
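Steps 1 to 4 can be sketched end to end as follows. This is only an illustration under stated assumptions: the inline CSV text stands in for the real input file, and the output file name `cleaned_sample.csv` is hypothetical.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the input CSV (illustrative values only)
csv_text = """NAME,TOTAL,AVG
A,270,90
B,,85
C,240,80
"""

# STEP 1: read the data
df = pd.read_csv(io.StringIO(csv_text))

# STEP 2: get information about the data
df.info()
print(df.isnull().sum())

# STEP 3: remove rows that contain null values
cleaned = df.dropna(how="any")

# STEP 4: save the cleaned data to a file (file name is an assumption)
out_path = os.path.join(tempfile.gettempdir(), "cleaned_sample.csv")
cleaned.to_csv(out_path, index=False)

# Read the file back to confirm what was written
saved = pd.read_csv(out_path)
print(saved)
```

Passing `index=False` keeps the pandas row index out of the saved file, so reading it back yields the same columns as the input.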

CODING AND OUTPUT:

            Developed by: Sathish R
            Regno: 212222100048

1) Read and display DataFrame

import pandas as pd
df=pd.read_csv('/content/SAMPLEDS.csv')
df

OUTPUT:

Output

2) Display head

df.head()

OUTPUT:

Output

3) Display tail

df.tail()

OUTPUT:

Output

4) Info of dataframe

df.info()

OUTPUT:

Output

5) Describe the DataFrame

df.describe()

OUTPUT:

Output

6) Shape of the dataframe

df.shape

OUTPUT:

Output

7) Checking the null values

df.isnull().sum()

OUTPUT:

Output

8) Drop the Null values

x=df.dropna(how='any')
x

OUTPUT:

Output

9) Drop the rows with null values in TOTAL

tot=df.dropna(subset=['TOTAL'],how='any')
tot

OUTPUT:

Output
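The difference between the two `dropna` calls in steps 8 and 9 is easier to see on a small table. The sketch below uses hypothetical marks (not the SAMPLEDS.csv data) and also shows `how='all'` for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical marks table; TOTAL and AVG each have one missing value
marks = pd.DataFrame({
    "NAME": ["A", "B", "C", "D"],
    "TOTAL": [270.0, np.nan, 240.0, 255.0],
    "AVG": [90.0, 85.0, np.nan, 85.0],
})

# how='any' drops a row if any of its columns is null
any_null = marks.dropna(how="any")
# subset=['TOTAL'] only inspects TOTAL, so the row missing AVG survives
total_only = marks.dropna(subset=["TOTAL"], how="any")
# how='all' drops a row only when every column is null (none here)
all_null = marks.dropna(how="all")

print(len(marks), len(any_null), len(total_only), len(all_null))  # 4 2 3 4
```

None of these calls modifies `marks`; each returns a new DataFrame, which is why the results are assigned to new names.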

10) Fill the null values

df.fillna(0)

OUTPUT:

Output

11) Finding the mean value

mn=df.TOTAL.mean()
mn

OUTPUT:

Output
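One common use of the column mean is as a replacement for missing entries, instead of the constant 0 used in step 10. The following is a sketch with hypothetical TOTAL values, not the author's data.

```python
import numpy as np
import pandas as pd

# Hypothetical TOTAL column with one missing entry
scores = pd.DataFrame({"TOTAL": [270.0, np.nan, 240.0, 255.0]})

# mean() skips NaN by default: (270 + 240 + 255) / 3 = 255.0
mn = scores["TOTAL"].mean()

# fillna returns a new object, so the result must be assigned to take effect
filled = scores.copy()
filled["TOTAL"] = filled["TOTAL"].fillna(mn)

print(mn)
print(filled)
```

Filling with the mean leaves the column mean unchanged, whereas filling with 0 pulls it down.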

12) Final output

# Drop every row whose AVG exceeds 100 in one vectorized step
df.drop(df[df["AVG"]>100].index,inplace=True)
df

OUTPUT:

Output

14) Outlier detection and removal

import pandas as pd
import seaborn as sns
age=[1,3,28,27,25,92,30,39,40,50,26,24,29,94]
dff=pd.DataFrame(age)
dff

OUTPUT:

image

15) Boxplot

sns.boxplot(data=dff)

OUTPUT:

image

16) Scatterplot

sns.scatterplot(data=dff)

OUTPUT:

image

17) IQR

q1=dff.quantile(0.25)
q2=dff.quantile(0.5)
q3=dff.quantile(0.75)
iqr=q3-q1
iqr

OUTPUT:

image

18) Checking the high and low value

low=q1-1.5*iqr
low
high=q3+1.5*iqr
high

OUTPUT:

image

image

19) Filtering outlier value

dff=dff[((dff>=low)&(dff<=high))]
dff

OUTPUT:

image

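20) Dropping the null values

# Values outside the IQR bounds became NaN in step 19; assign the result so the drop persists
dff=dff.dropna()
dff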

OUTPUT:

image
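The IQR steps can also be written against a Series, which filters rows directly and avoids the intermediate NaN values. This is a sketch of the same rule on the same `age` list; the names `ages`, `kept`, and `outliers` are introduced here for illustration.

```python
import pandas as pd

age = [1, 3, 28, 27, 25, 92, 30, 39, 40, 50, 26, 24, 29, 94]
ages = pd.Series(age, name="age")

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Bounds: 1.5 * IQR below Q1 and above Q3
low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

# between() is inclusive on both ends by default
inside = ages.between(low, high)
kept = ages[inside]
outliers = ages[~inside]

print(q1, q3, iqr, low, high)  # 25.25 39.75 14.5 3.5 61.5
print(sorted(outliers))        # [1, 3, 92, 94]
```

Because the lower bound is 3.5, the value 3 is flagged along with 1, 92, and 94, leaving 10 of the 14 values.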

21) Box plotting after filtering outlier

sns.boxplot(data=dff)

OUTPUT:

image

22) Z Score

import pandas as pd
import seaborn as sns
import numpy as np
from scipy import stats
data={'weight':[12,15,18,21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57,60,63,66,69,202,72, 75, 78, 81, 84, 232, 87, 90, 93,96,99,258]}
ds=pd.DataFrame(data)
ds

OUTPUT:

image

24) Boxplot of the weight data

sns.boxplot(data=ds)

OUTPUT:

image

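25) Absolute Z scores

# Compute the Z score on the column itself so the result can be used directly as a row mask
z=np.abs(stats.zscore(ds['weight']))
z

OUTPUT:

image

26) Rows with a Z score above 3

print(ds[z>3])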

OUTPUT:

image
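The step above only prints the rows whose Z score exceeds 3. To actually remove them, the complementary mask can be kept, as in the sketch below. It computes the Z score by hand with the population standard deviation (`ddof=0`), which is the default that `scipy.stats.zscore` uses; the names `outliers` and `cleaned` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

weights = [12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57,
           60, 63, 66, 69, 202, 72, 75, 78, 81, 84, 232, 87, 90, 93, 96, 99, 258]
ds = pd.DataFrame({"weight": weights})

# Absolute Z score: distance from the mean in units of the population std (ddof=0)
col = ds["weight"]
z = np.abs((col - col.mean()) / col.std(ddof=0))

# Rows above the threshold are the detected outliers
outliers = ds[z > 3]
# Keeping the complement removes them from the data
cleaned = ds[z <= 3].reset_index(drop=True)

print(outliers)
print(len(ds), len(cleaned))
```

With this data and a threshold of 3, only 258 is flagged; 202 and 232 have Z scores of roughly 2.3 and 2.8 and stay in the data, because the three large values inflate the standard deviation. A lower threshold or the IQR rule would flag them as well.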

Result

Hence the data was cleaned, and outliers were detected and removed.