This is a Python package that performs exploratory data analysis for users. It takes in a csv file and generates 3 outputs: a text report containing a descriptive summary, a series of plots and a cleaned csv file.
- Python 3.8.x
- matplotlib==3.1.2
- numpy==1.18.1
- pandas==1.0.0
- PySimpleGUI==4.19.0
- scikit-learn==0.22.1
- scipy==1.4.1
- seaborn==0.10.0
- statsmodels==0.11.1
- more-itertools==8.3.0
-
You can clone or download my package.
-
Using terminal, move to the directory.
- Example for Mac OS users:
$ cd Downloads/Edator
-
Install the required packages using:
pip install -r requirements.txt
-
After that, change directory into the Script folder using:
$ cd Script
-
Now, execute the main.py file by:
$ python main.py
-
Choose the csv file to analyse, then choose the paths where the plots, the report and the cleaned csv file should be exported.
-
Done!
To deal with NaN values, I remove the affected rows only when the percentage of NaN within that column is less than 5%. This applies to both numerical and categorical columns. For anything above 5%, I replace the NaN values with the median in numerical columns and with the mode in categorical columns.
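The rule above can be sketched roughly as follows. This is a minimal illustration, not the code in the package: the function name `handle_missing` and the `threshold` parameter are hypothetical, and columns sitting at exactly 5% are imputed here by assumption.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Hypothetical helper: drop rows for rarely-missing columns, impute the rest."""
    df = df.copy()
    # Fraction of NaN per column, measured once on the original data.
    nan_fraction = df.isna().mean()

    # Columns with some NaN, but below the threshold: drop the affected rows.
    low = [col for col in df.columns if 0 < nan_fraction[col] < threshold]
    if low:
        df = df.dropna(subset=low)

    # Remaining columns with NaN: median for numerical, mode for categorical.
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            # mode() can return several values; take the first one.
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```

Measuring the NaN fraction before dropping any rows keeps the decision for each column independent of the order in which columns are processed.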
Dealing with zeros is much harder as it is challenging to differentiate between a zero that is meaningful (has a purpose and should not be removed) and a zero that serves no purpose and can potentially add more noise to the dataset. Hence, I decided to inform the user about the percentage of zeros in the dataset.
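As an illustration, the share of zeros could be reported per numerical column along these lines. The function name `zero_percentage` is hypothetical, and reporting per column rather than for the dataset as a whole is an assumption.

```python
import pandas as pd


def zero_percentage(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: percentage of zeros in each numerical column."""
    numeric = df.select_dtypes(include="number")
    # (numeric == 0) is a boolean frame; its column mean is the share of zeros.
    return (numeric == 0).mean() * 100
```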
I use Z-score to detect outliers. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.
In most cases, a threshold of 3 or -3 is used to filter out outliers, and I have used this approach for all of my analysis: a data point with a Z-score above 3 or below -3 is treated as an outlier.
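A minimal sketch of Z-score filtering with a threshold of 3 is shown below. It computes the Z-score directly with pandas; the function name `remove_outliers` is hypothetical, and dropping the whole row when any numerical column exceeds the threshold is an assumption about how the filter is applied.

```python
import pandas as pd


def remove_outliers(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Hypothetical helper: drop rows with |Z-score| above the threshold."""
    numeric = df.select_dtypes(include="number")
    # Z-score: distance from the column mean in units of standard deviation.
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    # A row is an outlier if any numerical column is beyond the threshold.
    # NaN Z-scores (e.g. constant columns) compare as False and are kept.
    is_outlier = (z.abs() > threshold).any(axis=1)
    return df[~is_outlier]
```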
For correlation, I included:
- Pearson and Spearman correlation for numerical-numerical variables.
- One-way ANOVA for numerical-categorical variables.
- Chi-Square test for categorical-categorical variables.
Using itertools.combinations, I identify every possible combination among numerical-numerical variables, numerical-categorical variables and categorical-categorical variables. I then apply the correlation test that matches each pairing, as listed above.
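The pairing logic might look roughly like this sketch, which uses scipy.stats for the tests. The function name `run_tests` and the shape of the returned dictionary are hypothetical, and `itertools.product` is used here for the numerical-categorical pairs as a convenience, which may differ from the package.

```python
from itertools import combinations, product

import pandas as pd
from scipy import stats


def run_tests(df: pd.DataFrame) -> dict:
    """Hypothetical helper: pick a test for each pair of columns by type."""
    num_cols = list(df.select_dtypes(include="number").columns)
    cat_cols = [col for col in df.columns if col not in num_cols]
    results = {}

    # Numerical-numerical: Pearson and Spearman correlation coefficients.
    for a, b in combinations(num_cols, 2):
        results[(a, b)] = {
            "pearson": stats.pearsonr(df[a], df[b])[0],
            "spearman": stats.spearmanr(df[a], df[b])[0],
        }

    # Numerical-categorical: one-way ANOVA across the category groups.
    for num, cat in product(num_cols, cat_cols):
        groups = [group.to_numpy() for _, group in df.groupby(cat)[num]]
        results[(num, cat)] = {"anova_p": stats.f_oneway(*groups)[1]}

    # Categorical-categorical: Chi-Square test on the contingency table.
    for a, b in combinations(cat_cols, 2):
        table = pd.crosstab(df[a], df[b])
        results[(a, b)] = {"chi2_p": stats.chi2_contingency(table)[1]}

    return results
```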
For plots, I created:
- Scatterplot for numerical variables
- Countplot for categorical variables
- Boxplot for numerical-categorical variables
Similar to correlation, I used itertools.combinations to create every possible plot. I have also added the hue feature to each scatterplot, but only when the categorical variable has fewer than 5 unique values. For example, if hue = "fruits", the plot should show at most 4 types of fruit.
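The hue rule can be sketched without drawing anything, by listing which scatterplots would be produced. The function name `scatter_specs` is hypothetical, and including a plain scatterplot (hue of None) for every pair is an assumption.

```python
from itertools import combinations

import pandas as pd


def scatter_specs(df: pd.DataFrame, max_levels: int = 5) -> list:
    """Hypothetical helper: list (x, y, hue) combinations for scatterplots."""
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(exclude="number").columns
    # Only categorical columns with fewer than max_levels unique values qualify.
    hue_cols = [col for col in cat_cols if df[col].nunique() < max_levels]

    specs = []
    for x, y in combinations(num_cols, 2):
        specs.append((x, y, None))  # plain scatterplot (assumed)
        specs.extend((x, y, hue) for hue in hue_cols)
    return specs


# Each spec could then be drawn with seaborn, for example:
#   sns.scatterplot(data=df, x=x, y=y, hue=hue)
```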
- Upon obtaining sufficient feedback on this script, I will register this package in PyPI to streamline installation.
- Instead of generating txt reports, I will utilise HTML and Bootstrap to generate a much more appealing look.