Data Science Fundamentals
Prepared by Taha Er
- Get to know the dataset
- Summary statistics (mean, median, std, etc.)
- Data distribution and visualization (histogram, boxplot, scatter plot, etc.)
- Identify and impute missing values
- Detect and handle outliers
- Correct erroneous or inconsistent data
- Remove unnecessary or redundant features
- Apply transformations such as logarithm (log10 or ln), reciprocal (1/x), etc.
- Scaling (StandardScaler, MinMaxScaler, etc.)
- Label Encoding
- One-Hot Encoding
- Target Encoding
- Create new features (feature generation)
- Create feature interactions
- Generate time-based features for time series data
- Feature importance scores
- Recursive Feature Elimination (RFE)
- Principal Component Analysis (PCA)
- Create training and test sets (train-test split)
- Apply data augmentation if necessary
- Balance the training set when classes are imbalanced (using methods like SMOTE); leave the test set untouched
- How missing values are imputed and how categorical features are encoded can significantly affect model performance.
- Transformations such as the logarithm can reduce skew and bring distributions closer to normal, which helps many models learn.
- Re-explore the data after all these steps to confirm it is ready for modeling.
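
Several of the steps above (imputing missing values, handling outliers, applying a log transformation, scaling, one-hot encoding, and a train-test split) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation. The toy records, the column names `income` and `city`, the IQR multiplier `k=1.5`, the 25% test share, and all function names are assumptions made for the example.

```python
import math
import random
import statistics

# Hypothetical toy records; None marks a missing value.
rows = [
    {"income": 32000.0, "city": "Ankara"},
    {"income": 41000.0, "city": "Izmir"},
    {"income": None, "city": "Ankara"},
    {"income": 38000.0, "city": "Bursa"},
    {"income": 36000.0, "city": "Izmir"},
    {"income": 950000.0, "city": "Ankara"},  # deliberate outlier
]


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def clip_iqr(values, k=1.5):
    """Clip values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(labels):
    """Return the sorted category list and one indicator row per label."""
    categories = sorted(set(labels))
    return categories, [[int(label == c) for c in categories] for label in labels]


def split_train_test(items, test_ratio=0.25, seed=0):
    """Shuffle a copy of the items and hold out the last share as the test set."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[:-n_test], shuffled[-n_test:]


# Numeric column: impute, clip the outlier, log-transform, then scale.
income = impute_median([r["income"] for r in rows])
income = clip_iqr(income)
income = [math.log(v) for v in income]
income = standardize(income)

# Categorical column: one indicator column per city.
categories, city = one_hot([r["city"] for r in rows])

# One feature row per record, then a reproducible train-test split.
features = [[x] + flags for x, flags in zip(income, city)]
train, test = split_train_test(features)
```

In practice these steps are usually done with pandas and scikit-learn (for example `StandardScaler` and `MinMaxScaler`, which the list above names). The sketch follows the order of the list for readability; on real data, imputation and scaling statistics are normally computed on the training set only and then applied to the test set, so that no information from the test set leaks into training.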