Assignment 1

Data Loading and Initial Analysis

Loading the Dataset and Dictionary:

We began by loading the main dataset from a CSV file and its corresponding data dictionary from an Excel file. The dataset contains information on customer transactions, while the data dictionary explains each feature's purpose.

data_df = pd.read_csv('/path/to/Project1Data.csv')
data_dict_df = pd.read_excel('/path/to/Data_Dictionary_Project1.xlsx')

The columns are:

Column Name
CustomerID
Has_Transaction_Nov23_May24
Total_Spent_Nov23_May24
Total_Spent_1M
Transaction_Count_1M
Total_Spent_3M
Transaction_Count_3M
Total_Spent_6M
Transaction_Count_6M
Total_Spent_9M
Transaction_Count_9M
Total_Spent_12M
Transaction_Count_12M
Total_Spent_AllTime
Transaction_Count_AllTime
Brand Description_Anthelios
Brand Description_Bundle
Brand Description_Cicaplast
Brand Description_Eau Thermale
Brand Description_Effaclar
Brand Description_Hyalu B5
Brand Description_Lipikar
Brand Description_Niacinamide
Brand Description_Retinol LRP
Brand Description_Serozinc
Brand Description_Toleriane
Brand Description_Uvidea
Brand Description_Vitamin C
Class Description_Anti-Acne
Class Description_Anti-Ageing
Class Description_Body Care
Class Description_Bundle
Class Description_Face Care
Category_Body Care
Category_Face Care
Category_Face Care
Category_Sun Care
Sub-Category_Body Moisturiser
Sub-Category_Body Wash
Sub-Category_Eye Cream
Sub-Category_Face Cleanser
Sub-Category_Face Mask
Sub-Category_Face Moisturiser
Sub-Category_Face Serum
Sub-Category_Sunscreen
Sub-Category_Tinted Sunscreen
Sub-Category_Toner & Mist
Skin Concern_Acne-Prone Skin
Skin Concern_Anti-Ageing
Skin Concern_Irritation-Prone Skin
Skin Concern_Pigmentation and Dark Spots
Skin Concern_Sun Protection
EAN_TolerianeMoisturiser40ml
EAN_HyaluB5Serum30ml
EAN_EffaclarMoisturiser40ml
EAN_VitaminC10Serum30ml
EAN_RetinolB3Serum30ml
EAN_EffaclarSerum30ml
EAN_AntheliosInvisibleSunscreen50ml
EAN_Pure10NiacinamideSerum
EAN_CicaplastB5BaumeBothSKUs40ml

Data Cleaning:
- After loading the data, we removed columns that were marked as "Remove" in the data dictionary. These columns were deemed unnecessary or irrelevant for our analysis, so they were excluded from the dataset.
```
columns_to_remove = data_dict_df[data_dict_df['Type'] == 'Remove']['Variable'].tolist()
cleaned_data_df = data_df.drop(columns=columns_to_remove)
```
Distribution of Transaction Values:
- We visualized the distribution of the Total_Spent_Nov23_May24 variable, which represents the total amount spent by customers between November 2023 and May 2024. This histogram helps us understand the spread and skewness of the transaction values, providing insights into how most customers are spending.
```
plt.figure(figsize=(10, 6))
sns.histplot(cleaned_data_df['Total_Spent_Nov23_May24'], kde=True, bins=30)
plt.title('Distribution of Total Spent (Nov23-May24)')
plt.xlabel('Total Spent')
plt.ylabel('Frequency')
plt.savefig('Distribution of Total Spent (Nov23-May24).png')
```

Correlation Analysis:

A function named correlation_calculate() was defined to compute the correlation matrix of the cleaned dataset. The function specifically identifies the top 15 features most correlated with Total_Spent_Nov23_May24.
We visualized these correlations using a heatmap, highlighting the features that are most predictive of the total amount spent by customers. This heatmap serves as a crucial step in identifying the most important predictors for our subsequent analysis.

def correlation_calculate():
    correlation_matrix = cleaned_data_df.corr()
    top_correlations = correlation_matrix['Total_Spent_Nov23_May24'].sort_values(ascending=False).head(15)
    
    plt.figure(figsize=(8, 10))
    sns.heatmap(correlation_matrix[['Total_Spent_Nov23_May24']].sort_values(by='Total_Spent_Nov23_May24', ascending=False), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Top Correlations with Total Spent (Nov23-May24)')
    plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
    plt.show()

    return top_correlations

UTSJiyaoLi/Ass1_24s2

Assignment 1

Data Loading and Initial Analysis

Feature A: `Sub-Category_Face Serum`

UTSJiyaoLi/Ass1_24s2

Assignment 1

Data Loading and Initial Analysis

Feature A: Sub-Category_Face Serum

Feature A: `Sub-Category_Face Serum`