This project involves an in-depth analysis of the Forest Fires dataset. The dataset includes various features related to forest fire occurrences, such as temperature, wind, rain, and area burned. The analysis includes statistical tests to determine the relationships between these features and visualizations to illustrate these relationships.
The Forest Fires dataset is available from the UCI Machine Learning Repository. It includes attributes like temperature, wind, rain, and the area affected by fires.
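As a minimal sketch of getting the data into pandas, the snippet below reads a CSV with the columns this document refers to (month, day, temp, wind, rain, area). The helper name `load_forest_fires` and the inline sample rows are illustrative assumptions; the rows are made-up values, not records from the real dataset.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real file would be downloaded from UCI.
SAMPLE_CSV = """month,day,temp,wind,rain,area
mar,fri,10.0,5.0,0.0,0.0
aug,sun,22.5,3.1,0.2,1.5
sep,sat,18.3,4.0,0.0,12.0
"""


def load_forest_fires(source):
    """Read the dataset from a file path or file-like object (hypothetical helper)."""
    return pd.read_csv(source)


df = load_forest_fires(io.StringIO(SAMPLE_CSV))
print(df[["temp", "wind", "rain", "area"]].describe())
```

With the real file on disk, the same helper would be called with its path instead of the in-memory sample.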
- Pandas: For data manipulation and analysis.
- Matplotlib: For creating static, animated, and interactive visualizations.
- Seaborn: For statistical data visualization.
- SciPy: For scientific and technical computing, including statistical tests.
A pairplot was generated to visualize the relationships between temp, wind, rain, and area.
A boxplot was created to show the distribution of the area affected by fires with respect to the wind variable.
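The boxplot might be drawn along these lines. The data here is synthetic, and restricting wind to a handful of distinct values is an assumption made only so the example stays readable.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Synthetic stand-in: a few distinct wind speeds and a skewed burned area.
df = pd.DataFrame({
    "wind": rng.choice([0.9, 2.2, 4.0, 5.4], size=120),
    "area": rng.exponential(10, 120),
})

fig, ax = plt.subplots(figsize=(8, 4))
# One box per wind value, showing the spread of burned area at that speed.
sns.boxplot(data=df, x="wind", y="area", ax=ax)
ax.set_xlabel("wind")
ax.set_ylabel("area")
```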
A lineplot was created to visualize the relationship between rain and temp.
A correlation matrix was generated to explore the relationships between the numerical features, excluding month and day.
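One way to compute such a matrix is to drop the two categorical columns and call `DataFrame.corr`, as sketched here on synthetic data. Pearson correlation is the pandas default; whether the original analysis used that default is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Synthetic stand-in data (made-up values).
df = pd.DataFrame({
    "month": rng.choice(["mar", "aug", "sep"], size=100),
    "day": rng.choice(["fri", "sat", "sun"], size=100),
    "temp": rng.normal(19, 6, 100),
    "wind": rng.uniform(0.4, 9.4, 100),
    "rain": rng.exponential(0.1, 100),
    "area": rng.exponential(10, 100),
})

# Exclude the categorical columns, then correlate the remaining numeric ones.
corr = df.drop(columns=["month", "day"]).corr()
print(corr.round(2))
```

The resulting frame can be passed to `seaborn.heatmap` for a visual version of the matrix.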
A histogram was created to visualize the distribution of the temp variable.
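A histogram like this could be drawn with Matplotlib alone. The sketch uses synthetic temperatures, and the choice of 20 bins is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(6)
temp = rng.normal(19, 6, 300)  # synthetic stand-in for the temp column

fig, ax = plt.subplots(figsize=(8, 4))
# counts holds the number of observations per bin, edges the bin boundaries.
counts, edges, _ = ax.hist(temp, bins=20, edgecolor="black")
ax.set_xlabel("temp")
ax.set_ylabel("count")
```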
The Shapiro-Wilk test was used to assess the normality of the temperature and wind data.
- Temperature
  - Statistic: 0.9868
  - p-value: 0.0001256
  - Interpretation: The data does not follow a normal distribution.
- Wind
  - Statistic: 0.9673
  - p-value: 2.493e-09
  - Interpretation: The data does not follow a normal distribution.
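The Shapiro-Wilk results above can be reproduced in form with `scipy.stats.shapiro`. The sketch below runs on a synthetic, deliberately skewed sample; the helper name `shapiro_report` and the 0.05 significance level are assumptions, since the document does not state the threshold used.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level


def shapiro_report(values, alpha=ALPHA):
    """Run Shapiro-Wilk and flag whether normality is rejected (hypothetical helper)."""
    statistic, p_value = stats.shapiro(values)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "reject_normality": bool(p_value < alpha),
    }


rng = np.random.default_rng(42)
skewed = rng.exponential(scale=1.0, size=500)  # clearly non-normal sample
report = shapiro_report(skewed)
print(report)
```

A p-value below the threshold, as for temperature and wind above, means the null hypothesis of normality is rejected.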
D'Agostino's K^2 test was also used to check the temperature and wind data for normality.
- Temperature
  - Statistic: 9.6946
  - p-value: 0.007849
  - Interpretation: The data does not follow a normal distribution.
- Wind
  - Statistic: 25.2421
  - p-value: 3.301e-06
  - Interpretation: The data does not follow a normal distribution.
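In SciPy, D'Agostino's K^2 test is available as `scipy.stats.normaltest`, which combines skewness and kurtosis into a single statistic. The following is a sketch on synthetic data, with the 0.05 threshold and the helper name `k2_report` as assumptions.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level


def k2_report(values, alpha=ALPHA):
    """Run D'Agostino's K^2 test and flag rejection of normality (hypothetical helper)."""
    statistic, p_value = stats.normaltest(values)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "reject_normality": bool(p_value < alpha),
    }


rng = np.random.default_rng(7)
skewed = rng.exponential(scale=1.0, size=500)  # strongly right-skewed sample
report = k2_report(skewed)
print(report)
```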
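The Anderson-Darling test was used to check whether the data follows a specified distribution, here the normal distribution. Normality is rejected when the test statistic exceeds the critical value.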
- Temperature
  - Statistic: 1.8121
  - Critical Values: [0.572, 0.651, 0.781, 0.911, 1.084]
  - Interpretation: The statistic exceeds every critical value, so the data does not follow a normal distribution.
- Wind
  - Statistic: 4.3432
  - Critical Values: [0.572, 0.651, 0.781, 0.911, 1.084]
  - Interpretation: The statistic exceeds every critical value, so the data does not follow a normal distribution.
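Unlike the two tests before it, the Anderson-Darling result is read by comparing the statistic against a list of critical values rather than a p-value. A sketch with `scipy.stats.anderson` on synthetic data follows; for `dist="norm"`, SciPy pairs the critical values with significance levels of 15%, 10%, 5%, 2.5% and 1%. The helper name `anderson_report` is an assumption.

```python
import numpy as np
from scipy import stats


def anderson_report(values):
    """Return (significance level %, critical value, rejected?) per level (hypothetical helper)."""
    result = stats.anderson(values, dist="norm")
    return [
        (float(level), float(critical), bool(result.statistic > critical))
        for level, critical in zip(result.significance_level, result.critical_values)
    ]


rng = np.random.default_rng(8)
skewed = rng.exponential(scale=1.0, size=500)  # clearly non-normal sample
report = anderson_report(skewed)
for level, critical, rejected in report:
    print(f"{level:>4}%  critical={critical:.3f}  reject normality={rejected}")
```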
A t-test was conducted to compare the means of two independent groups.
- Statistic: 0.0974
- p-value: 0.9224
- Interpretation: There is no significant difference between the two groups.
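An independent two-sample t-test can be run with `scipy.stats.ttest_ind`. The document does not say which groups were compared, so `group_a` and `group_b` below are purely hypothetical synthetic samples, and the equal-variance default is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Hypothetical groups with clearly different means (made-up values).
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=3.0, scale=1.0, size=100)

# Default is the classic Student t-test (equal variances assumed);
# pass equal_var=False for Welch's variant.
statistic, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {statistic:.4f}, p = {p_value:.4g}")
```

A large p-value, such as the 0.9224 reported above, gives no evidence that the group means differ.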
A Chi-Square test was used to examine the association between categorical variables.
- Statistic: 64.2383
- p-value: 0.5384
- Interpretation: There is no significant association between the variables.
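A Chi-Square test of association is usually run on a contingency table. The sketch below builds one with `pandas.crosstab` and passes it to `scipy.stats.chi2_contingency`; using month and day as the two categorical variables is an assumption (they are the only categorical columns the document names), and the data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(5)
# Synthetic, independently drawn categories (made-up values).
df = pd.DataFrame({
    "month": rng.choice(["mar", "aug", "sep"], size=300),
    "day": rng.choice(["fri", "sat", "sun"], size=300),
})

# Rows are months, columns are days, cells are observed counts.
table = pd.crosstab(df["month"], df["day"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}, dof = {dof}")
```

The degrees of freedom equal (rows - 1) x (columns - 1) of the contingency table.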
The Pearson correlation test measures the strength and direction of the linear relationship between two continuous variables, here temperature and area burned.
- Correlation: 0.0978
- p-value: 0.0261
- Interpretation: There is a weak but statistically significant positive correlation.
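The Pearson coefficient and its p-value can be obtained with `scipy.stats.pearsonr`. This sketch uses synthetic variables with a built-in linear relationship, so its output is far stronger than the weak correlation reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
# Synthetic data: y depends linearly on x plus noise (made-up values).
x = rng.normal(0.0, 1.0, 200)
y = 2.0 * x + rng.normal(0.0, 1.0, 200)

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p = {p_value:.4g}")
```

Note that a significant p-value only says the correlation is unlikely to be zero; with r near 0.1 the relationship explains about 1% of the variance.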
A non-parametric rank correlation test was used to assess the strength and direction of the monotonic association between two variables.
- Correlation: -0.0242
- p-value: 0.5827
- Interpretation: There is no significant correlation.
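The document does not name the rank-based test. Assuming it is Spearman's rank correlation, it could be computed with `scipy.stats.spearmanr` as sketched below on synthetic data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
# Synthetic data with a monotonic but non-linear relationship.
x = rng.normal(0.0, 1.0, 200)
y = np.exp(x)

# Spearman works on ranks, so any strictly increasing transform gives rho = 1.
rho, p_value = stats.spearmanr(x, y)
print(f"rho = {rho:.4f}, p = {p_value:.4g}")
```

Because it uses ranks only, this test does not require the normality that the earlier tests rejected.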
This test compares the distributions of two samples.
- Statistic: 0.1124
- p-value: 0.1917
- Interpretation: There is no significant difference in the distributions.
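The test behind these numbers is not named here. Assuming a two-sample Kolmogorov-Smirnov test, which compares the empirical distributions of two samples, a sketch with `scipy.stats.ks_2samp` looks like this. Both samples are synthetic and hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Hypothetical samples drawn from visibly different distributions.
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=2.0, scale=1.0, size=200)

# The statistic is the largest gap between the two empirical CDFs (0 to 1).
statistic, p_value = stats.ks_2samp(sample_a, sample_b)
print(f"D = {statistic:.4f}, p = {p_value:.4g}")
```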
The analysis of the Forest Fires dataset showed that temperature and wind do not follow a normal distribution, as indicated by the Shapiro-Wilk, D'Agostino's K^2, and Anderson-Darling tests. The t-test and Chi-Square test showed no significant difference or association between the tested groups and variables. However, the Pearson correlation test found a weak but statistically significant positive correlation (r = 0.0978, p = 0.0261) between temperature and area burned. This information can guide further data preprocessing and modeling efforts.