This project explores the complexities of gun violence in the United States using statistical techniques and machine learning algorithms. The analysis aims to uncover patterns, characteristics, and relationships within a comprehensive dataset of gun-related incidents. This study employs K-Means Clustering and Principal Component Analysis (PCA) to identify distinct clusters and trends, offering a detailed understanding of the incidents to inform preventive strategies.
- Introduction
- Data Description
- Feature Engineering
- Model Selection & Methodology
- Results
- Conclusion & Future Scope
- How to Run
- License
Gun violence is a significant societal issue that requires a thorough understanding of incidents to inform preventive strategies. This study uses a dataset of gun-related incidents to extract valuable insights. Through advanced statistical techniques and visualizations, the analysis aims to uncover patterns, characteristics, and relationships behind recorded incidents, extending previous efforts in curating this dataset.
- Identify patterns, trends, and clusters using feature engineering and clustering.
- Provide insights to guide targeted interventions.
- Contribute to informed decision-making for mitigating gun violence.
The dataset, sourced from Kaggle, is titled “Gun Violence Data: Comprehensive record of over 260k US gun violence incidents from 2013-2018”. It provides detailed information on each incident, such as weapon type, number of victims, and relationships among participants.
Key features engineered from the dataset include:
- Lethality: Weapon type and its lethality.
- Participants: Number of victims and suspects.
- Kill-Death-Assist: Counts of participants who were injured, killed, unharmed and arrested, or unharmed.
- Age Profile: Classification of participants by age group.
- Gender Ratio: Female percentage of participants.
- Relationship Status: Relationships between perpetrators and victims.
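As an illustration of one of these features, the Gender Ratio could be derived roughly as follows. This is a minimal sketch, not the project's actual `scripts/feature_engineering.R`: the column name `participant_gender`, its `"0::Male||1::Female"` encoding, and the helper `female_pct` are assumptions made for the example.

```r
# Hypothetical helper: percentage of female participants in one incident.
# Assumes the raw field is encoded like "0::Male||1::Female||2::Female".
female_pct <- function(gender_field) {
  if (is.na(gender_field) || gender_field == "") return(NA_real_)
  # Split the field into one entry per participant
  entries <- strsplit(gender_field, "||", fixed = TRUE)[[1]]
  # Drop the "index::" prefix so only the gender label remains
  genders <- sub("^[0-9]+::", "", entries)
  100 * mean(genders == "Female")
}

# Toy rows standing in for the real incident table (assumed column name)
incidents <- data.frame(
  participant_gender = c("0::Male||1::Female",
                         "0::Male",
                         "0::Female||1::Female||2::Male",
                         ""),
  stringsAsFactors = FALSE
)

incidents$female_pct <- vapply(incidents$participant_gender, female_pct,
                               numeric(1), USE.NAMES = FALSE)
print(incidents)
```

Rows with no gender information yield `NA`, so they can be dropped later by `na.omit()` or handled separately.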
Four clustering algorithms were tested: DBSCAN, K-Median Clustering, Hierarchical Clustering, and K-Means Clustering. The final model was K-Means Clustering combined with PCA to reduce dimensionality.
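The combination of PCA and K-Means can be sketched with base R alone. The example below runs on synthetic data; the feature names, the choice to keep two components, and `centers = 2` are illustrative assumptions, not the settings used in the project.

```r
set.seed(42)

# Synthetic stand-in for the engineered features: two well-separated groups
features <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 5), ncol = 4)
)
colnames(features) <- c("n_guns", "n_victims", "female_pct", "mean_age")

# PCA on centered, standardized features
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the scores on the first two components (assumed cut-off)
scores <- pca$x[, 1:2]

# K-Means on the reduced data; nstart reruns from several random starts
km <- kmeans(scores, centers = 2, nstart = 25)

print(round(var_explained, 3))
print(table(km$cluster))
```

Standardizing before PCA keeps features measured on large scales from dominating the components, which matters when counts and percentages are mixed.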
The optimal number of clusters was determined through iterative testing and analysis of the total within-cluster sum of squares and the gap statistic. The best value of 'K' was found to be 9.
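The within-cluster sum of squares part of this search can be sketched as an "elbow" scan over candidate values of K. The data below is synthetic and the range `1:8` is an arbitrary assumption; this does not reproduce the project's result of K = 9.

```r
set.seed(1)

# Synthetic data with three groups (illustrative only)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

# Total within-cluster sum of squares for each candidate K
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# Relative improvement gained by moving from K to K + 1
rel_drop <- -diff(wss) / wss[-length(wss)]

print(data.frame(k = k_values, wss = round(wss, 1)))
print(round(rel_drop, 3))
```

The "elbow" is the value of K after which the relative improvement flattens out. For the gap statistic, the `cluster` package listed in the requirements provides `clusGap()`, and `factoextra` offers `fviz_nbclust()` for plotting either criterion.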
The application of PCA and K-Means clustering revealed distinct groupings within the gun violence dataset. The identified clusters provide a granular understanding of the incidents, ranging from smaller-scale urban conflicts to organized crime and extreme outlier events.
- Cluster 1: Urban Turbulence - Smaller-scale urban conflicts.
- Cluster 2: Ruthless Warfare - High gun involvement in organized crime.
- Cluster 3: Stealthy Offenders - Strategic criminal activities with firearms.
- Cluster 4: Extreme Outliers - Rare incidents with exceptionally high gun use.
- Cluster 5: Strained Relationships - Conflicts among acquaintances with guns.
- Cluster 6: Domestic Disturbance - Incidents within families.
- Cluster 7: Isolated Incidents - Infrequent events with moderate violence.
- Cluster 8: Tense Workplace - Conflicts in professional settings with guns.
- Cluster 9: Teen Turmoil - Conflicts involving teenagers with moderate violence.
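One common way to arrive at descriptive labels like these is to compare the average feature values per cluster. The sketch below assumes that approach and uses toy data with hypothetical columns (`n_guns`, `n_killed`, `cluster`); it is not the project's actual profiling code.

```r
set.seed(7)

# Toy incidents with a cluster assignment already attached (hypothetical columns)
toy <- data.frame(
  n_guns   = c(rpois(30, 1),   rpois(30, 8)),
  n_killed = c(rpois(30, 0.2), rpois(30, 2)),
  cluster  = rep(c(1, 2), each = 30)
)

# Mean of each feature within each cluster
profile <- aggregate(cbind(n_guns, n_killed) ~ cluster, data = toy, FUN = mean)
print(profile)
```

Reading such a table row by row shows which features distinguish each cluster, for example an unusually high average gun count.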
This comprehensive analysis of gun violence incidents in the United States successfully utilized advanced statistical techniques and clustering methodologies to extract meaningful insights. The findings advocate for targeted interventions and future extensions of the model to further inform preventive strategies.
- R (version 4.0 or later)
- RStudio (optional but recommended)
- Required R packages: dplyr, factoextra, purrr, ggplot2, reshape2, stargazer, cluster, corrplot
- Clone the repository:

      git clone https://github.com/arhamanwar/gun-violence-analysis.git
      cd gun-violence-analysis
- Install the required packages:

      install.packages(c("dplyr", "factoextra", "purrr", "ggplot2", "reshape2", "stargazer", "cluster", "corrplot"))
- Load the dataset and preprocess:

      # Load necessary libraries
      library(dplyr)
      library(factoextra)
      library(purrr)
      library(ggplot2)
      library(reshape2)
      library(stargazer)
      library(cluster)
      library(corrplot)

      # Read and preprocess the dataset
      file_path <- "data/gun_violence.csv"
      gun <- read.csv(file_path)
      gun <- subset(gun, n_guns_involved != 0)
      gun <- na.omit(gun)

      # Feature engineering
      source("scripts/feature_engineering.R")

      # Save preprocessed data
      write.csv(gun, file = "data/gun_preprocessed.csv", row.names = FALSE)
- Perform clustering analysis:

      # Load preprocessed data
      gun <- read.csv("data/gun_preprocessed.csv")

      # Clustering and PCA
      source("scripts/clustering_analysis.R")
- Generate visualizations and results:

      # Generate visualizations
      source("scripts/visualizations.R")
Run the following command in your R console to execute the entire analysis pipeline:

    source("scripts/run_analysis.R")
## 6. License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.