This project investigates shooting patterns and defensive performance in NBA games during the 2014-2015 season. By analyzing shot logs, we aim to understand key aspects such as shooting efficiency, defensive impact, and the prevalence of outliers. Our analysis provides insights that could aid coaches and team management in strategic decision-making.
- Shooting Efficiency:
  - How efficient were players in making their shots?
  - Were there trends based on shot distance, shot clock, or defender proximity?
- Defensive Impact:
  - What was the effect of the closest defender on shooting percentages?
  - What insights can be derived from defensive strategies employed by teams and players?
- Outlier Detection:
  - How prevalent were outliers and how did they affect the analysis?
  - How should outliers be addressed to ensure data integrity?
- Read data from `shot_logs.csv` into a pandas DataFrame.
- Handle null values using forward filling.
- Replace negative values in the `TOUCH_TIME` column with the mean of positive values.
- Remove unnecessary columns (e.g., `FGM`, which duplicates `SHOT_RESULT`).
- Perform label encoding on categorical columns.
- Convert `GAME_CLOCK` to seconds.
- Optionally handle outliers by calculating and removing values outside the lower and upper bounds.
- Generate correlation heatmaps and correlation circle plots to explore relationships between features.
- Detect and visualize outliers using box plots.
- Split data into 75% training and 25% evaluation sets.
- Train and evaluate four different models: Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), and Gradient Boosting.
- Implement two Map-Reduce tasks using PySpark:
  - Identify the highest-scoring players in the last 30 seconds of the last quarter.
  - Identify the top players who scored 3-point shots with 0 dribbles.
- Handling Null Values: Fill null values with forward filling.
- Handling Negative Touch Time: Replace negative values with the mean of positive values.
- Removing Unnecessary Columns: Drop the `FGM` column.
- Label Encoding: Encode the `W` column as binary (1 for W, 0 for L).
- Time Conversion: Convert `GAME_CLOCK` to seconds.
- Outlier Handling: Remove outliers based on calculated bounds.
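The preprocessing steps listed above can be sketched on a tiny, made-up stand-in for `shot_logs.csv`. This is a minimal illustration rather than the project's actual code: the sample values are invented, and the `MM:SS` format of `GAME_CLOCK` is an assumption.

```python
import pandas as pd

# Tiny stand-in for shot_logs.csv; the values are made up for illustration.
df = pd.DataFrame({
    "W": ["W", "L", "W", "L"],
    "GAME_CLOCK": ["1:09", "0:14", "11:30", "0:00"],  # "MM:SS" format is assumed
    "TOUCH_TIME": [1.9, -0.5, 2.1, None],
    "SHOT_RESULT": ["made", "missed", "made", "missed"],
    "FGM": [1, 0, 1, 0],
})

# Handling null values: forward fill.
df = df.ffill()

# Handling negative touch time: replace with the mean of the positive values.
positive_mean = df.loc[df["TOUCH_TIME"] > 0, "TOUCH_TIME"].mean()
df.loc[df["TOUCH_TIME"] < 0, "TOUCH_TIME"] = positive_mean

# Removing unnecessary columns: FGM duplicates SHOT_RESULT.
df = df.drop(columns=["FGM"])

# Label encoding of W: 1 for W, 0 for L.
df["W"] = (df["W"] == "W").astype(int)

# Time conversion: split "MM:SS" and convert to total seconds.
minutes_seconds = df["GAME_CLOCK"].str.split(":", expand=True).astype(int)
df["GAME_CLOCK"] = minutes_seconds[0] * 60 + minutes_seconds[1]
```

Because the forward fill runs first, the filled-in values also feed into the mean that replaces negative touch times; swapping the order of the two steps would change that mean slightly.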
The heatmap uses color to represent the strength of correlations between features. Darker colors indicate weaker correlations.
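A heatmap of this kind is a colored rendering of a correlation matrix. The sketch below computes such a matrix with pandas on made-up values; the column names `SHOT_DIST` and `CLOSE_DEF_DIST` are assumptions about the dataset, not names confirmed above.

```python
import pandas as pd

# Made-up numeric features; column names other than TOUCH_TIME are assumed.
df = pd.DataFrame({
    "SHOT_DIST": [2.0, 5.5, 12.0, 18.3, 24.1, 26.0],
    "CLOSE_DEF_DIST": [1.0, 2.5, 3.0, 4.2, 5.0, 6.1],
    "TOUCH_TIME": [0.8, 3.1, 1.2, 4.0, 0.9, 2.2],
})

# Pairwise Pearson correlations (the pandas default).
corr = df.corr()

# A plotting library can then color this matrix, for example
# seaborn.heatmap(corr, annot=True); which shade maps to which value
# depends on the chosen colormap.
print(corr.round(2))
```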
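Shots made from beyond the three-point line, roughly 23.9 feet from the basket (23 ft 9 in along the arc), count as 3 points. If fouled while scoring, a player also earns an extra free throw. This is offered as the reason about 8% of shots from less than 23.9 feet are recorded as 3 points; the line is also only 22 feet from the basket in the corners, so corner threes likely account for part of that share.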
The correlation circle plot represents the relationships between features. Features whose points lie closer together are more strongly correlated, which helps in understanding the interplay between them.
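One common way to obtain the coordinates for such a plot is a PCA of the standardized features, where each feature is placed at its correlation with the first two principal components. The sketch below assumes that construction and uses synthetic data; it is not necessarily how the project's plot was produced.

```python
import numpy as np

# Synthetic features: b closely tracks a, c is roughly independent of both.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

# Standardize, then eigendecompose the correlation matrix (PCA).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each row holds one feature's correlation with PC1 and PC2;
# these are the points drawn inside the unit circle.
coords = eigvecs[:, :2] * np.sqrt(eigvals[:2])

dist_ab = np.linalg.norm(coords[0] - coords[1])
dist_ac = np.linalg.norm(coords[0] - coords[2])
print(coords.round(2), dist_ab.round(3), dist_ac.round(3))
```

In this construction the strongly correlated pair `a` and `b` lands almost on the same point, while `c` sits far from both.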
Visualize outliers using box plots to identify and handle anomalies in the data.
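The whiskers of a standard box plot extend 1.5 times the interquartile range beyond the quartiles, and the same rule is a common choice for lower and upper bounds when removing outliers. The sketch below assumes that rule; the function name `iqr_bounds` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds using the k * IQR rule (assumed here)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up touch times with one obvious anomaly.
touch_time = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 24.0])

lower, upper = iqr_bounds(touch_time)

# Keep only the values inside the bounds.
cleaned = touch_time[touch_time.between(lower, upper)]
```

Calling `touch_time.plot.box()` would show the same anomalous point above the upper whisker, provided matplotlib is installed.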
Train a logistic regression model and evaluate its performance.
Train a random forest model and evaluate its performance.
Train a KNN model and evaluate its performance.
Train a gradient boosting model and evaluate its performance.
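The four training steps share the same pattern, sketched here with scikit-learn on synthetic data and the 75/25 split mentioned earlier. The library choice, hyperparameters, and accuracy as the metric are all assumptions; the real features would come from the preprocessed shot logs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features and made/missed labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# 75% training, 25% evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```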
Use PySpark to implement a Map-Reduce function to identify the top scoring players in the last 30 seconds of the last quarter.
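The map and reduce logic for this task can be sketched in plain Python, without a Spark cluster. The row layout (`player_name`, `PERIOD`, `GAME_CLOCK`, `PTS`), the reading of "last quarter" as period 4, and `GAME_CLOCK` as time remaining in the period are assumptions, and the sample rows are invented.

```python
from collections import defaultdict
from functools import reduce

# Made-up rows in the assumed shape (player_name, PERIOD, GAME_CLOCK, PTS).
shots = [
    ("Player A", 4, "0:25", 3),
    ("Player B", 4, "0:10", 2),
    ("Player A", 4, "0:05", 2),
    ("Player B", 4, "5:40", 3),  # too early in the quarter
    ("Player C", 2, "0:12", 3),  # not the last quarter
    ("Player C", 4, "0:29", 0),  # missed shot, contributes 0 points
]

def clock_to_seconds(clock):
    """Convert an "MM:SS" string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)

def mapper(row):
    """Emit (player, points) for shots in the last 30 seconds of period 4."""
    player, period, clock, pts = row
    if period == 4 and clock_to_seconds(clock) <= 30:
        return [(player, pts)]
    return []

def reducer(totals, pair):
    """Accumulate points per player."""
    player, pts = pair
    totals[player] += pts
    return totals

pairs = [pair for row in shots for pair in mapper(row)]
totals = reduce(reducer, pairs, defaultdict(int))
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

In PySpark the same `mapper` could be passed to an RDD's `flatMap`, followed by `reduceByKey(lambda a, b: a + b)` to sum the points per player.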
Use PySpark to implement a Map-Reduce function to identify top players who scored 3 points with 0 dribbles.
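A plain-Python sketch of this second task follows the same filter, map, and reduce pattern, counting made 3-point shots taken with 0 dribbles. The row layout (`player_name`, `DRIBBLES`, `PTS`) is an assumption, and the sample rows are invented.

```python
from functools import reduce

# Made-up rows in the assumed shape (player_name, DRIBBLES, PTS).
shots = [
    ("Player A", 0, 3),
    ("Player A", 0, 3),
    ("Player B", 0, 3),
    ("Player B", 2, 3),  # dribbled before shooting
    ("Player C", 0, 2),  # two-point shot
    ("Player A", 0, 0),  # missed shot
]

# Filter: keep shots worth 3 points taken with 0 dribbles.
qualifying = filter(lambda s: s[1] == 0 and s[2] == 3, shots)

# Map: emit (player, 1) for each qualifying shot.
mapped = map(lambda s: (s[0], 1), qualifying)

def add_count(counts, pair):
    """Reduce: sum the emitted ones per player."""
    player, one = pair
    counts[player] = counts.get(player, 0) + one
    return counts

counts = reduce(add_count, mapped, {})

# Rank by count, breaking ties alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top)
```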