NBA-POTW-Regression: An HTML repository from Shirleyiscool

USF MSDS 601: Linear Regression Analysis

Project: NBA Player of the Week

Team Members

Shirley Li (@Shirleyiscool)
Charles Siu (@chunheisiu)

Description of Dataset

Our dataset combines the following NBA datasets:

NBA Player of the Week (1985 - 2019)
https://www.kaggle.com/jacobbaruch/nba-player-of-the-week
1,187 Rows, 14 Columns
NBA Player Salary from basketball-reference.com (1991 - 2017)
https://www.kaggle.com/whitefero/nba-player-salary-19902017
11,837 Rows, 7 Columns
NBA Player Salary from basketball-reference.com (2018 - 2019)
https://web.archive.org/web/20181002194236/www.basketball-reference.com/contracts/players.html
578 Rows, 11 Columns
NBA Player Statistics (1985 - 2019)
https://www.basketball-reference.com/leagues/
18,480 Rows, 30 Columns
NBA Yearly Summary (1985 - 2019)
https://www.basketball-reference.com/leagues/
35 Rows, 8 Columns

Combining the aforementioned datasets, we created a dataset in which each row is an NBA player per season and each column is a statistic of the player. We filtered the rows so that only players who have both statistics and salary data for that particular season are included.

There are 9,003 Rows and 38 Columns in the dataset.
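The filtering step described above amounts to an inner join on player and season. The following is a minimal sketch of that idea in plain Python, not the project's actual code: the rows are made-up toy data, `inner_join` is a hypothetical helper, and it assumes each (Player, Year) pair appears at most once in the salary table.

```python
# Toy rows only: the names and numbers below are made up for illustration.
stats = [
    {"Player": "Player A", "Year": 2018, "PTS": 20.1},
    {"Player": "Player A", "Year": 2019, "PTS": 22.4},
    {"Player": "Player B", "Year": 2019, "PTS": 8.3},
]
salaries = [
    {"Player": "Player A", "Year": 2019, "Salary": 15_000_000},
    {"Player": "Player C", "Year": 2019, "Salary": 2_000_000},
]


def inner_join(left, right, keys=("Player", "Year")):
    """Keep only rows whose key appears in both tables, merging their columns."""
    # Index the right table by its key (assumes keys are unique on the right).
    right_by_key = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**row, **match})
    return merged


player_seasons = inner_join(stats, salaries)
print(player_seasons)
```

Only the 2019 season of "Player A" survives, because it is the only player-season present in both toy tables.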

Index of the Dataset

Variable	Definition	Type
`Year`	Season, labeled by its ending year (e.g. `1991` means the NBA 1990 - 1991 Season)	Numerical
`Player`	Player name	Categorical

Variables of the Dataset

Variable	Definition	Type
`Pos`	Player Position	Categorical
`Age`	Age of Player at the start of February 1st of that season	Numerical
`Tm`	Team of Player	Categorical
`G`	Number of games played	Numerical
`GS`	Number of games started	Numerical
`MP`	Minutes played per game	Numerical
`FG`	Field Goals per game	Numerical
`FGA`	Field Goal attempts per game	Numerical
`FG_Prct`	Field Goal percentage	Numerical
`Three_P`	3-Point Field Goals per game	Numerical
`Three_PA`	3-Point Field Goal attempts per game	Numerical
`Three_P_Prct`	3-Point Field Goal percentage	Numerical
`Two_P`	2-Point Field Goals per game	Numerical
`Two_PA`	2-Point Field Goal attempts per game	Numerical
`Two_P_Prct`	2-Point Field Goal percentage	Numerical
`eFG_Prct`	Effective Field Goal percentage	Numerical
`FT`	Free Throws per game	Numerical
`FTA`	Free Throw attempts per game	Numerical
`FT_Prct`	Free Throw percentage	Numerical
`ORB`	Offensive Rebounds per game	Numerical
`DRB`	Defensive Rebounds per game	Numerical
`TRB`	Total Rebounds per game	Numerical
`AST`	Assists per game	Numerical
`STL`	Steals per game	Numerical
`BLK`	Blocks per game	Numerical
`TOV`	Turnovers per game	Numerical
`PF`	Personal Fouls per game	Numerical
`PTS`	Points per game	Numerical
`Potw`	Was the player named Player of the Week during the season?	Binary
`APG_Leader`	Was the player named Assists Per Game Leader during the season?	Binary
`MVP`	Was the player named Most Valuable Player during the season?	Binary
`PPG_Leader`	Was the player named Points Per Game Leader during the season?	Binary
`RPG_Leader`	Was the player named Rebounds Per Game Leader during the season?	Binary
`Rookie`	Was the player named Rookie of the Year during the season?	Binary
`WS Leader`	Was the player named Win Shares Leader during the season?	Binary
`Salary`	Player Salary	Numerical

Statement of Research Problems and Methods

Using the dataset, we formulated two main research problems:

What player statistic contributes the most to the event that the player is named Player of the Week?
Since whether a player is named Player of the Week is a binary variable, we decided to approach this problem using the logistic regression model.
What NBA title, including Player of the Week, has the most weight on the salary of the player?
Since the salary of a player is a numerical variable, we decided to approach this problem using the multiple linear regression model.

For both problems, model selection was performed to find the optimal model, and model diagnosis was performed to mitigate the possible issues of heteroscedasticity, multicollinearity and autocorrelation.

Problem 1: Relationship between Player Statistics and Player of the Week

Exploratory Analysis

After extracting the relevant player statistics and Player of the Week from the dataset, we plotted the relationship between the statistics and Player of the Week using a scatter plot.

Since Potw is a binary variable, the scatter plot did not give us much useful information apart from the difference in the range of statistic values between Potw = 0 and Potw = 1. For every statistic, the range of values seems to be smaller for Potw = 1, with the most pronounced difference in eFG_Prct.

This discrepancy in range is also evident in the difference in frequency between Potw = 0 and Potw = 1.

Potw	Count	Prct
0	8505	0.944685
1	498	0.0553149

The frequency table shows that Potw = 0 accounts for about 94% of the data, which is to be expected since the number of players receiving an award will always be much smaller than the number who do not. However, we are not sure whether this imbalance would affect the reliability of the models we build in regression analysis.
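One way to make the imbalance concrete is to look at a trivial classifier that always predicts Potw = 0. The sketch below recomputes the shares from the counts in the frequency table; the majority-class baseline is an illustration added here, not a figure reported by the analysis.

```python
# Counts copied from the Potw frequency table.
counts = {0: 8505, 1: 498}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts Potw = 0 is correct whenever the true label is 0,
# so its accuracy equals the share of the majority class.
majority_baseline_accuracy = max(shares.values())

print(total, shares, majority_baseline_accuracy)
```

Any fitted model therefore needs to be judged against roughly 94% accuracy rather than 50%, which is why accuracy alone can be a misleading measure on this dataset.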

We also plotted the correlation using a heatmap.

Observing the heatmap, there is evidence that multicollinearity might exist. For example, the most correlated variables are Two_P_Prct and FG_Prct, which is to be expected since 2-point field goals are a component of FG_Prct. Similarly, eFG_Prct is derived from FG_Prct, so the correlation between them is high. Hence, some of these variables, specifically those that have direct relationships, will need to be removed prior to regression analysis.

Meanwhile, TOV, AST and STL are highly correlated with one another. Turnovers, assists and steals are all plays often made by point guards, so there might be indirect relationships between these variables. Nonetheless, these correlations need to be addressed in regression analysis.

Regression Analysis

Model Selection

As Pos is a categorical variable, we first create dummy variables for this predictor.
Since some statistics are calculated from other statistics, there would be strong multicollinearity if we included all of them. Therefore, we drop the following statistics, each of which is determined by other statistics, from our first model.

TRB = ORB + DRB
FGA = FG / FG_Prct
Three_PA = Three_P / Three_P_Prct
Two_PA = Two_P / Two_P_Prct
FTA = FT / FT_Prct
PTS = 3 * Three_P + 2 * Two_P + FT
FG = Three_P + Two_P
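The relationships among these box-score statistics follow from their standard definitions: a shooting percentage is makes divided by attempts, and points weight 3-pointers by 3 and 2-pointers by 2. A small check on a hypothetical per-game stat line (the numbers are made up, and the lowercase variable names only mirror the dataset columns) might look like this:

```python
import math

# Hypothetical per-game line for one player (made-up numbers).
three_p, three_pa = 2.0, 5.0   # 3-point makes and attempts
two_p, two_pa = 6.0, 12.0      # 2-point makes and attempts
ft, fta = 4.0, 5.0             # free-throw makes and attempts
orb, drb = 1.5, 6.5            # offensive and defensive rebounds

fg = three_p + two_p                   # all field goals made
fga = three_pa + two_pa                # all field goal attempts
fg_prct = fg / fga                     # percentage = makes / attempts
ft_prct = ft / fta
pts = 3 * three_p + 2 * two_p + ft     # points weight each shot type
trb = orb + drb
efg_prct = (fg + 0.5 * three_p) / fga  # effective FG% gives 3-pointers extra credit

# Attempts can be recovered from makes and percentage, so they add no new information.
print(math.isclose(fga, fg / fg_prct), math.isclose(fta, ft / ft_prct))
print(pts, trb, round(efg_prct, 3))
```

Because each dropped column can be reconstructed from the retained ones, keeping both would make the design matrix nearly or exactly collinear.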
Then we fit the full model using all the remaining player statistics, such as Age, G, GS and MP, which gives the following logistic regression model.

Full Model Summary - Model 1

Given the many highly correlated variables in the heatmap above, as well as a warning about multicollinearity, we decided to first use both VIF Factors and the Deviance Test to find removable predictors.

VIF analysis on full model

Features	VIF Factor
Age	23.7475
G	11.2644
GS	6.57655
MP	79.4739
FG_Prct	870.907
Three_P	8.40503
Three_P_Prct	6.21584
Two_P	22.851
Two_P_Prct	122.755
eFG_Prct	755.823
FT	10.2741
FT_Prct	21.833
ORB	11.5458
DRB	18.2727
AST	12.1883
STL	10.1388
BLK	3.92813
TOV	23.6123
PF	20.7312
Pos_PF	2.29591
Pos_PG	4.82529
Pos_SF	3.03752
Pos_SG	3.895
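For intuition about what a VIF measures: with only two predictors, the centered VIF reduces to 1 / (1 - r²), where r is their Pearson correlation. The sketch below illustrates this special case on made-up values. How the table above was computed (for example, whether an intercept column was included) is not stated, so these numbers are not expected to match it, and `pearson_r` and `vif_two_predictors` are hypothetical helper names.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """Centered VIF when the model contains exactly two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Made-up values: two nearly collinear shooting percentages and one unrelated column.
fg_prct = [0.40, 0.45, 0.50, 0.55, 0.60]
efg_prct = [0.45, 0.49, 0.56, 0.59, 0.66]
unrelated = [3.0, 1.0, 2.0, 1.0, 3.0]

print(vif_two_predictors(fg_prct, efg_prct))   # large: the columns move together
print(vif_two_predictors(fg_prct, unrelated))  # close to 1: no linear relationship
```

With more than two predictors, each VIF comes from regressing one predictor on all the others, but the reading is the same: values far above 10 signal that a predictor is almost a linear combination of the rest.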

We use a function that, at each step, computes the VIF factors, takes the predictor with the largest VIF, and removes it if the deviance test for the resulting reduced model does not reject H0 (that is, if the reduced model is adequate).

Hence, we remove the predictors FG_Prct, eFG_Prct, TOV, Age, MP, FT_Prct, Two_P_Prct and ORB, each of which has a high VIF factor and whose removal yields a reduced model with a low ΔG in the Deviance Test.

With these remaining predictors, we run a logistic model again and here is our second model.

Reduced Model Summary - Model 2

Features	VIF Factor
Two_P	16.2202
DRB	12.2412
PF	12.0968
STL	9.59197
G	9.12068
FT	8.76224
AST	8.3415
GS	5.37419
Three_P_Prct	4.76798
BLK	3.78012
Pos_PG	3.45641
Three_P	3.4551
Pos_SG	2.54538
Pos_SF	2.13738
Pos_PF	1.90174

However, some of the remaining predictors still have a VIF Factor larger than 10.

To check whether the reduced model is preferable to the full model, we perform a deviance test.

Null Hypothesis: Reduced Model
Alternative Hypothesis: Full Model

ΔG = Deviance(Reduced Model) - Deviance(Full Model) = 13.7661
χ2 critical value (α = 0.05, df = 8) = 15.5073

At a significance level of 0.05, ΔG < χ2. Therefore, we cannot reject the Null Hypothesis, and we choose Model 2.
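The decision rule of this deviance test can be checked numerically. The sketch below uses the ΔG and critical value reported above, with df = 8 because eight predictors were removed between Model 1 and Model 2. To stay within the standard library, it uses the closed-form chi-square tail probability that holds for even degrees of freedom; `chi2_sf_even_df` is a hypothetical helper written for this illustration.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the tail is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


delta_g = 13.7661   # deviance difference reported in the text
df = 8              # number of predictors removed from the full model
critical = 15.5073  # chi-square critical value reported in the text

p_value = chi2_sf_even_df(delta_g, df)
reject_reduced = delta_g > critical

print(round(p_value, 4), reject_reduced)
```

The p-value lies above 0.05 and ΔG is below the critical value, so both views lead to the same conclusion: the reduced model is not rejected.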

However, the Wald test shows that some predictors are still insignificant, with p-values larger than 0.05. Therefore, we continue to remove predictors using the Deviance Test and the Wald Test. The removable predictors based on the Deviance Test are listed below.

Deviance test	GS	Three_P_Prct	Pos_PG	Pos_SG
delta_G	14.9021	14.5091	14.2247	14.3993
chi2_stat	16.9190	16.9190	16.9190	16.9190

However, position is encoded as a set of dummy variables, so if we drop Pos_PG and Pos_SG, we also need to drop the other two position dummies. Dropping that many predictors at once leads the Deviance Test to favor the full model.

Hence, we only drop the variables GS and Three_P_Prct and keep the Pos dummies.
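The reason the position dummies belong together is visible in how they are built. The sketch below assumes five positions with center as the baseline, which is consistent with the four Pos_ columns in the VIF tables and with the later comparison of each position against center; `encode_position` is a hypothetical helper, not the project's code.

```python
POSITIONS = ["C", "PF", "PG", "SF", "SG"]
BASELINE = "C"  # assumed reference category: it gets no column of its own


def encode_position(pos):
    """Return the four 0/1 dummy columns for one player's position."""
    return {f"Pos_{p}": int(pos == p) for p in POSITIONS if p != BASELINE}


print(encode_position("PG"))  # only Pos_PG is 1
print(encode_position("C"))   # all zeros: the baseline
```

Dropping only Pos_PG and Pos_SG would silently merge point guards and shooting guards into the baseline group, so every remaining position coefficient would change its meaning. This is why the dummies are kept or removed as a group.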

Reduced Model Summary - Model 3

So far, here is the main logistic model we'll use.

Model Diagnosis

Multicollinearity

Features	VIF Factor
Two_P	15.268
DRB	12.0013
PF	11.9826
STL	9.41759
FT	8.6996
G	8.27271
AST	8.09547
BLK	3.77999
Pos_PG	2.90391
Three_P	2.76944
Pos_SG	2.12558
Pos_SF	1.80177
Pos_PF	1.73616

The VIF table above indicates that multicollinearity is still present in this model. However, we choose not to drop the predictors with high VIF, as both the Deviance test and the Wald test consider them significant.

Pearson residuals Plot -- Test Heteroscedasticity

From the graph above, we can see some studentized residuals with absolute values larger than 3, which indicates that there may be outliers or influential points causing heteroscedasticity.

To find the outliers and influential points, we plot the residuals as well as Cook's distance.

Internally Studentized Residuals

Cook's Distance

Given Cook's distance, DFFITS and Studentized Residuals, we find 316 influential points. Since these 316 observations account for only about 5% of the total observations, we drop them and rerun the model.

Final Model - Model 4

Reduced Model Summary - Model 4

Here is our final model. To confirm whether it is the best of the models we have run, we compare the AIC and BIC of the above 4 models.

Model	AIC	BIC
Model 1	1656.71	-80147.9
Model 2	1654.48	-80207
Model 3	1652.39	-80223.3
Model 4	203.846	-78484.6

Before dropping outliers and influential points, Model 3 has the lowest AIC and BIC of the first three models, which indicates that it is better than Model 1 and Model 2. After dropping outliers and influential points, the AIC decreases substantially (from 1652.39 to 203.846) while the BIC increases slightly (from -80223.3 to -78484.6). We therefore choose Model 4 as our final model.

Model 4 - Internally Studentized Residuals

After removing outliers, the residual plot looks better.

Model 4 - π Plot

Here we visualize how π changes with the model.

Final Model Summary

Variables

Predictors	βi	e^(βi)
Intercept	-43.98931819527114	7.86469427844486e-20
G	0.16684022971613738	1.181565471176185
Three_P	2.489105636067465	12.050493772492525
Two_P	1.9363248937885942	6.9332237580619385
FT	1.3465680005479181	3.8442095399977703
DRB	1.0637258193048331	2.89714514483543
AST	0.42443823739809583	1.52873139427614
STL	1.9159062275216838	6.793092095448223
BLK	1.3372375113410921	3.808507999802208
PF	-1.438304019319382	0.23732992451989693
Pos_PF	-1.6188977597716383	0.1981169512519482
Pos_PG	2.351391076003565	10.50016609913466
Pos_SF	-2.4763565102424177	0.08404889969899432
Pos_SG	-0.8486571509347313	0.4279892712297531
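The e^(βi) column is the odds ratio for a one-unit increase in each predictor, and (e^(βi) - 1) × 100 is the corresponding percent change in the odds. A short sketch that recomputes both from the βi column (coefficients copied from the table; the dictionary names are illustrative):

```python
import math

# Coefficients copied from the table above (intercept omitted).
coefficients = {
    "G": 0.16684022971613738,
    "Three_P": 2.489105636067465,
    "Two_P": 1.9363248937885942,
    "FT": 1.3465680005479181,
    "DRB": 1.0637258193048331,
    "AST": 0.42443823739809583,
    "STL": 1.9159062275216838,
    "BLK": 1.3372375113410921,
    "PF": -1.438304019319382,
    "Pos_PF": -1.6188977597716383,
    "Pos_PG": 2.351391076003565,
    "Pos_SF": -2.4763565102424177,
    "Pos_SG": -0.8486571509347313,
}

# Odds ratio: factor by which the odds are multiplied per one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
# Percent change in the odds implied by that factor.
percent_change = {name: (ratio - 1.0) * 100.0 for name, ratio in odds_ratios.items()}

for name in coefficients:
    print(f"{name:8s} odds ratio {odds_ratios[name]:8.4f}  change {percent_change[name]:+9.1f}%")
```

An odds ratio above 1 raises the odds and one below 1 lowers them. For example, the ratio for G is about 1.18 (an 18% increase per extra game), while the ratio for PF is about 0.24 (roughly a 76% decrease per extra foul per game).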

Formula
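Written out from the coefficient table above, with coefficients rounded to four decimals, the fitted logistic regression model has the following form, where π is the probability that a player is named Player of the Week during the season:

```latex
\[
\begin{aligned}
\log\frac{\pi}{1-\pi} ={}& -43.9893 + 0.1668\,\text{G} + 2.4891\,\text{Three\_P} + 1.9363\,\text{Two\_P} \\
& + 1.3466\,\text{FT} + 1.0637\,\text{DRB} + 0.4244\,\text{AST} + 1.9159\,\text{STL} \\
& + 1.3372\,\text{BLK} - 1.4383\,\text{PF} - 1.6189\,\text{Pos\_PF} + 2.3514\,\text{Pos\_PG} \\
& - 2.4764\,\text{Pos\_SF} - 0.8487\,\text{Pos\_SG}
\end{aligned}
\]
```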

Interpretation of Model

Intercept: For a center (the baseline position) with all other predictors equal to zero, the odds of winning Player of the Week are 7.8647e-20, which is practically zero.
G: Holding the other variables constant, each additional game played increases the odds of winning POTW by about 18%.
Three_P: Holding the other variables constant, each additional 3-point field goal per game multiplies the odds of winning POTW by about 12 (an increase of about 11 times).
Two_P: Holding the other variables constant, each additional 2-point field goal per game multiplies the odds of winning POTW by about 6.9 (an increase of about 6 times).
FT: Holding the other variables constant, each additional free throw per game multiplies the odds of winning POTW by about 3.8 (an increase of about 2.8 times).
DRB: Holding the other variables constant, each additional defensive rebound per game multiplies the odds of winning POTW by about 2.9 (an increase of about 1.9 times).
AST: Holding the other variables constant, each additional assist per game increases the odds of winning POTW by about 53%.
STL: Holding the other variables constant, each additional steal per game multiplies the odds of winning POTW by about 6.8 (an increase of about 5.8 times).
BLK: Holding the other variables constant, each additional block per game multiplies the odds of winning POTW by about 3.8 (an increase of about 2.8 times).
PF: Holding the other variables constant, each additional personal foul per game decreases the odds of winning POTW by about 76%.
Pos_PF: Holding the other variables constant, the odds for a power forward are about 80% lower than for a center.
Pos_PG: Holding the other variables constant, the odds for a point guard are about 10.5 times those of a center (about 9.5 times higher).
Pos_SF: Holding the other variables constant, the odds for a small forward are about 92% lower than for a center.
Pos_SG: Holding the other variables constant, the odds for a shooting guard are about 57% lower than for a center.

To summarize, the model indicates that 3-point field goals per game have the largest per-unit effect on whether a player is named Player of the Week. In addition, the odds for a point guard are higher than for players at any other position. A player who wants to increase his chance of winning Player of the Week would be advised to increase his 2-point field goals, free throws, steals, assists, blocks and defensive rebounds per game, and to decrease his personal fouls.

Prediction of Model

Intercept	G	Three_P	Two_P	FT	DRB	AST	STL	BLK	PF	Pos_PF	Pos_PG	Pos_SF	Pos_SG	Predicted πi
1	56	0.9	2.1	1	2.5	1.5	0.6	0.3	1.9	0	0	0	0	0
1	82	5.1	9.3	0	11.1	10.7	2.4	2.7	3.8	0	1	0	0	1

We use the median statistics of 2019 and the max statistics of 2019 for prediction. As a result, a player with median performance has approximately a 0% chance of winning POTW, while a player with max performance has a 99.99% chance. This suggests that, to some extent, our model can predict whether a player could win POTW based on his performance.
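These two predictions can be reproduced by plugging each row of the prediction table into the logistic function π = 1 / (1 + e^(-η)), where η is the linear combination of the coefficients and the row values. The sketch below copies the coefficients and the two rows from the tables above; `predict_pi` is a hypothetical helper name.

```python
import math

# Coefficients copied from the Final Model Summary table.
coef = {
    "Intercept": -43.98931819527114,
    "G": 0.16684022971613738,
    "Three_P": 2.489105636067465,
    "Two_P": 1.9363248937885942,
    "FT": 1.3465680005479181,
    "DRB": 1.0637258193048331,
    "AST": 0.42443823739809583,
    "STL": 1.9159062275216838,
    "BLK": 1.3372375113410921,
    "PF": -1.438304019319382,
    "Pos_PF": -1.6188977597716383,
    "Pos_PG": 2.351391076003565,
    "Pos_SF": -2.4763565102424177,
    "Pos_SG": -0.8486571509347313,
}

# The two rows of the prediction table (position dummies that are 0 are omitted).
median_row = {"G": 56, "Three_P": 0.9, "Two_P": 2.1, "FT": 1, "DRB": 2.5,
              "AST": 1.5, "STL": 0.6, "BLK": 0.3, "PF": 1.9}
max_row = {"G": 82, "Three_P": 5.1, "Two_P": 9.3, "FT": 0, "DRB": 11.1,
           "AST": 10.7, "STL": 2.4, "BLK": 2.7, "PF": 3.8, "Pos_PG": 1}


def predict_pi(row):
    """Predicted probability of POTW for one player-season."""
    eta = coef["Intercept"] + sum(coef[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-eta))


print(predict_pi(median_row))  # very close to 0
print(predict_pi(max_row))     # very close to 1
```

The linear predictor is strongly negative for the median row and strongly positive for the max row, which is why the table rounds the two probabilities to 0 and 1.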