Student Exam Scores

In this project, I analyzed data related to student scores and reduced the dimensionality of the data to keep the most important features only.

Then, I built a predictive model to test the hypothesis that the quiz and major exam scores are linearly related to the final exam score.

1 Problem #A: Exploratory Data Analysis Using Correlation and Principal Components for Student Scores

2 Problem #B: Regression Analysis for Student Scores

Metadata and Details

To perform the required analysis, there are 3 datasets with the following details:

Consider data given in file StudentScoresDataA and the following data description:

Table 1: Data Description

Field	Description
Gender	Gender of the student
Location	Home City of the student
Quiz-1	Score of the student in Quiz-1
Quiz-2	Score of the student in Quiz-2
Quiz-3	Score of the student in Quiz-3
Quiz-4	Score of the student in Quiz-4
Major-1	Score of the student in Major-1
Major-2	Score of the student in Major-2
Major-3	Score of the student in Major-3
Final	Score of the student in the final exam

Problem #A

Do the following tasks using data given in StudentScoresDataA and Table-1:

A-1: Given Data. Read and display the data. Identify the number of rows and columns. Does any column have missing data? Display the description of both numeric and non-numeric columns.

A-2: Type Consistency. For each column in StudentScoresDataA, identify type of each field and verify that each column in Python is identified correctly. If there is any inconsistency, then resolve it.

A-3: Normalization. For each score column in StudentScoresDataA, apply the standard scaler, such that the mean is zero and standard deviation is one.

A-4: Visualization. Draw pairwise scatter plots each pair of score columns in StudentScoresDataA. Also, for each score column draw KDE (Kernel Density Estimation). Differentiate the pairwise plots and KDE plot by ’Gender’ column.

A-5: Correlation Analysis. Do the following: Calculate the correlation between all the score columns of StudentScoresDataA. Identify top 3 variables that are highly correlated with ’Final’ score column. Which pair of score columns are strongly correlated?

A-6: PCA. Do the following: Get first two principal components of the data without considering ’Gender’, ’Location’ and ’Final’ columns. Add the two principal components to the data frame and rename the components ’PC1’ and ’PC2’ respectively. Construct a scatter plot using the first two principal components of the data. Can the principal components separate ’Final’ variable? Differentiate the above plot using ’Gender’. In a separate plot, differentiate the above plot using ’Location’. How much variation do each principal component capture? What are the coefficients (the u vector) of the linear combination of input variables for the first PC?

Problem #A

#A-1: Understanding the Given Data:
import pandas as pd
import numpy as np
df = pd.read_csv('StudentScoresDataA.csv', delimiter =',')
display(df)


# Count the number of rows and columns in the data
print(f'The number of rows are {len(df.index)}, and the number of columns are {len(df.columns)}')
# Count the number of non-null rows for each column
print(f'The number of non-null rows for each column are:\n{df.count()}')
null =df.columns[df.isna().any()]
print(f'The columns containing missing data are:\n{null}') # No columns containing missing data
display(df.describe(include="number"))# only numerical
display(df.describe(include="object")) # only categorical

	Gender	Location	Quiz-1	Quiz-2	Quiz-3	Quiz-4	Major-1	Major-2	Major-3	Final
0	Male	Jeddah	17	17	24	16	100	92	85	86
1	Female	Jeddah	10	21	14	8	16	28	17	25
2	Female	Jeddah	7	21	23	7	55	45	41	43
3	Male	Dammam	22	10	21	12	80	70	67	68
4	Female	Dammam	21	15	17	18	81	79	78	81
...	...	...	...	...	...	...	...	...	...	...
3495	Female	Riyadh	15	25	20	15	87	93	89	81
3496	Male	Dammam	15	14	17	8	48	45	46	43
3497	Male	Riyadh	20	23	20	18	71	77	82	73
3498	Female	Riyadh	10	16	6	12	63	68	49	60
3499	Male	Dammam	12	12	24	5	37	36	26	34

3500 rows × 10 columns

The number of rows are 3500, and the number of columns are 10
The number of non-null rows for each column are:
Gender      3500
Location    3500
Quiz-1      3500
Quiz-2      3500
Quiz-3      3500
Quiz-4      3500
Major-1     3500
Major-2     3500
Major-3     3500
Final       3500
dtype: int64
The columns containing missing data are:
Index([], dtype='object')

	Quiz-1	Quiz-2	Quiz-3	Quiz-4	Major-1	Major-2	Major-3	Final
count	3500.000000	3500.000000	3500.000000	3500.000000	3500.000000	3500.000000	3500.000000	3500.000000
mean	12.414571	14.814857	14.924286	12.523714	55.954571	55.949143	56.154286	54.019714
std	5.618516	6.001572	6.094121	5.612041	21.552887	21.734182	21.803054	21.418980
min	0.000000	5.000000	5.000000	0.000000	12.000000	12.000000	12.000000	8.000000
25%	8.000000	10.000000	10.000000	8.000000	38.000000	38.000000	38.000000	36.000000
50%	12.000000	15.000000	15.000000	13.000000	56.000000	56.000000	56.000000	54.000000
75%	17.000000	20.000000	20.000000	17.000000	74.000000	74.000000	74.000000	72.000000
max	25.000000	25.000000	25.000000	25.000000	100.000000	100.000000	100.000000	100.000000

	Gender	Location
count	3500	3500
unique	2	7
top	Male	Madinah
freq	1756	519

#A-2: Check for Type Consistency
df.info()
print('''As can be seen from the data above and the info below:
gender is object Dtype as it should be.
Location is object Dtype as it should be.
Quiz-1 is numerical as it should be.
Quiz-2 is numerical as it should be.
Quiz-3 is numerical as it should be.
Quiz-4 is numerical as it should be. 
Major-1 is numerical as it should be. 
Major-2 is numerical as it should be. 
Major-3 is numerical as it should be. 
Final is numerical as it should be.''')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Gender    3500 non-null   object
 1   Location  3500 non-null   object
 2   Quiz-1    3500 non-null   int64 
 3   Quiz-2    3500 non-null   int64 
 4   Quiz-3    3500 non-null   int64 
 5   Quiz-4    3500 non-null   int64 
 6   Major-1   3500 non-null   int64 
 7   Major-2   3500 non-null   int64 
 8   Major-3   3500 non-null   int64 
 9   Final     3500 non-null   int64 
dtypes: int64(8), object(2)
memory usage: 273.6+ KB
As can be seen from the data above and the info below:
gender is object Dtype as it should be.
Location is object Dtype as it should be.
Quiz-1 is numerical as it should be.
Quiz-2 is numerical as it should be.
Quiz-3 is numerical as it should be.
Quiz-4 is numerical as it should be. 
Major-1 is numerical as it should be. 
Major-2 is numerical as it should be. 
Major-3 is numerical as it should be. 
Final is numerical as it should be.

#A-3: Normalization: normalizae the data such that the mean is zero and standard deviation is one.
from sklearn.preprocessing import StandardScaler # import
selected_columns = df.select_dtypes(include="number").columns
for c in selected_columns:
    scaler = StandardScaler() # creating object
    scaler.fit(df[[c]]) # fitting
    df[c]= scaler.transform(df[[c]]) #transforming
    print(f'The mean and std values for {c} are', abs(np.round(df[c].mean())), ' and ', np.round(df[c].std()))
display(df)

The mean and std values for Quiz-1 are 0.0  and  1.0
The mean and std values for Quiz-2 are 0.0  and  1.0
The mean and std values for Quiz-3 are 0.0  and  1.0
The mean and std values for Quiz-4 are 0.0  and  1.0
The mean and std values for Major-1 are 0.0  and  1.0
The mean and std values for Major-2 are 0.0  and  1.0
The mean and std values for Major-3 are 0.0  and  1.0
The mean and std values for Final are 0.0  and  1.0

	Gender	Location	Quiz-1	Quiz-2	Quiz-3	Quiz-4	Major-1	Major-2	Major-3	Final
0	Male	Jeddah	0.816245	0.364147	1.489470	0.619522	2.043890	1.658954	1.323202	1.493295
1	Female	Jeddah	-0.429814	1.030734	-0.151690	-0.806188	-1.854057	-1.286137	-1.796073	-1.355053
2	Female	Jeddah	-0.963839	1.030734	1.325354	-0.984402	-0.044296	-0.503847	-0.695153	-0.514557
3	Male	Dammam	1.706287	-0.802381	0.997122	-0.093333	1.115807	0.646579	0.497511	0.652799
4	Female	Dammam	1.528278	0.030853	0.340658	0.975949	1.162211	1.060732	1.002100	1.259824
...	...	...	...	...	...	...	...	...	...	...
3495	Female	Riyadh	0.460228	1.697322	0.833006	0.441308	1.440636	1.704971	1.506688	1.259824
3496	Male	Dammam	0.460228	-0.135793	0.340658	-0.806188	-0.369125	-0.503847	-0.465794	-0.514557
3497	Male	Riyadh	1.350270	1.364028	0.833006	0.975949	0.698170	0.968698	1.185587	0.886270
3498	Female	Riyadh	-0.429814	0.197500	-1.464618	-0.093333	0.326937	0.554545	-0.328179	0.279245
3499	Male	Dammam	-0.073797	-0.469087	1.489470	-1.340829	-0.879570	-0.918001	-1.383228	-0.934805

3500 rows × 10 columns

#A-4: Data Visualization:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure()
sns.pairplot(vars=selected_columns,kind="scatter",hue ='Gender',data=df)#pairwise scatter plot for each numerical pair
plt.show()
#-----
plt.figure
nCols=3   # How many columns wanted
nRows=len(selected_columns)//nCols+1
fig,axes = plt.subplots(nRows, nCols, figsize=(20,10)) 

for ind,col in enumerate(selected_columns): #each numerical column  (تاكد)
    sns.histplot(x=col,bins=10,ax = axes.flatten()[ind],hue ='Gender',kde=True,palette=['r','b'],data=df) 
    plt.xticks(rotation=45)
plt.show()

<Figure size 432x288 with 0 Axes>

#A-5: Correlation Analysis
corr = df.corr()
display(corr.style.background_gradient(cmap='coolwarm').set_precision(4))

corr=corr.apply(lambda x: np.abs(x)) # negative correlation will be considered
sorted_corr = corr.sort_values(by=['Final'], ascending=False)#sort by : to get what variables are correlated with Final
sorted_corr = sorted_corr['Final'].index # store the indecies here to access them next
print('The top three correlated input variables are: ',sorted_corr[1:4].tolist())# discarding 0 (Final vs. Final = 1)

#Which pair of score columns are strongly correlated?
# Quiz-1 is highly correlated with (Quiz-4, Major-1, Major-2, Major-3)
# Quiz-4 is highly correlated with (Quiz-1, Major-1, Major-2, Major-3)
# Major-1 is highly correlated with (Quiz-1, Quiz-4, Major-2, Major-3)
# Major-2 is highly correlated with (Quiz-1, Quiz-4, Major-1, Major-3)
# Major-3 is highly correlated with (Quiz-1, Quiz-4, Major-1, Major-2)
# The pair that is strongly correlated is (Major-2 vs. Final)

The top three correlated input variables are:  ['Major-2', 'Major-1', 'Major-3']

#A-6: Principal Component Analysis 
pca_input_columns=selected_columns.drop(['Final']) # only numerical and without the output (Final)
from sklearn.decomposition import PCA 
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df[pca_input_columns]) 
df['PC1'] = principalComponents[:,0] 
df['PC2'] = principalComponents[:,1] 
# #-----
# Draw the PCAs using the normalized values, but use the original data to differentiate (coloring legends) 
df_copy = pd.read_csv('StudentScoresDataA.csv', delimiter =',') #keeping a copy of the original dataframe to set the legends
custom_hue = df_copy['Final'].apply(lambda x: "(85-100)" if x>=85 else ("(65-85)" if x>=65 and x<85
                                   else "(0-65)"))
plt.figure()
sns.relplot(x='PC1',y='PC2',
            hue = custom_hue, palette=['g','r','b'],
            kind='scatter',alpha=0.75,
            height=5, aspect=1, 
            data=df)
plt.title('PCs vs Final', fontsize = 20)
plt.show()
print(f'The PCs show good separation, with some overlap. even without using Final as an input')
# By Gender:
plt.figure()
Gender = sns.relplot(x='PC1',y='PC2',
            hue = custom_hue, palette=['g','r','b'], 
            col = "Gender",
            kind='scatter',alpha=0.75,
            height=5, aspect=1, 
            data=df)
Gender.fig.suptitle('PCs; Final By Gender') 
Gender.fig.subplots_adjust(top = 0.8)
plt.show()
# By Location:
plt.figure()
Location = sns.relplot(x='PC1',y='PC2', 
            hue = custom_hue, palette=['g','r','b'], 
            style = "Location", 
            kind='scatter',alpha=0.75,
            height=5, aspect=1, 
            data=df)
Location.fig.suptitle('PCs; Final By Location') 
Location.fig.subplots_adjust(top = 0.8)
plt.show()
#----   
# How much variation do each principal component capture?
for ind,var in enumerate(pca.explained_variance_ratio_):
    print(f'The variance captured by pca {ind+1} is: {var*100: 0.2f}%')
#What are the coeffcients (the u vector) of the linear combination of input variables for the first PC?
pca_1_coefficients=np.round(pca.components_[0],4)
zipList=list(zip(pca_input_columns,pca_1_coefficients))
# create new dataframe >> this is just to display it as a dataframe
ndf=pd.DataFrame.from_records(zipList,columns=['field','coeff']) # give new columns and display in df for visula
display(ndf.sort_values(by=['coeff'], ascending=False))

<Figure size 432x288 with 0 Axes>

The PCs show good separation, with some overlap. even without using Final as an input



<Figure size 432x288 with 0 Axes>

<Figure size 432x288 with 0 Axes>

The variance captured by pca 1 is:  61.31%
The variance captured by pca 2 is:  14.33%

	field	coeff
2	Quiz-3	0.0163
1	Quiz-2	0.0151
0	Quiz-1	-0.4213
3	Quiz-4	-0.4227
4	Major-1	-0.4626
6	Major-3	-0.4628
5	Major-2	-0.4638

Problem #B

Consider data given in CSV file StudentScoresDataB . Consider the following data description:

Table 2: Data Description

Field	Description
Quiz-1	Score of the student in Quiz-1
Quiz-2	Score of the student in Quiz-2
Quiz-3	Score of the student in Quiz-3
Quiz-4	Score of the student in Quiz-4
Major-1	Score of the student in Major-1
Major-2	Score of the student in Major-2
Major-3	Score of the student in Major-3
Final	Score of the student in the final exam

Do the following tasks using data given in StudentScoresDataB and Table-2:

B-1: Given Data. Read and display the data. Identify the number of rows and columns. Does any column have missing data? Display the statistical summaries of all the columns.

B-2: Type Consistency. For each column in StudentScoresDataB, identify the type for each field based on value. Also, identify the datatypes in Python. Report and resolve any inconsistency.

B-3: Normalization. For each column in StudentScoresDataB, apply the standard scaler, such that the mean is zero and standard deviation is one. Display the summaries of all the columns.

B-4: Cross Normalization. For each column in StudentScoresDataC, apply the standard scaler fitted (learned) from StudentScoresDataB data. Display the summaries of all the columns in StudentScoresDataC data.

B-5: OLS Regression. The hypothesis is that the quiz and major exam scores are linearly related to final exam score. Use the following formula to calculate the OLS coefficient estimates of all StudentScoresDataB data. Take column ‘Final’ as the output column, and all other columns as input column.

B-6: OLS Regression. The hypothesis is that the quiz and major exam scores are linearly related to the final exam score. Do the following: Use the sklearn library to calculate the OLS coefficient estimates of all StudentScoresDataB data. Take column ‘Final’ as the output column, and all other columns as input column. Compare the coefficients obtained in Part B-5 with the above coefficients. Report any differences in between the coefficients from Parts B-5 and B-6. Using the above OLS coefficient estimates, calculate the MSE for data given in StudentScoresDataC.

B-7: Ridge Regression. It may be possible that the quiz and major exam scores are not really independent. Thus, the coefficients needs regularization (penalization). Do the following: Do the ridge analysis, taking all StudentScoresDataB data as the training data. Use 10-fold cross validation, and pick thebest value of alpha from 10−3*,* 10−2*,* 10−1*,* 100*,* 101*,* 102*,* 103. Using the above coefficient estimates, calculate the MSE for data given in StudentScoresDataC.

B-8: Lasso Regression. It may be possible that not all the quiz and major exam scores are helpful in predicting final score. Thus, the coefficients needs selection (penalization). Do the following: Do the lasso analysis, taking all StudentScoresDataB data as the training data. Use 10-fold cross validation, and pick thebest value of alpha from 10−3*,* 10−2*,* 10−1*,* 100*,* 101*,* 102*,* 103. Using the above coefficient estimates, calculate the MSE for data given in StudentScoresDataC.

B-9: Regression Analysis. Compare and contrast the coefficient estimates obtained from Parts B-6, B-7, B-8 and B-9.

#B-1: Understand the Given Data:
import pandas as pd
import numpy as np
df = pd.read_csv('StudentScoresDataB.csv', delimiter =',')
display(df)


# Count the number of rows and columns in the data
print(f'The number of rows are {len(df.index)}, and the number of columns are {len(df.columns)}')
# Count the number of non-null rows for each column
print(f'The number of non-null rows for each column are:\n{df.count()}')
null =df.columns[df.isna().any()]
print(f'The columns containing missing data are:\n{null}') # No columns containing missing data
display(df.describe())# only numerical

	Quiz-1	Quiz-2	Quiz-3	Quiz-4	Major-1	Major-2	Major-3	Final
0	21	5	14	20	84	82	92	86
1	13	21	10	10	40	41	44	38
2	4	7	24	7	27	22	25	20
3	22	7	19	14	81	92	93	84
4	8	6	17	9	30	39	33	36
...	...	...	...	...	...	...	...	...
2894	13	15	11	10	68	60	72	60
2895	9	11	7	6	27	27	28	26
2896	11	13	9	8	55	51	48	49
2897	15	25	21	8	46	46	38	43
2898	16	16	10	18	73	81	79	73

2899 rows × 8 columns

The number of rows are 2899, and the number of columns are 8
The number of non-null rows for each column are:
Quiz-1     2899
Quiz-2     2899
Quiz-3     2899
Quiz-4     2899
Major-1    2899
Major-2    2899
Major-3    2899
Final      2899
dtype: int64
The columns containing missing data are:
Index([], dtype='object')

	Quiz-1	Quiz-2	Quiz-3	Quiz-4	Major-1	Major-2	Major-3	Final
count	2899.000000	2899.000000	2899.000000	2899.000000	2899.000000	2899.000000	2899.000000	2899.000000
mean	12.574681	15.077268	15.058296	12.563298	55.942049	55.930321	55.997240	53.903415
std	5.526165	6.101757	6.115778	5.688581	21.771147	21.761616	21.796679	21.396040
min	0.000000	5.000000	5.000000	0.000000	12.000000	12.000000	12.000000	7.000000
25%	9.000000	10.000000	10.000000	8.000000	37.000000	38.000000	37.000000	36.000000
50%	13.000000	15.000000	15.000000	13.000000	56.000000	56.000000	56.000000	54.000000
75%	17.000000	20.000000	20.000000	17.000000	74.000000	75.000000	74.000000	72.000000
max	25.000000	25.000000	25.000000	25.000000	100.000000	100.000000	100.000000	100.000000

#B-2: Check for Type Consistency
df.info()
print('''As can be seen from the data above and the info below:
Quiz-1 is numerical as it should be.
Quiz-2 is numerical as it should be.
Quiz-3 is numerical as it should be.
Quiz-4 is numerical as it should be. 
Major-1 is numerical as it should be. 
Major-2 is numerical as it should be. 
Major-3 is numerical as it should be. 
Final is numerical as it should be.''')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2899 entries, 0 to 2898
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Quiz-1   2899 non-null   int64
 1   Quiz-2   2899 non-null   int64
 2   Quiz-3   2899 non-null   int64
 3   Quiz-4   2899 non-null   int64
 4   Major-1  2899 non-null   int64
 5   Major-2  2899 non-null   int64
 6   Major-3  2899 non-null   int64
 7   Final    2899 non-null   int64
dtypes: int64(8)
memory usage: 181.3 KB
As can be seen from the data above and the info below:
Quiz-1 is numerical as it should be.
Quiz-2 is numerical as it should be.
Quiz-3 is numerical as it should be.
Quiz-4 is numerical as it should be. 
Major-1 is numerical as it should be. 
Major-2 is numerical as it should be. 
Major-3 is numerical as it should be. 
Final is numerical as it should be.

#B-3: Normalization: Normalize the values such that the mean is zero and standard deviation is one
from sklearn.preprocessing import StandardScaler # import
scaler = StandardScaler()
scaler.fit(df)
df = pd.DataFrame(scaler.transform(df),columns=df.columns)
for c in df.columns:
    print(f'The mean and std values for {c} are', abs(np.round(df[c].mean())), ' and ', np.round(df[c].std()))
display(df.describe())# only numerical

The mean and std values for Quiz-1 are 0.0  and  1.0
The mean and std values for Quiz-2 are 0.0  and  1.0
The mean and std values for Quiz-3 are 0.0  and  1.0
The mean and std values for Quiz-4 are 0.0  and  1.0
The mean and std values for Major-1 are 0.0  and  1.0
The mean and std values for Major-2 are 0.0  and  1.0
The mean and std values for Major-3 are 0.0  and  1.0
The mean and std values for Final are 0.0  and  1.0

	Quiz-1	Quiz-2	Quiz-3	Quiz-4	Major-1	Major-2	Major-3	Final
count	2.899000e+03	2.899000e+03	2.899000e+03	2.899000e+03	2.899000e+03	2.899000e+03	2.899000e+03	2.899000e+03
mean	2.450993e-18	3.125015e-17	-1.115202e-16	-1.470596e-16	9.803970e-18	-1.470596e-17	3.186290e-17	1.323536e-16
std	1.000173e+00	1.000173e+00	1.000173e+00	1.000173e+00	1.000173e+00	1.000173e+00	1.000173e+00	1.000173e+00
min	-2.275873e+00	-1.651821e+00	-1.644931e+00	-2.208893e+00	-2.018710e+00	-2.019055e+00	-2.018878e+00	-2.192532e+00
25%	-6.469762e-01	-8.322430e-01	-8.272322e-01	-8.023240e-01	-8.702030e-01	-8.240846e-01	-8.717162e-01	-8.369074e-01
50%	7.697787e-02	-1.266543e-02	-9.533704e-03	7.678148e-02	2.662286e-03	3.202484e-03	1.266270e-04	4.514933e-03
75%	8.009320e-01	8.069121e-01	8.081648e-01	7.800659e-01	8.295873e-01	8.764500e-01	8.260830e-01	8.459372e-01
max	2.248840e+00	1.626490e+00	1.625863e+00	2.186635e+00	2.024034e+00	2.025460e+00	2.019131e+00	2.154816e+00

#B-4: Cross Normalization: apply the standard scaler fitted from StudentScoresDataB
df2 = pd.read_csv('StudentScoresDataC.csv', delimiter =',')
from sklearn.preprocessing import StandardScaler
df2 = pd.DataFrame(scaler.transform(df2),columns=df.columns)
for c2 in df2.columns:
    print(f'The mean and std values for {c2} are', abs(np.round(df2[c2].mean())), ' and ', np.round(df2[c2].std()))
display(df2.describe())

The mean and std values for Quiz-1 are 0.0  and  1.0
The mean and std values for Quiz-2 are 0.0  and  1.0
The mean and std values for Quiz-3 are 0.0  and  1.0
The mean and std values for Quiz-4 are 0.0  and  1.0
The mean and std values for Major-1 are 0.0  and  1.0
The mean and std values for Major-2 are 0.0  and  1.0
The mean and std values for Major-3 are 0.0  and  1.0
The mean and std values for Final are 0.0  and  1.0

	Quiz-1	Quiz-2	Quiz-3	Quiz-4	Major-1	Major-2	Major-3	Final
count	324.000000	324.000000	324.000000	324.000000	324.000000	324.000000	324.000000	324.000000
mean	-0.044798	-0.088552	0.116654	-0.108265	-0.059300	-0.078363	-0.034855	0.013316
std	1.031785	1.034225	0.956155	1.038197	0.999085	1.021849	1.021145	1.075939
min	-2.275873	-1.651821	-1.644931	-2.208893	-2.018710	-1.927134	-2.018878	-2.005549
25%	-0.692223	-0.996158	-0.663692	-0.802324	-0.870203	-0.916005	-0.883188	-0.942085
50%	-0.104011	-0.176581	0.154006	-0.099040	-0.089218	-0.134679	0.000127	0.004515
75%	0.800932	0.806912	0.971704	0.604245	0.737707	0.784529	0.826083	0.951115
max	2.067852	1.626490	1.625863	2.186635	1.886214	1.979500	1.927358	2.154816

#B-5: OLS Regression: The hypothesis is that the quiz and major exam scores are linearly related to final exam score.
import numpy as np
X = df.iloc[:,:-1].values # No need to add a new ones column, since we applied standard scaler
y = df.iloc[:,-1].values
best_beta = np.linalg.inv(X.T @ X) @ X.T @ y 
print('The closed form estimates are:', np.round(best_beta,2).tolist())

The closed form estimates are: [0.1, -0.0, -0.0, 0.22, 0.32, 0.42, 0.0]

#B-6: OLS Regression (Using sklearn)
from sklearn.linear_model import LinearRegression
reg1 = LinearRegression().fit(X, y)
best_betas =  np.round(reg1.coef_,2) # beta 1,2,....
best_beta_0 = np.round(reg1.intercept_,2) # beta note
print(f'The estimates using sklearn are :', best_beta_0, best_betas.tolist()) # beta note is 0.0
print(f'It appears to have the same estimates as the OLS closed form (found in B-5 above)')
#-----
X_train = df.iloc[:,:-1].values # the training set is represented by dataB
y_train = df.iloc[:,-1].values
X_test = df2.iloc[:,:-1].values # the testing set is represented by dataC
y_test = df2.iloc[:,-1].values
#--------
#Using the above OLS coeffcient estimates, calculate the MSE for data given in HW5DataC
## OLS (Ordinary Least Squares Linear Regression)
from sklearn.metrics import mean_squared_error
y_pred1 = reg1.predict(X_test)
print('The MSE using OLS is:', mean_squared_error(y_test, y_pred1))

The estimates using sklearn are : 0.0 [0.1, -0.0, -0.0, 0.22, 0.32, 0.42, 0.0]
It appears to have the same estimates as the OLS closed form (found in B-5 above)
The MSE using OLS is: 0.01842862368733385

#B-7: Ridge Regression. It may be possible that the quiz and major exam scores are not really independent. 
# Thus, the coefficients needs regularization (penalization).
## Ridge
from sklearn.linear_model import RidgeCV
reg2 = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], fit_intercept=False,cv=10).fit(X_train, y_train) 
y_pred2 = reg2.predict(X_test) 
print('The MSE using OLS is:', mean_squared_error(y_test, y_pred2))

The MSE using OLS is: 0.018428417221315444

#B-8: Lasso Regression. It may be possible that not all the quiz and major exam scores are helpful in predicting final score. 
# Thus, the coefficients needs selection (penalization).
## Lasso
from sklearn.linear_model import LassoCV 
reg3 = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], 
               fit_intercept=False,cv=10, random_state=0).fit(X_train, y_train) 
y_pred3 = reg3.predict(X_test) 
print('The MSE using Lasso is:', mean_squared_error(y_test, y_pred3))

The MSE using Lasso is: 0.018509010803085785

#B-9: Regression Analysis: Compare and contrast the coefficient estimates
print('The coefficient estimates using OLS Regression are:', np.round(reg1.coef_,3))
print('The coefficient estimates using Ridge Regression are:', np.round(reg2.coef_,3))
print('The coefficient estimates using Lasso Regression are',np.round(reg3.coef_,3))
print()
#As can be seen from the result below, the coefficient estimates are very similar 
best_MSE = min(mean_squared_error(y_test, y_pred1),mean_squared_error(y_test, y_pred2),mean_squared_error(y_test, y_pred3))
print(f'Ridge performs better than OLS and Lasso for this data with minumum MSE of {best_MSE}')

The coefficient estimates using OLS Regression are: [ 0.104 -0.001 -0.002  0.218  0.317  0.421  0.001]
The coefficient estimates using Ridge Regression are: [ 0.104 -0.001 -0.002  0.218  0.317  0.421  0.001]
The coefficient estimates using Lasso Regression are [ 0.104 -0.    -0.001  0.218  0.318  0.421  0.   ]

Ridge performs better than OLS and Lasso for this data with minumum MSE of 0.018428417221315444

🛠 Skills Used

Python, Exploratory Data Analysis, Principal Component Analysis, Linear Regression, Ridge Regression, Lasso Regression, Normalization, Correlation Analysis

🚀 About Me

👋 Hi, I’m @Raed-Alshehri

👀 I’m interested in data science, machine learning, and statistics.

🌱 I’m applying my skills in the data analytics field using Python, R, and SQL

🔗 Links

Feedback

If you have any feedback, please reach out to me at alshehri.raeda@gmail.com

Raed-Alshehri/Student-Exam-Scores

Student Exam Scores

Table of Contents

Metadata and Details

Problem #A

Problem #A

Problem #B

🛠 Skills Used

🚀 About Me

🔗 Links

Feedback