The files bestfitline.py, Figure-1.png, and output.txt find the equation of the best-fit line for a small set of given points.
I've generated some dummy data to test the procedure, but with a few small modifications it will work on any dataset.
The procedure is extremely simple:
- Find the mean values of x (independent variable) and y (dependent variable)
- Using these means, find the variance and standard deviation of x and y
- Find covariance of x and y
- Then find the correlation between these (this tells us whether x and y are strongly/weakly related)
- Now we calculate the coefficients in the line equation
- Predict new values of y (I've done it for the same dataset)
- Check the error between the new y and old y values (using the root mean squared error)
mean(x) = sum(x)/count(x)
std_dev(x) = sqrt{ sum( [x - mean(x)]^2 ) / (count(x) - 1) }
variance(x) = std_dev(x)^2
covariance = sum( [x - mean(x)] * [y - mean(y)] ) / (count(x) - 1)
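As a minimal sketch of the mean, standard deviation, and covariance formulas, assuming the sample (count - 1) versions: the function names and the five data points here are made up for illustration and are not necessarily what bestfitline.py uses.

```python
import math


def mean(values):
    # Arithmetic mean: sum(x) / count(x)
    return sum(values) / len(values)


def variance(values):
    # Sample variance: squared deviations from the mean, divided by (count - 1)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def std_dev(values):
    # Standard deviation is the square root of the variance
    return math.sqrt(variance(values))


def covariance(xs, ys):
    # Sample covariance of x and y, also divided by (count - 1)
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical dummy data, chosen only to keep the arithmetic easy to check
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(mean(x), mean(y))      # 3.0 4.0
print(variance(x))           # 2.5
print(covariance(x, y))      # 1.5
```

Both inputs must have the same length and at least two points, otherwise the (count - 1) divisor is zero.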
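The correlation is found using the Pearson correlation formula:
r = covariance / [std_dev(x) * std_dev(y)]
r ranges from -1 to 1; values near -1 or 1 mean x and y are strongly related, values near 0 mean they are weakly related.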
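The Pearson formula can be sketched in a few lines. This version leans on Python's standard statistics module for the mean and the sample standard deviation; pearson_r is a hypothetical helper name and the data is made up.

```python
import statistics


def pearson_r(xs, ys):
    # Sample covariance of x and y (divided by count - 1)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    # statistics.stdev is the sample standard deviation, matching the covariance divisor
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical dummy data
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4))  # 0.7746
```

Points that lie exactly on a rising line, such as x = [1, 2, 3] and y = [2, 4, 6], give r = 1 up to floating-point rounding.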
The line equation and its coefficients:
y = b0 + b1*x
b1 = covariance / variance(x), where variance(x) = std_dev(x)^2
b0 = mean(y) - b1*mean(x)
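Here is a small sketch of computing the coefficients and predicting y, assuming the usual least-squares slope: covariance divided by the variance of x (the squared standard deviation, not the standard deviation itself). fit_line and predict are hypothetical names and the data is made up.

```python
import statistics


def fit_line(xs, ys):
    # Slope b1 = covariance(x, y) / variance(x); intercept b0 = mean(y) - b1 * mean(x)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    b1 = cov / statistics.variance(xs)  # sample variance, same (count - 1) divisor
    b0 = my - b1 * mx
    return b0, b1


def predict(xs, b0, b1):
    # Apply y = b0 + b1*x to every x
    return [b0 + b1 * x for x in xs]


# Hypothetical dummy data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print(round(b0, 4), round(b1, 4))  # 2.2 0.6
predicted_y = predict(x, b0, b1)
```

Because the covariance and the variance use the same divisor, it cancels out, so the slope comes out the same whether you divide by count or count - 1.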
The root mean squared error:
Error = sqrt{ sum( [predicted_y - actual_y]^2 ) / count(y) }
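A sketch of the error check, assuming the standard root mean squared error definition: average the squared differences first, then take the square root. rmse is a hypothetical helper name, and the predicted values below are what the line y = 2.2 + 0.6x gives for made-up data.

```python
import math


def rmse(predicted, actual):
    # Mean of the squared differences, then the square root of that mean
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical values: actual y and the predictions from y = 2.2 + 0.6x for x = 1..5
actual_y = [2, 4, 5, 4, 5]
predicted_y = [2.8, 3.4, 4.0, 4.6, 5.2]

print(round(rmse(predicted_y, actual_y), 4))  # 0.6928
```

The result is in the same units as y, so an error of about 0.69 here means the predictions are typically off by roughly 0.7.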