A Linear Regression model was built to predict the rating and success of an application based on several factors of an application.
Dataset Description
The dataset, ‘Google Play Store Apps’ was obtained from Kaggle and used for this
study.
• No of observation (rows): 10840
• Attributes (columns): 13
Independent variables:
i. App: This contains the application name
ii. Category: Category of the app
iii. Reviews: No. of user reviews
iv. Size: Size of the app
v. Installs: Number of user installs
vi. Type: Paid or Free
vii. Price: Price of the app
viii. Content Rating: Age group the app is targeted at - Children / Mature 21+ /
Adult
ix. Genres: multiple genres (For eg, a game can belong to Music, Game, Family
genres.
x. Last Updated: Date when the app was last updated
xi. Current Ver: Current version of the app available on Play Store
xii. Android Ver: Min required Android version
Dependent variable:
Rating: Overall user rating of the app
Approach Used in the Project
In this report we would be solving the problem with two methods using, Scikit -learn
and statsmodel.
• We will start by fitting the model using SKLearn. After we fit the model, unlike
with statsmodels, SKLearn does not automatically print the concepts or have
a method like summary. So we have to print the coefficients,intercepts etc.
separately.
• After fitting the model with SKLearn, we fit the model using statsmodels.
Unlike SKLearn, statsmodels doesn’t automatically fit a constant, so you need
to use the method sm.add_constant(X) in order to add a constant. Adding a
constant, while not necessary, makes your line fit much better.
• Coefficients can be obtained pretty easily from SKLearn, so the main benefit
of statsmodels is the other statistics it provides.
• One of the assumptions of a simple linear regression model is normality of our
data.The statistics in the summary table in statsmodel are testing the
normality of our data.
• If the Prob(Omnibus) is very small, and we took this to mean <.05 as this is
standard statistical practice, then our data is not normal. This is a more
precise way than graphing our data to determine if our data is normal.
Therefore, SKLearn has more useful features, but statsmodels is a good method to
analyze your data before
For detailed idea about the project please check the attached pdf "Report Data Science Final" in the repository.
anushka012399/Mobile_App_Success_Prediction
Repository used to store a project done as a part of 'Fundamentals of Data Science' course.
Jupyter Notebook