This is our Big Data course project, built with MapReduce (MR) techniques. It aims to draw conclusions and drive a managerial decision by analyzing a large number of applications published on Google's Play Store. Our framework of choice was Spark, through its Python API, PySpark. The data went through a long pipeline: cleaning, preprocessing, transformations, EDA, many MapReduce jobs, heavy clustering, AI modeling, and finally decision making. The dataset fell short of our expectations in several ways, but we worked around those obstacles.
We want to choose the app's price so that it maximizes the company's profits, pick the most suitable developers, decide its stance on ads, and determine exactly which category the app should target.
This dataset was scraped via a Python script running on a cloud. (We didn't scrape it ourselves; we downloaded it from here.)
- Data Preprocessing and cleansing
- Data Exploration (Involves visualization to extract knowledge from the data):
- Descriptive analysis: using MapReduce.
- Diagnostic analysis: using Pearson's and Spearman's correlation.
- Clustering to gain insights about the data: using K-Means, K-Medoids or ISODATA.
- Model training and validation for prediction and classification: using SVM, LR or Decision Trees, plus K-Fold cross-validation.
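As a rough illustration of the diagnostic analysis step above, here is a minimal sketch of Pearson's and Spearman's correlation in plain Python, so it runs without a Spark installation. The column names and values are made up for illustration and are not taken from the dataset. In PySpark itself, `pyspark.ml.stat.Correlation.corr` accepts `method="pearson"` or `method="spearman"` and would likely be the more practical choice on the full data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical values, for illustration only.
installs = [100, 1_000, 10_000, 100_000, 1_000_000]
rating_counts = [3, 20, 150, 900, 20_000]

print("Pearson: ", pearson(installs, rating_counts))
print("Spearman:", spearman(installs, rating_counts))
```

Spearman only looks at the ordering of the values, so it is less sensitive than Pearson to the heavy skew that install counts usually have.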
- Art & Design
- Games
- Role-Playing
- Photography
- Comics
- PT. Teknologi Usaha Sukses Bersama
- Petar Marković
- Rmapps
- GameWriterStudio
- 인디사이드게임즈
- Ads are optional, but we prefer not to support ads…
- If the app is paid, it is better to keep the price under $4
- Average number of installs = 27k
- Average rating = 3.4 (assuming more than 2,000 reviewers)
- App's price = $3.2 (if it was free at launch)
With a confidence level of 99.999999% you will be a millionaire in just 3 hours 🐸
These steps can also be found in our document and our presentation:
- Collect the Dataset
- Install PySpark and all its dependencies
- Preprocess and clean the dataset
- Perform EDA using PySpark's low-level MapReduce functions
- Use RDDs whenever possible
- Perform Diagnostic analysis given the previous EDA
- Answer some predictive questions
- Clustering
- ML Modelling
- Business intelligence
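The low-level MapReduce style used for the EDA step above can be sketched as follows. This is plain Python that mimics the pattern, so it runs without a cluster; the records and the helper `reduce_by_key` are hypothetical and only for illustration. On an actual RDD the same idea would be written roughly as `rdd.map(...).reduceByKey(...).mapValues(...)`.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (category, installs) records; not taken from the real dataset.
records = [
    ("Games", 50_000),
    ("Comics", 1_000),
    ("Games", 10_000),
    ("Photography", 5_000),
    ("Comics", 3_000),
]


def map_phase(record):
    """Map: emit (key, (sum, count)) so that averages can be combined later."""
    category, installs = record
    return category, (installs, 1)


def combine(a, b):
    """Reduce: add partial sums and partial counts for the same key."""
    return a[0] + b[0], a[1] + b[1]


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (like reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


totals = reduce_by_key(map(map_phase, records), combine)

# Final step: turn each (sum, count) pair into an average.
avg_installs = {cat: total / count for cat, (total, count) in totals.items()}
print(avg_installs)
```

Carrying a `(sum, count)` pair instead of a running average keeps the reduce function associative and commutative, which is what a distributed `reduceByKey` needs in order to merge partial results in any order.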