yaay_final-year-project: A Jupyter Notebook repository from sud2000

Our Heroku link: https://yaay-final-year.herokuapp.com/

Problem statement: Using data from the web, can we build models using supervised learning techniques to classify whether a startup will be successful?

Using the Crunchbase dataset, which has information on 20,000+ companies and all of their funding rounds, I set out to determine which factors could predict a startup's success. I focused on companies founded within the last decade that had raised more than one round of funding. In this project, "success" means an IPO or an acquisition, and "failure" means closing. After tuning a logistic regression model, I deployed it via a Flask web app on Heroku. I looked at the following features:

Average money raised per funding round
Number of funding rounds
Average time between funding rounds
Time between seed and series A round
Country
State
Industry
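
As a rough illustration of how the funding-based features above could be derived, here is a minimal sketch. It is not the notebook's actual code: the `FundingRound` class, its field names, and the round type labels ("seed", "series_a") are assumptions made for this example, and the three rounds at the bottom are made-up toy data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class FundingRound:
    # Hypothetical record layout; the real Crunchbase columns may differ.
    round_type: str
    announced_on: date
    raised_usd: float


def funding_features(rounds: List[FundingRound]) -> dict:
    """Compute the four funding-based features for one company."""
    ordered = sorted(rounds, key=lambda r: r.announced_on)
    n_rounds = len(ordered)

    # Average money raised per funding round
    avg_raised = sum(r.raised_usd for r in ordered) / n_rounds if n_rounds else None

    # Average time (in days) between consecutive funding rounds
    gaps = [(b.announced_on - a.announced_on).days
            for a, b in zip(ordered, ordered[1:])]
    avg_gap_days = sum(gaps) / len(gaps) if gaps else None

    # Time (in days) between the first seed round and the first series A round
    seed = next((r.announced_on for r in ordered if r.round_type == "seed"), None)
    series_a = next((r.announced_on for r in ordered if r.round_type == "series_a"), None)
    seed_to_a_days = (series_a - seed).days if seed and series_a else None

    return {
        "avg_raised_usd": avg_raised,
        "n_rounds": n_rounds,
        "avg_days_between_rounds": avg_gap_days,
        "seed_to_series_a_days": seed_to_a_days,
    }


# Toy example (made-up numbers, not taken from the dataset)
example_rounds = [
    FundingRound("seed", date(2015, 1, 1), 1_000_000),
    FundingRound("series_a", date(2016, 1, 1), 5_000_000),
    FundingRound("series_b", date(2017, 1, 1), 12_000_000),
]
example_features = funding_features(example_rounds)
```

Country, State, and Industry are categorical and would typically be one-hot encoded before being passed to a model; that step is left out of this sketch.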

I applied various classification algorithms, and I found that tree-based models like XGBoost as well as Logistic Regression performed the best.

I deployed the Logistic Regression model because it is interpretable and shows quantitatively how each input contributes to the output. Using a probability threshold of 35%, I achieved an F-beta score of 0.85 with a beta value of 3. A beta of 3 places extra emphasis on recall: in venture capital investing (the intended use case for this model), it is far more important to catch any potential "unicorns", even at the expense of investing in a few "duds".
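
To make the metric concrete, the sketch below computes an F-beta score after applying a probability threshold, using F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The function name `fbeta_at_threshold` and the toy labels and probabilities are assumptions for illustration only; they do not reproduce the 0.85 result. scikit-learn's `fbeta_score` computes the same quantity from hard predictions.

```python
def fbeta_at_threshold(y_true, y_prob, threshold=0.35, beta=3.0):
    """F-beta score after turning predicted probabilities into 0/1 labels."""
    # A company is predicted "successful" when its probability meets the threshold.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # With beta > 1, recall is weighted more heavily than precision.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy data (made up): 1 = success, 0 = failure
toy_labels = [1, 1, 1, 0, 0]
toy_probs = [0.9, 0.4, 0.2, 0.5, 0.1]

score_035 = fbeta_at_threshold(toy_labels, toy_probs, threshold=0.35, beta=3.0)
score_050 = fbeta_at_threshold(toy_labels, toy_probs, threshold=0.50, beta=3.0)
```

On this toy data, lowering the threshold from 0.50 to 0.35 recovers one more true success and raises the score, which is the trade-off a recall-weighted metric is meant to reward.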

Files

p03_Data_Cleaning.ipynb shows the process to clean all of the data and prepare relevant features for modeling

p03_Modeling.ipynb shows the process of training various classifiers and evaluating feature relationships and model performance

Web App: files used to build the browser-based predictor tool hosted on Heroku (https://yaay-final-year.herokuapp.com/)

Slides can be found here
