CS109_crunchbase

We are an analytics firm that provides consulting services for investors based on our data science expertise. There is no way to know ahead of time which companies will succeed or fail. However, we can try to predict success from the large amount of data about startups that is available online. For this project we analyze data obtained through the CrunchBase API.

You can learn more about us and play with our visualization on our website:

http://nicodri.github.io/CS109_crunchbase/

or watch our video:

https://www.youtube.com/watch?v=M5FSEExBVDs

Table of Contents of the Process Notebooks

Data Collection Notebook

Used to pull data from the CrunchBase API.

Scraping Data

Organization-List
Excel-API
Relationships

Ensemble Analysis Notebook

Used to analyze the CrunchBase data by building individual models and combining them into an ensemble.

Predicting Startup Success

Data-Cleaning
Exploratory-Data-Analysis
Bring out the Models
- The Baseline Model
- K-Nearest Neighbors
- Logistic Regression
- SVM
- Naive Bayes
- Random Forests
Building an Ensemble
ROC/Profit Curves

Similarity Graph Notebook

Used to build a similarity graph of the companies.

Similarity Graph

Formatting the Data
- Loading Data
- Dimensionality Reduction
Distance Matrix
- Closest Neighbors
- Multi-Dimensional Scaling (MDS)
Unsupervised Learning
- k-Means
- Gaussian Mixture Models
- Results
Tuned Similarity Mapping
- Competitors Graph
- Weighted Graph

System Requirements

We developed this project with Python 2.7.9 on OS X.

You need the following libraries to run the code:

numpy
pandas
scikit-learn
networkx
scipy
json (part of the Python standard library)
requests
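
To confirm that the libraries listed above are installed before opening the notebooks, a quick check along these lines may help. This is a minimal sketch rather than part of the project: the `REQUIRED` list and the `find_missing` helper are names made up for illustration, and the only assumption is that scikit-learn is imported under the name `sklearn`.

```python
import importlib

# Import names for the libraries listed above.
# Note: scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "pandas", "sklearn", "networkx", "scipy", "json", "requests"]


def find_missing(modules):
    """Return the names in `modules` that cannot be imported (hypothetical helper)."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All required libraries are available.")
```

Any library reported as missing can usually be installed with pip under its package name (for example `scikit-learn` rather than `sklearn`).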

Reference

We would like to credit the tools we used to build our website:

Peter Finlan for the website template
CanvasJS for the slider animation
Mapbox for the map
Alchemy.js for the node graph