CS109_crunchbase

We are an analytics firm that provides consulting services for investors based on our data science expertise. Unfortunately, there is no way to know ahead of time which companies will succeed or fail. However, we can try to predict success from the large amount of data about startups that is available online. For this project, we analyze data obtained through the CrunchBase API.

You can learn more about us and play with our visualization on our website:

http://nicodri.github.io/CS109_crunchbase/

or watch our video:

https://www.youtube.com/watch?v=M5FSEExBVDs

Table of Contents of the Process Notebooks

Data Collection Notebook

Used to pull data from the CrunchBase API.

Scraping Data

  • Organization-List
  • Excel-API
  • Relationships

Ensemble Analysis Notebook

Used to analyze the CrunchBase data by building individual models and combining them into an ensemble.

Predicting Startup Success

  • Data-Cleaning
  • Exploratory-Data-Analysis
  • Bring out the Models
    • The Baseline Model
    • K-Nearest Neighbors
    • Logistic Regression
    • SVM
    • Naive Bayes
    • Random Forests
  • Building an Ensemble
  • ROC/Profit Curves
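As a rough illustration of what "combining them into an ensemble" can mean, here is a minimal sketch of soft voting: each model outputs a success probability per company, and the ensemble averages those probabilities before thresholding. This is an assumption for illustration only; the notebook may combine its models differently. The function name `soft_vote`, the threshold, and the probability values are all hypothetical, and the sketch uses plain Python so it runs without any of the libraries listed under System Requirements.

```python
def soft_vote(model_probs, weights=None, threshold=0.5):
    """Average per-model success probabilities, then threshold them.

    model_probs[m][i] is model m's predicted probability that
    company i succeeds. Returns (averaged probabilities, 0/1 labels).
    """
    n_models = len(model_probs)
    if weights is None:
        # Equal weights: every model has the same say.
        weights = [1.0] * n_models
    total_weight = float(sum(weights))
    n_samples = len(model_probs[0])

    averaged = []
    for i in range(n_samples):
        score = sum(w * probs[i] for w, probs in zip(weights, model_probs))
        averaged.append(score / total_weight)

    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical probabilities from three models for four companies.
knn_probs = [0.9, 0.2, 0.6, 0.4]
logreg_probs = [0.8, 0.1, 0.4, 0.5]
forest_probs = [0.7, 0.3, 0.2, 0.9]

averaged, labels = soft_vote([knn_probs, logreg_probs, forest_probs])
print(labels)
```

Passing `weights` lets a stronger model count for more, for example weights derived from validation performance.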

Similarity Graph Notebook

Used to build a similarity graph of the companies.

Similarity Graph

  • Formatting the Data
    • Loading Data
    • Dimensionality Reduction
  • Distance Matrix
    • Closest Neighbors
    • Multi-Dimensional Scaling (MDS)
  • Unsupervised Learning
    • K-Means
    • Gaussian Mixture Models
    • Results
  • Tuned Similarity Mapping
    • Competitors Graph
    • Weighted Graph
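The idea behind the distance matrix and closest-neighbors steps can be sketched as follows: compute pairwise distances between company feature vectors, then link each company to its k nearest neighbors, keeping the distance as an edge weight. This is a simplified sketch under stated assumptions, not the notebook's code: the Euclidean metric, the `knn_graph` helper, and the feature vectors are hypothetical, and a plain dictionary stands in for a networkx graph.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(features, k=2):
    """Map each company to its k closest neighbors.

    features: {company_name: feature_vector}
    Returns {company_name: [(neighbor_name, distance), ...]},
    with neighbors sorted from closest to farthest.
    """
    graph = {}
    for name, vec in features.items():
        distances = [
            (other, euclidean(vec, other_vec))
            for other, other_vec in features.items()
            if other != name
        ]
        distances.sort(key=lambda pair: pair[1])
        graph[name] = distances[:k]
    return graph


# Hypothetical two-dimensional features (e.g. after dimensionality reduction).
companies = {
    "A": [0.0, 0.0],
    "B": [1.0, 0.0],
    "C": [0.0, 3.0],
    "D": [5.0, 5.0],
}

graph = knn_graph(companies, k=1)
for name in sorted(graph):
    print(name, graph[name])
```

The stored distances can serve as edge weights when the neighbor lists are turned into a weighted graph.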

System Requirements

We developed the project with Python 2.7.9 on OS X.

You need the following libraries to run the code:

  • numpy
  • pandas
  • scikit-learn
  • networkx
  • scipy
  • json (included in the Python standard library)
  • requests
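A quick way to confirm that these libraries are available before opening the notebooks is to try importing each one. The sketch below is a convenience check and not part of the project; the `check_dependencies` helper is hypothetical. Note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib

# Package name as listed above -> module name used in `import`.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "networkx": "networkx",
    "scipy": "scipy",
    "json": "json",
    "requests": "requests",
}


def check_dependencies(modules=REQUIRED):
    """Return {package_name: True/False} depending on whether it imports."""
    status = {}
    for package, module_name in modules.items():
        try:
            importlib.import_module(module_name)
            status[package] = True
        except ImportError:
            status[package] = False
    return status


status = check_dependencies()
for package in sorted(status):
    print("%-14s %s" % (package, "ok" if status[package] else "MISSING"))
```

Any package reported as `MISSING` needs to be installed before running the notebooks.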

Reference

We would like to credit the tools we used to build our website: