/second_EDA_Project

After 4 weeks we started with a second EDA project. In this case we worked as a team.

Primary LanguageJupyter Notebook

Image source

Crowdfunding with kickstarter

Kickstarter is a US based global crowdfunding platform focused on bringing funding to creative projects. Since the platform’s launch in 2009, the site has hosted over 189.131 successfully funded projects with over 5.316.550.434 $ and 18.623.561 million unique backers. Kickstarter uses an “all-or-nothing” funding system. This means that funds are only dispersed for projects that meet the original funding goal set by the creator.

Unfortunately, not every campaign comes out as a winner.

This is an exploratory data analysis of the kickstarter projects dataset for predicting if a kickstarter project is going to be successful or not.

Data & SetUp

The dataset for this project was given by our head coaches. 55 csv files were loaded and combined into a single data frame using the pd.read_csv function in python. The raw dataset could be found here.

Get a feeling for my data & business understanding

First of all we startedt to develope a rare understanding for the data & business case. For example we checked the shape, null values und the columns. To get an better feeling, we took our time to check each column and tried to figure out the meaning. After that we started with the online research to get some background information about kickstarter.

Data Preparation & Exploration

In the second step it took quiet a while to explore all our features and our target variable (state: successful or failed). In this notebook you can find everything in detail.

Modelling & Hyperparameter Optimization

Before we started with different models a feature selection was performed to select the predictors which play a significant role in explaining if a project is going to be successful or not. In this notbook you can find two different models (logistic regression and random forest) as our basemodel. And in this notebook you can find different models (AdaBoost, XGBoost. Random Forest, Logistic Regression and SVM on our final model. For each method, we tweaked the value of parameters to get the best set out of it.

Future work

  • text length and typos of name, projectinfo (blurb) and slug
  • check the medium (photo)
  • check the city (location)
  • further hyperparameter tuning

Alright...let's dive in!