DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.
Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
- How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
- How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
- How to focus volunteer time on the applications that need the most assistance
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
The train.csv data set provided by DonorsChoose contains the following features:
Feature | Description |
---|---|
project_id | A unique identifier for the proposed project. Example: p036502 |
project_title | Title of the project. |
project_grade_category | Grade level of the students the project targets. Takes one of a fixed set of enumerated values. |
project_subject_categories | One or more (comma-separated) subject categories for the project, drawn from an enumerated list of values. |
school_state | State where the school is located (two-letter U.S. postal code). Example: WY |
project_subject_subcategories | One or more (comma-separated) subject subcategories for the project. |
project_resource_summary | An explanation of the resources needed for the project. |
project_essay_1 | First application essay* |
project_essay_2 | Second application essay* |
project_essay_3 | Third application essay* |
project_essay_4 | Fourth application essay* |
project_submitted_datetime | Datetime when the project application was submitted. Example: 2016-04-28 12:43:56.245 |
teacher_id | A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56 |
teacher_prefix | Teacher's title. Takes one of a fixed set of enumerated values. |
teacher_number_of_previously_posted_projects | Number of project applications previously submitted by the same teacher. Example: 2 |
* See the section Notes on the Essay Data for more details about these features.
Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:
Feature | Description |
---|---|
id | A project_id value from the train.csv file. Example: p036502 |
description | Description of the resource. Example: Tenor Saxophone Reeds, Box of 25 |
quantity | Quantity of the resource required. Example: 3 |
price | Price of the resource required. Example: 9.95 |
Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so you can use it as a key to retrieve all resources needed for a project.
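As a minimal sketch of that lookup, the snippet below uses pandas on two tiny in-memory frames that stand in for train.csv and resources.csv. Only the column names come from the tables above; the second project id, the extra resource rows, and the aggregate column names (`cost`, `total_cost`, `total_quantity`) are made up for illustration, and the aggregation is one plausible choice, not necessarily the one used in this project.

```python
import pandas as pd

# Toy stand-ins for train.csv and resources.csv (values are illustrative only).
train = pd.DataFrame({
    "project_id": ["p036502", "p000001"],
    "teacher_number_of_previously_posted_projects": [2, 0],
})
resources = pd.DataFrame({
    "id": ["p036502", "p036502", "p000001"],
    "description": ["Tenor Saxophone Reeds, Box of 25", "Music stand", "Pencils"],
    "quantity": [3, 1, 10],
    "price": [9.95, 20.00, 0.50],
})

# All resource rows that belong to a single project.
project_resources = resources[resources["id"] == "p036502"]

# One row per project: total cost (price * quantity) and total quantity.
resources["cost"] = resources["price"] * resources["quantity"]
per_project = resources.groupby("id", as_index=False).agg(
    total_cost=("cost", "sum"),
    total_quantity=("quantity", "sum"),
)

# A left join keeps every project, even one without any resource rows.
train = train.merge(
    per_project, left_on="project_id", right_on="id", how="left"
).drop(columns="id")
print(train)
```

Aggregating before the merge keeps one row per project, which is what a model trained on train.csv expects.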
The data set contains the following label (the value you will attempt to predict):
Label | Description |
---|---|
project_is_approved | A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved, and a value of 1 indicates the project was approved. |
- school_state
- teacher_prefix
- project_grade_category
- clean_categories
- clean_subcategories
We use one-hot encoded vectors to represent all the categorical features.
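The sketch below shows one way to one-hot encode a categorical column, assuming scikit-learn's `OneHotEncoder`; the write-up does not say which encoder was actually used. The state values other than WY are placeholders.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy school_state values; one column, one row per project.
train_states = np.array([["WY"], ["CA"], ["WY"], ["TX"]])
test_states = np.array([["CA"], ["NY"]])  # "NY" never appears in the train split

# handle_unknown="ignore" encodes categories unseen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
train_ohe = encoder.fit_transform(train_states).toarray()
test_ohe = encoder.transform(test_states).toarray()

print(encoder.categories_[0])  # categories learned from the train split
print(train_ohe.shape)         # one column per learned category
print(test_ohe)
```

Fitting on the train split only and reusing the fitted encoder on the test split keeps the column layout identical across both.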
- price
- teacher_number_of_previously_posted_projects
Because we train a Decision Tree model, we did not scale or otherwise transform the numerical features: tree splits depend only on the ordering of feature values, so the scale of a numerical feature does not affect the result.
- Title
- Essay
For the title and essay we used two different vectorization methods:
- TF-IDF vectorization
  - We only consider words that appear in at least 10 essays, to reduce the number of dimensions.
  - After fitting on the train data, each essay was represented by 16,623 dimensions.
- TF-IDF weighted W2V
  - We used pretrained GloVe vectors (300 dimensions) trained on a very large corpus.
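Both vectorizations can be sketched on a toy corpus. The snippet assumes scikit-learn's `TfidfVectorizer`; the three-dimensional `word_vectors` table is a made-up stand-in for the 300-dimensional GloVe vectors, `min_df=2` replaces the threshold of 10 so that the toy corpus keeps some vocabulary, and `tfidf_weighted_vector` is a hypothetical helper, not code from this project.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

essays = [
    "students need books",
    "students need tablets",
    "books help students read",
]

# Made-up 3-d vectors standing in for pretrained 300-d GloVe vectors.
word_vectors = {
    "students": np.array([1.0, 0.0, 0.0]),
    "need": np.array([0.0, 1.0, 0.0]),
    "books": np.array([0.0, 0.0, 1.0]),
}

# Plain TF-IDF: min_df drops words that occur in fewer than min_df documents.
tfidf = TfidfVectorizer(min_df=2)
essay_tfidf = tfidf.fit_transform(essays)
idf = {word: tfidf.idf_[idx] for word, idx in tfidf.vocabulary_.items()}


def tfidf_weighted_vector(text, dim=3):
    """Average the word vectors of `text`, weighting each word by tf * idf."""
    words = text.split()
    total = np.zeros(dim)
    weight_sum = 0.0
    for word, count in Counter(words).items():
        # Skip words without a pretrained vector or outside the TF-IDF vocabulary.
        if word in word_vectors and word in idf:
            weight = idf[word] * (count / len(words))
            total += weight * word_vectors[word]
            weight_sum += weight
    return total / weight_sum if weight_sum > 0 else total


essay_w2v = np.vstack([tfidf_weighted_vector(e) for e in essays])
print(essay_tfidf.shape, essay_w2v.shape)
```

The TF-IDF matrix grows with the vocabulary, while the weighted W2V representation always has the dimensionality of the word vectors.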
Vectorizer | Model | Best hyperparameters | AUC |
---|---|---|---|
BOW | Naive Bayes | alpha: 0.1 | 0.6299 |
TFIDF | Naive Bayes | alpha: 0.0001 | 0.5220 |
TFIDF | Decision Tree | max_depth: 10, min_samples_split: 500 | 0.6486 |
TFIDF_weighted W2V | Decision Tree | max_depth: 10, min_samples_split: 500 | 0.6374 |
TFIDF | DT with nonzero feature importance | max_depth: 10, min_samples_split: 500 | 0.6496 |
TFIDF | GBDT | max_depth: 3, n_estimators: 100 | 0.7263 |
TFIDF_weighted W2V | GBDT | max_depth: 3, n_estimators: 100 | 0.7137 |
W2V, custom trainable embedding layer for categorical features | Model 1 LSTM | lr: 0.001, Adam, batch size: 256, epochs: 20 | 0.7243 |
Removed low and high IDF words, W2V, custom trainable embedding layer for categorical features | Model 2 LSTM | lr: 0.001, Adam, batch size: 256, epochs: 20 | 0.7385 |
W2V, one-hot encoding for categorical features | Model 3 LSTM | lr: 0.001, Adam, batch size: 512, epochs: 20 | 0.7584 |
- Trained a Decision Tree model on a single dataset that combines all the numerical, categorical, and text features.
- Used GridSearchCV to tune the max_depth and min_samples_split hyperparameters.
- After training the first Decision Tree, I trained a second one using only the features with nonzero feature importance.
- After training, I analysed the false-positive test data points using word clouds and price distributions.
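The nonzero-importance step can be sketched as follows, assuming scikit-learn's `DecisionTreeClassifier`. The synthetic matrix from `make_classification` is only a stand-in for the combined feature matrix; the hyperparameters are the ones reported in the results table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the combined numerical + categorical + text features.
X, y = make_classification(
    n_samples=5000, n_features=50, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hyperparameters as reported in the results table.
params = dict(max_depth=10, min_samples_split=500, random_state=0)

full_tree = DecisionTreeClassifier(**params).fit(X_train, y_train)

# Keep only the columns the first tree actually split on.
keep = np.flatnonzero(full_tree.feature_importances_ > 0)

reduced_tree = DecisionTreeClassifier(**params).fit(X_train[:, keep], y_train)

auc_full = roc_auc_score(y_test, full_tree.predict_proba(X_test)[:, 1])
auc_reduced = roc_auc_score(y_test, reduced_tree.predict_proba(X_test[:, keep])[:, 1])
print(f"{len(keep)} of {X.shape[1]} features kept")
print(f"AUC full: {auc_full:.4f}, AUC reduced: {auc_reduced:.4f}")
```

A feature with zero importance was never used in a split, so dropping it mainly shrinks the input rather than changing what the tree can learn.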
- Trained a Gradient Boosted Decision Tree model on a single dataset that combines all the numerical, categorical, and text features.
- Used GridSearchCV to tune the max_depth and n_estimators (number of trees) hyperparameters.
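A minimal sketch of this search, assuming scikit-learn's `GradientBoostingClassifier` (the write-up does not name the GBDT implementation). The data is synthetic and the grid values are hypothetical; only the selected values, max_depth 3 and n_estimators 100, are reported in the results table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the combined feature matrix.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# Hypothetical grid, kept small so the example runs quickly.
param_grid = {"max_depth": [2, 3], "n_estimators": [50, 100]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # AUC is the metric reported in the results table
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

`best_score_` is the mean cross-validated AUC of the best combination, so it is not directly comparable to a held-out test AUC.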
- I trained three different LSTM-based models. All of them use GloVe-based embedding layers, and the essay embedding layer is non-trainable.
- For Model 1, I mapped every categorical column to its own custom trainable embedding layer.
- For Model 2, I removed words with low and high IDF values from the text data and used the same architecture as Model 1.
- For Model 3, I one-hot encoded every categorical column and used a Conv1D layer combined with LSTM cells.
The preprocessing for Model 2 went as follows:
1. Fit a TF-IDF vectorizer on the train data.
2. Get the IDF value for each word in the train data.
3. Analyse the IDF values and choose a low and a high threshold based on them, because very frequent words and very rare words carry little information.
4. Remove the words with low and high IDF values from the train and test data: go through each sentence and keep only the words whose IDF falls within the chosen range.
5. Tokenize the modified text data in the same way as for the previous model.
6. Create the embedding matrix for Model 2, and use the rest of the features as in the previous model.
7. Define, compile, and fit the model (same architecture as Model 1).
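The IDF-based filtering described above can be sketched on a toy corpus, assuming scikit-learn's `TfidfVectorizer`. The thresholds `low` and `high` and the helper `filter_by_idf` are hypothetical: the write-up chose its thresholds by inspecting the IDF distribution and does not report the values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "the students need new books",
    "the students need tablets",
    "the class needs books for the reading corner",
    "the students love reading",
]
test_texts = ["the students need microscopes"]

# Fit on the train data only and read the IDF value of every word.
tfidf = TfidfVectorizer()
tfidf.fit(train_texts)
idf = {word: tfidf.idf_[idx] for word, idx in tfidf.vocabulary_.items()}

# Hypothetical thresholds picked for this toy corpus.
low, high = 1.1, 1.9
keep = {word for word, value in idf.items() if low <= value <= high}


def filter_by_idf(text):
    """Keep only the words whose train-set IDF lies within [low, high]."""
    return " ".join(word for word in text.split() if word in keep)


train_filtered = [filter_by_idf(t) for t in train_texts]
test_filtered = [filter_by_idf(t) for t in test_texts]
print(sorted(keep))
print(train_filtered)
print(test_filtered)
```

Words that appear only in the test data have no train-set IDF, so this sketch drops them along with the very frequent and very rare words.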