We are a group of consultants for the City of Chicago and Chicago Department of Public Health (CDPH). Our task is to:
- predict when and where different species of mosquitoes will test positive for West Nile Virus; and
- Investigate if the effect of spray could be optimized.
In 2002, the first human case of West Nile Virus were reported in Chicago. By 2004 the City of Chicago and CDPH had established a comprehensive surveillance and control program that is still in effect today.
20% who are infected with the virus develop symptoms ranging from a persistent fever, to serious neurological illnesses that can result in death.
Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.
A more accurate method of predicting outbreaks of West Nile virus in mosquitoes will aid the City of Chicago and CDPH more efficiently and effectively allocate resources towards the prevention of the virus.
train.csv
,test.csv
: the training and test set of the main dataset. The training set consists of data from 2007, 2009, 2011, and 2013, while in the test set we are requested to predict the test results for 2008, 2010, 2012, and 2014.
Features | Description |
---|---|
Id | the id of the record |
Date | date that the WNV test is performed |
Address | approximate address of the location of trap. This is used to send to the GeoCoder |
Species | the species of mosquitos |
Block | block number of address |
Street | street name |
Trap | Id of the trap |
AddressNumberAndStreet | approximate address returned from GeoCoder |
Latitude, Longitude | Latitude and Longitude returned from GeoCoder |
AddressAccuracy | accuracy returned from GeoCoder |
NumMosquitos | number of mosquitoes caught in this trap |
WnvPresent | whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present. |
spray.csv
: GIS data of spraying efforts in 2011 and 2013
Features | Description |
---|---|
Date, Time | the date and time of the spray |
Latitude, Longitude | the Latitude and Longitude of the spray |
-
weather.csv
: weather data from 2007 to 2014. Column descriptions innoaa_weather_qclcd_documentation.pdf
. -
sampleSubmission.csv
: a sample submission file in the correct format
The follow steps were taken:
- [Data Cleaning / Preprocessing / Feature Engineering](## Data Cleaning / Preprocessing / Feature Engineering)
- [Exploratory Data Analysis](## Exploratory Data Analysis)
- [Hyperparameter Tuning / Modelling](## Hyperparameter Tuning / Modelling)
- [Results](## Results)
- [Conclusions and Recommendations](## Conclusions and Recommendations)
The following steps was performed on all datasets:
- check for null and missing values; and
- set dates as index, and sorted by their dates
- CountVectorise
species
feature - include
unspecified
for training dataset - Split
train.csv
andtest.csv
based on weather stations
- Replace 1609 missing values in
CodeSum
feature with np.nan, then fillna(‘M’) for count vectorization - Used
count vectorization
to separate combinations, and account for individual codes within combinations - Duplicate sunrise, sunset values from station 1 to station 2
- Split
weather.csv
into station 1 and station 2
- Created new feature
spray
using spray dates and locations. - For test set,
spray
feature was assigned zero values to signify no spray (no spray data for test set dates)
All the datasets were then merged together using train/test date index since train/test date range is narrower.
- Separate dates into new
day
,month
,year
features - Add new dates features to both train and test sets to make sure features match
test.csv
has traps not found intrain.csv
. Added traps to train data, to match test data.- SMOTE the training data to account for data imbalance
Five types of classifiers were used in this project, followed by stacking of the top 3 models. Through GridSearch, the models were tuned with certain hyperparameters.
- Random Forest
- Gradient Boosting
- XGBoost
- LGBM
- Neural Network
- Stacking (XG Boost, LGBM, Gradient Boost)
Model | Hyperparameters | Test Accuracy |
---|---|---|
XGboost | Learning_rate = 0.2, Objective=binary:logistic, Max_depth = 4 , N_estimators = 300, tree_method = hist | 0.929 |
LGBM | objective = binary, N_estimators = 500, learning_rate = 0.01, num_leaves = 50, max_depth = 35, reg_alpha = 0.3 | 0.914 |
Gradient Boost | n_estimators = 400, max_depth = 10, min_samples_split = 2, max_features = log2, learning_rate = 0.15 | 0.931 |
Random Forest | n_estimators = 1600, max_depth = 10, min_samples_split = 10, min_samples_leaf = 5 | 0.788 |
Neural Network | Input layer = 100 neurons, activation ‘relu’, hidden layer = 1 (100 neurons, activation ‘relu’), Output activation ‘sigmoid’, Optimizer : Adam, Loss: binary_crossentropy | 0.873 |
Stacking | K-folds = 2, Meta Model = Logistic Regression | 0.933 |
We have determined that the accuracy in prediciting the 'True Positive' is the most important factor and would prefer a high recall and F1 score for positive class. Hence, the we altered the threshold to increase the recall and F1 score.
Metrics | Treshold = 0.5 | Treshold = 0.1 | Changes |
---|---|---|---|
Accuracy | 0.93 | 0.89 | -4.3% |
F1 Score | 0.28 | 0.3 | 7% |
Recall | 0.26 | 0.46 | 77% |
Based on our findings, we determined that the 3 most important features for the analysis are:
-
- Temperature
- Sunrise/Sunset
- Standard Pressure
-
- Latitude
- Longitude
-
- Culex Restuans
- Culex Pipiens
The sprays were conducted in August 2013 (during the peak month) but they were found to be ineffective. Mosquitoes tend to be most active in the early morning and this also means that mosquitoes tend to feed on birds, which is the main source of transmission.
To optimise the spraying time, we can alter the spraying period from August to June / July instead as these months have longer days and earlier sunrise.
In order to optimise the spraying location, we can target locations present with high clusters of West Nile Virus. The top 10% of the clustered locations are found to have four times more chances of positive detection that randomly selecting traps locations.
This means that there is also a cost savings of about 75% (assuming spraying is effective).
The spike of the number of mosquito does not match with any event of spray, we also cannot observe any significant improvement on the weeks after spraying. Besides that, our team will need more data for more detail analysis in the future especially the actual data for the next 12 months after spraying. Hence, we concluded that the effect of spray in 2011 and 2013 are inconculsive.
The number of mosquitos per year in the dataset is not constant, as there are more traps placed in 2007 and 2013 than 2009 and 2011. This would give an impression that there are more mosqutios caught in those years. The spraying is also not done at the same location and time, which would not be able to give a more comparsion between the years of whether spraying is effective or not.
In future, we encourage the data collection team to collect more data so that we can generate more insights on the result of spraying. E.g.: at least 12-24 months of data after the spray. It will also be good to understand the reason to why less traps were placed in 2009 and 2011, as this would give a better understanding of the number of mosquitos in those years. As the number of mosquitos is affected by the number of traps placed.
However, we are able to provide ideas to target the West Niles Virus base on the trends and data science analysis from our model.
- The weather does affect the number of mosquitos, as the number of mosquitos is mainly affected by the temperatures, humidity and the amount of day time.
- By looking at the period of coverage of the spray, it was pretty accurate on the peak period of WNV (July to September).But it is still less effective agaisnt the West Nile Virus.
- As from the historical data, the everage sunrise time in June and July are the earliest and the daytime are the longest.
- Our team's hypothesis would be:
- The mosquito mostly feed on bird during that period due to less human active in earlier daylight
- The peak of transmission of virus within the mosquito happen during June to July
- While the data shows the mosquito are usually active in the morning (Sunrise)
- Hence, our team suggest to shift the spray period to June - August
- According to our lift curve in section (4.2 Model Evaluation), we have 4 times effective spray at the top 10% location than randomly spray.
- By targetting on the top 10% location,, the cost can reduced to 0.25X.
- This will result in cost optimisation about 75%.
The insecticide can be genetically engineered to target the particular species that have the highest possibility of present the virus. The species is: Culex Restuans/Pipiens. In fact West Nile virus (WNV) is a mosquito-borne virus that is mainly transmitted among birds by Culex pipiens mosquitoes. The cost saved from spraying on the more critical location can be used in R&D to develope a more effective insecticide
Lastly, the base boosting models: Gradient Boost and Adaboost could be the best fit for this data science problem. Stacking of models is not worth the engineering resources due to time taken and the insignificant improvement.