Stat897 – Applied Data Mining & Statistical Learning

Team Project

This project is to be completed in your group. The project is due Monday, August 7.

The report is to be submitted in PDF format only. You must also submit an .xls/.xlsx file containing your test set classifications and predictions.

A charitable organization wishes to develop a data-mining model to improve the cost-effectiveness of its direct marketing campaigns to previous donors. According to its recent mailing records, the typical overall response rate is 10%. Out of those who respond (donate), the average donation is $14.50. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $2 to produce and send. Since the expected profit from each mailing is 14.5 * 0.1 – 2 = –$0.55, it is not cost-effective to mail everyone. We would like to develop a classification model, using data from the most recent campaign, that can effectively capture likely donors so that the expected net profit is maximized.

The entire dataset consists of 3984 training observations, 2018 validation observations, and 2007 test observations. Weighted sampling has been used, overrepresenting the responders, so that the training and validation samples have approximately equal numbers of donors and non-donors. The test sample has the more typical 10% response rate. We would also like to build a model to predict donation amounts for donors; the data for this will consist of the records for donors only. The data are available in the file “charity.csv” and include the following variables:

  • ID number [Do NOT use this as a predictor variable in any models]
  • REG1, REG2, REG3, REG4: Region (There are five geographic regions; only four are needed for analysis since if a potential donor falls into none of the four he or she must be in the other region. Inclusion of all five indicator variables would be redundant and cause some modeling techniques to fail. A “1” indicates the potential donor belongs to this region.)
  • HOME: (1 = homeowner, 0 = not a homeowner)
  • CHLD: Number of children
  • HINC: Household income (7 categories)
  • GENF: Gender (0 = Male, 1 = Female)
  • WRAT: Wealth Rating (Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0-9, with 9 being the highest wealth group and 0 being the lowest.)
  • AVHV: Average Home Value in potential donor's neighborhood in $ thousands
  • INCM: Median Family Income in potential donor's neighborhood in $ thousands
  • INCA: Average Family Income in potential donor's neighborhood in $ thousands
  • PLOW: Percent categorized as “low income” in potential donor's neighborhood
  • NPRO: Lifetime number of promotions received to date
  • TGIF: Dollar amount of lifetime gifts to date
  • LGIF: Dollar amount of largest gift to date
  • RGIF: Dollar amount of most recent gift
  • TDON: Number of months since last donation
  • TLAG: Number of months between first and second gift
  • AGIF: Average dollar amount of gifts to date
  • DONR: Classification Response Variable (1 = Donor, 0 = Non-donor)
  • DAMT: Prediction Response Variable (Donation Amount in $).

The DONR and DAMT variables are set to “NA” for the test set. Use the guidelines provided in the R script file “TeamProjectEx.R” to fulfill the following requirements.
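As a starting point, the minimal sketch below reads the data and splits it by partition. It assumes charity.csv contains a partition indicator column (called "part" here, with values "train", "valid", and "test") and uses the variable names as listed above; check both assumptions against TeamProjectEx.R and the actual file before relying on it.

    # Sketch: read the data and split it into training, validation, and test sets.
    # Assumes a partition column named "part"; column-name case may differ in charity.csv.
    charity <- read.csv("charity.csv")

    data.train <- charity[charity$part == "train", ]
    data.valid <- charity[charity$part == "valid", ]
    data.test  <- charity[charity$part == "test", ]

    # DONR and DAMT are NA in the test partition; ID must never be used as a predictor.
    str(charity)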

Project Requirements

[Note: To help you with coding, sample code is provided in TeamProjectEx.R. You may want to modify it or write your own code to meet the project requirements; you may not be able to use the code as is.]

  1. Develop a classification model for the DONR variable using any of the variables as predictors (except ID and DAMT). Fit all candidate models using the training data and evaluate the fitted models using the validation data. Use “maximum profit” as the evaluation criterion and use your final selected classification model to classify DONR responses in the test dataset (the R script file “TeamProjectEx.R” provides details).
  2. Develop a prediction model for the DAMT variable using any (or all) of the variables as predictors (except ID and DONR). Use only the data records for which DONR = 1. Fit all candidate models using the training data and evaluate the fitted models using the validation data. Use “mean prediction error” as the evaluation criterion (a rough sketch of this calculation appears after this list) and use your final selected prediction model to predict DAMT responses in the test dataset (the R script file “TeamProjectEx.R” provides details).
  3. Save your test set classifications and predictions into a .xls/.xlsx file and one person from the team should submit this by the project deadline. Your test set classifications and predictions will be compared with the actual test set values of DONR and DAMT.
  4. One person from the team should also submit the project report by the project deadline.
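For requirement 2, the sketch below illustrates one way to compute a mean prediction error on the validation donors, taken here as the mean squared difference between observed and predicted DAMT. The model, formula, and object names (model.lm, data.train.damt, and so on) are placeholders rather than the required approach, and the exact definition of “mean prediction error” should be taken from TeamProjectEx.R.

    # Sketch: evaluate a DAMT prediction model by mean (squared) prediction error.
    # Uses the data.train / data.valid objects from the earlier loading sketch;
    # "part" is the assumed partition column and is dropped along with ID and DONR.
    data.train.damt <- subset(data.train, DONR == 1, select = -c(ID, DONR, part))
    data.valid.damt <- subset(data.valid, DONR == 1, select = -c(ID, DONR, part))

    model.lm <- lm(DAMT ~ ., data = data.train.damt)      # illustrative linear regression

    pred.valid <- predict(model.lm, newdata = data.valid.damt)
    mpe <- mean((data.valid.damt$DAMT - pred.valid)^2)    # mean squared prediction error
    se  <- sd((data.valid.damt$DAMT - pred.valid)^2) / sqrt(nrow(data.valid.damt))
    c(MPE = mpe, SE = se)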

Write up your results in a professional report

  • The report should be no more than 10 single-spaced pages long.
  • It should include all substantive details of your analyses (the key word here is “substantive”).
  • The report should have sections (e.g., Introduction, Analysis, Results, Conclusion) and provide sufficient details that anyone with a reasonable statistics background could understand exactly what you’ve done.
  • Feel free to briefly mention any exploratory aspects from your analyses, but do not devote a lot of space to discussions of dead-ends, pursuit of unproductive ideas, coding problems, etc.
  • Consider using tables and figures to enhance your report.
  • Do not embed R code in the body of your report; instead attach the code in an appendix. The appendix does not count towards the page limit.

Grading criteria (out of 30)

  • 12 marks based on the profit you achieve for your classification model on the test set.
  • 12 marks based on the mean prediction error you achieve for your prediction model on the test set.
  • 6 marks for the quality of your report (including: clarity of writing, organization, and layout; appropriate use of tables and figures; careful proof-reading to minimize (not necessarily eliminate) typos, incorrect spelling, and grammatical errors; adherence to report guidelines above).

Hints

  1. Start by running the code in the R script file “TeamProjectEx.R.” Then adapt the code to build your own models. The script file just includes linear discriminant analysis, logistic regression, and linear regression, but you should consider applying as many of the techniques we’ve covered in class as you can.

  2. Feel free to use any transformations of the predictor variables; some purely illustrative examples are included in the R script file, and a small sketch follows this list. However, do not transform either DONR or DAMT. You can use any transformations you can think of for any of the predictors:

    • Sometimes predictor transformations can be suggested by thinking about the underlying data. For example, one rationale for trying a quadratic transformation in a linear regression model is if you believe there is a possibility that the association between the response and that predictor (controlling for all the other included predictors) is non-linear. Perhaps average gifts tend to increase with donor's incomes but then level off when incomes are very high? Similar arguments can be made for classification models. Another type of transformation that may be suggested by the application at hand is indicator variables for certain "interesting" quantitative predictor variable values (e.g., if observations with X1=0 behave differently to observations with X1>0, then an indicator variable that is 1 for observations with X1=0 and 0 for observations with X1>0 may be helpful).
    • Other times, transformations are tried in a more exploratory, ad-hoc way. For example, a log transformation is often used for highly skewed variables (although sometimes the log transformation is "too strong" and a square root transformation may be better, while at other times the log transformation is "not strong enough" and a reciprocal or negative square root transformation may be better).
    • You can, if you wish, try either of these approaches for this project. You may not have enough time to be fully comprehensive in trying every possibility you can think of, so you may have to be selective and allocate your time carefully (just as in any real-world project, where time and cost constraints always limit what you can do).
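    A small, purely illustrative sketch of both approaches follows; the new column names (log.avhv, chld.zero, and so on) are made up for illustration, and whatever transformations you create for the training data must also be created, identically, for the validation and test data.

        # Sketch: example predictor transformations (derived column names are illustrative).
        data.train$log.avhv  <- log(data.train$AVHV)               # log for a right-skewed dollar amount
        data.train$sqrt.tgif <- sqrt(data.train$TGIF)              # milder alternative when the log is "too strong"
        data.train$incm.sq   <- data.train$INCM^2                  # quadratic term to allow curvature
        data.train$chld.zero <- as.numeric(data.train$CHLD == 0)   # indicator: no children vs. one or more
        # Repeat the same assignments for data.valid and data.test.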
  3. It is worth spending some time seeing if there are any unimportant predictor terms that are merely adding noise to the predictions, thereby harming the ability of the model to predict test data. Simplifying your model by removing such terms can bring model improvements.
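    One way to look for such noise terms is sketched below using AIC-based stepwise selection on the illustrative linear model from the earlier prediction sketch; this is only one option (regularized methods such as the lasso are a natural alternative), and any simplified model should be re-checked on the validation data before being adopted.

        # Sketch: drop uninformative terms via stepwise selection on AIC.
        model.step <- step(model.lm, direction = "both", trace = 0)
        summary(model.step)

        # Re-evaluate the simplified model on the validation data.
        pred.step <- predict(model.step, newdata = data.valid.damt)
        mean((data.valid.damt$DAMT - pred.step)^2)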

  4. To calculate profit for a particular classification model applied to the validation data, remember that each donor donates $14.50 on average and each mailing costs $2. So, to find an “ordered profit function” (ordered from most likely donor to least likely):

    • Calculate the posterior probabilities for the validation dataset;
    • Sort DONR in order of the posterior probabilities from highest to lowest;
    • Calculate the cumulative sum of (14.5 * DONR – 2) as you go down the list;
    • Find the maximum of this cumulative profit function (a rough sketch follows below).

    The R script file “TeamProjectEx.R” describes how to do this.
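    The steps above could be coded roughly as follows; post.valid is a placeholder for the posterior probabilities of DONR = 1 produced by whichever classifier you fit on the training data (for example, predict(fit, data.valid, type = "response") for a logistic regression).

        # Sketch: ordered profit function on the validation data.
        ord    <- order(post.valid, decreasing = TRUE)       # most likely donors first
        profit <- cumsum(14.5 * data.valid$DONR[ord] - 2)    # cumulative profit down the ordered list

        plot(profit, type = "l", xlab = "Number of mailings", ylab = "Cumulative profit ($)")
        max(profit)                         # maximum profit achievable on the validation data
        n.mail.valid <- which.max(profit)   # number of mailings that achieves it
        n.mail.valid / nrow(data.valid)     # the corresponding optimal validation mailing rate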
  5. To classify DONR responses in the test dataset you need to account for the “weighted sampling” (sometimes called “over-sampling”). Since the validation data response rate is 0.5 but the test data response rate is 0.1, the optimal mailing rate in the validation data needs to be adjusted before you apply it to the test data. Suppose the optimal validation mailing rate (corresponding to the maximum profit) is 0.7:

    • Adjust this mailing rate using 0.7/(0.5/0.1) = 0.14;
    • Adjust the “non-mailing rate” using (1–0.7)/((1–0.5)/(1–0.1)) = 0.54;
    • Scale the mailing rate so that it is a proportion: 0.14/(0.14+0.54) = 0.206.

    The optimal test mailing rate is thus 0.206. The R script file “TeamProjectEx.R” provides full details of how to do this adjustment. You may find the following link helpful: http://blog.data-miners.com/2009/09/ (If copying and pasting this link is unsuccessful, please type it in. It is a valid link.)
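    In code, the adjustment in this worked example might look like the following; n.mail.valid comes from the ordered-profit sketch above, post.test stands for your classifier's posterior probabilities on the test data, and the 0.5 and 0.1 response rates are those given in the project description.

        # Sketch: adjust the optimal validation mailing rate for the weighted sampling.
        valid.rate <- n.mail.valid / nrow(data.valid)    # e.g., 0.7 in the example above
        mail.adj   <- valid.rate * (0.1 / 0.5)           # 0.7 / (0.5 / 0.1) = 0.14
        nomail.adj <- (1 - valid.rate) * (0.9 / 0.5)     # (1 - 0.7) / ((1 - 0.5) / (1 - 0.1)) = 0.54
        test.rate  <- mail.adj / (mail.adj + nomail.adj) # 0.14 / (0.14 + 0.54) = 0.206

        # Mail to the top test.rate proportion of the test set, ranked by posterior probability.
        n.mail.test <- round(test.rate * nrow(data.test))
        cutoff      <- sort(post.test, decreasing = TRUE)[n.mail.test]
        chat.test   <- as.numeric(post.test >= cutoff)   # 1 = mail (classified as donor), 0 = do not mail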

  6. Remember that this is a team project, so work as a team. Everyone in the team will receive the same grade for the project unless it is brought to my attention that someone has tried to get a free ride, in which case that person's grade may be reduced.