Folder | Description |
---|---|
Code | Python code related to the analysis |
Data | Data used in the analysis |
Images | .PNG files used in the final report |
Reports | Documentation/write-up of the final analysis |
For the final project, we will be analyzing a dataset from a nonprofit dedicated to rescuing homeless, neglected, and abandoned animals from euthanasia in kill shelters and getting them adopted into their forever homes. The nonprofit educates the community and all pet parents on responsible pet parenting, including the importance of spay/neuter, obedience training, and good nutrition.
Successful adoptions of cats and dogs depend on a variety of factors, on the part of both the pet and the family. The nonprofit helps bridge the gap, building strong relationships between the two. The nonprofit rescues hundreds of animals every year, provides them with loving temporary care, and finds them well-matched, carefully screened forever homes.
One of the struggles that the nonprofit runs into is pets being returned after adoption. Although it adopts out ~2000 dogs per year, about 10% are returned for a variety of reasons. The nonprofit assiduously tracks information on all of its adoptions and returns. Using that data, we will apply analytical methods learned in Computational Data Analysis to help the nonprofit predict whether a dog that it adopts out will be returned.
The project will be overseen by the Program Manager for Volunteers and Data Integrity at the nonprofit. She will provide us with the data needed for this analysis. The data will come from two official sources at the nonprofit, both of which contain records from the past decade and can be easily linked by a unique ID field for each dog represented in each dataset. The two main data sources are:
Dog List | A list of every dog that has been adopted out by the nonprofit over the past 10 years. This list includes a variety of features describing each dog, where each row corresponds to a dog being adopted out, and each column represents a different attribute related to that dog. This dataset is composed of 10 spreadsheets, one for every year over the past decade. Each spreadsheet has records of ~2000 adoptions per year. The attributes in this dataset are as follows:
Field | Description |
---|---|
Dog Name | Name of the adopted dog |
ID | Unique ID corresponding to that dog |
Link | Link to dog’s profile on the nonprofit's site |
Foster/Boarding | Type of adoption for the dog (short vs. long-term) |
Sex | Gender of dog |
Age | Estimated age (sometimes a range if actual age not known) of dog at time of adoption |
Weight | Estimated weight of dog at time of adoption |
Breed Mixes | Type of dog |
Color | Color of dog |
Behavioral Notes | Free text field describing the behavior of the dog. Entries are somewhat consistent, depending on the individual filling in this field, and certain key words recur to describe the dog’s behavior around others |
Dogs in Home | Number of other dogs in adopter’s house, if any at all |
Cats in Home | Number of cats in adopter’s house, if any at all |
Kids | Number of kids in adopter’s house, if any at all |
BS/W | Indicator for whether or not the dog is part pitbull or other ‘bully’ breed |
Medical Notes | Free text field describing any health conditions of the dog, if any exist at all |
Transport Date | Date of adoption |
Returns List | A list of every dog that’s been returned after being adopted out by the nonprofit. This list also includes a variety of features describing each dog, where each row corresponds to a dog being returned, and each column represents a different attribute related to that dog. This dataset is composed of 10 spreadsheets, one for every year over the past decade. Each spreadsheet has records of ~200 returns. The attributes in this dataset are as follows:
Field | Description |
---|---|
Dog Name | Name of the adopted dog |
ID | Unique ID corresponding to that dog |
Dog Info | Free text field with general information on the dog |
Reason for Return | Free text field describing the reason the dog was returned by the adopters. There’s some consistency in this field, but it generally depends on the person entering the information into the spreadsheet |
Behavior with Dogs | Free text field describing the dog’s general behavior around other dogs |
Behavior with Kids | Free text field describing the dog’s general behavior around kids |
Behavior with Cats | Free text field describing the dog’s general behavior around cats |
Energy Level | General energy of the dog |
Socialization/Daycare | Categorical field for whether the dog was in a daytime program for dogs |
Vetting | Categorical field for whether the dog was being taken to the vet |
Date of Adoption | Date dog was adopted |
Previous Return? | Boolean for whether the dog has been returned in the past |
Previous Return Info | Free text field describing why the dog had been returned in the past |
Date of Return | Date dog was returned |
Type | Boolean for dog vs. puppy |
The goal of this project is to help keep dogs adopted by identifying dogs that are at risk of being returned. We will do this by building classifier models that classify the adopted dogs on whether they are returned vs. stay adopted, using the historical data we have about each dog.
To begin, we will have to process the data, as it is not fully prepared for analysis. Currently, the data is kept in a separate spreadsheet for each year (e.g., adoptions in 2021 are in ‘Dog List 2021’), so we’ll have to programmatically combine all of our datasets. On top of that, the Dog and Returns Lists are kept as two separate data sources but share a linking variable, ‘ID’, for dogs that were returned. We’ll use a left join from the Dog List to the Returns List to find dogs that were returned; any dog without a match in the join will be flagged as ‘Not Returned’.
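The combine-and-join step could be sketched as follows. This is a minimal illustration using pandas, not the project's actual code: the tiny in-memory frames stand in for the yearly spreadsheets, and the helper `combine_years`, the `source_year` column, and the `returned` column are hypothetical names. Keeping only one Returns List row per ID is also an assumption, made so that a dog returned more than once does not duplicate its adoption row.

```python
import pandas as pd

# Hypothetical stand-ins for the yearly spreadsheets; in practice each frame
# would be loaded from that year's file.
dog_lists = {
    2020: pd.DataFrame({"ID": [101, 102], "Dog Name": ["Biscuit", "Juno"]}),
    2021: pd.DataFrame({"ID": [103, 104], "Dog Name": ["Mochi", "Rex"]}),
}
returns_lists = {
    2021: pd.DataFrame(
        {"ID": [102, 104], "Reason for Return": ["too energetic", "landlord"]}
    ),
}


def combine_years(frames_by_year):
    """Stack the yearly frames into one frame, recording the source year."""
    parts = [df.assign(source_year=year) for year, df in frames_by_year.items()]
    return pd.concat(parts, ignore_index=True)


dogs = combine_years(dog_lists)
returns = combine_years(returns_lists)

# Assumption: keep one return record per dog so the join stays one row per adoption.
returns_one_per_dog = returns[["ID", "Reason for Return"]].drop_duplicates(subset="ID")

# The left join keeps every adoption; indicator=True adds a _merge column that
# says whether a matching return record was found.
merged = dogs.merge(returns_one_per_dog, on="ID", how="left", indicator=True)
merged["returned"] = (merged["_merge"] == "both").map(
    {True: "Returned", False: "Not Returned"}
)
merged = merged.drop(columns="_merge")
```

Using the merge indicator rather than checking for missing values in a Returns List column avoids mislabeling a returned dog whose return record happens to have blank fields.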
There are also some issues with the features provided in the data. For example, the ‘age’ feature is not standardized (it is written in weeks, months, or years) and is sometimes even written as an interval if the dog’s age is unknown, so we will need to convert every age to a single unit. Additionally, some factors are described only in free text. For example, high-energy dogs have descriptions such as “walks not enough, needs play”, so we will attempt to identify those characteristics by searching the text for key phrases and encoding them as binary variables. We will also need to read in features like ‘breed’ as categorical (factor) variables so that the models interpret them correctly. Following this initial preparation, we will standardize the numerical data using a Python data scaling package so that features measured on different scales, such as weight and age, do not carry disproportionate influence.
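A rough sketch of the age and free-text cleanup described above, using only the standard library. The input formats, the choice of months as the common unit, the use of a range's midpoint, and the keyword list are all assumptions for illustration; the real fields would need to be inspected before settling on any of them.

```python
import re

# Assumed conversion factors to months.
MONTHS_PER_UNIT = {"week": 12 / 52, "month": 1.0, "year": 12.0}

# Matches entries such as "8 weeks", "6 months", "1.5 years", or "2-3 years".
AGE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:-|to)?\s*(\d+(?:\.\d+)?)?\s*(week|month|year)s?",
    re.IGNORECASE,
)


def age_in_months(text):
    """Convert a free-form age to months; a range is replaced by its midpoint."""
    match = AGE_PATTERN.search(text or "")
    if match is None:
        return None  # leave unparseable entries for manual review
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    unit = match.group(3).lower()
    return round((low + high) / 2 * MONTHS_PER_UNIT[unit], 2)


# Hypothetical key phrases; a real list would come from reading the notes.
HIGH_ENERGY_TERMS = ("high energy", "needs play", "walks not enough")


def flag_high_energy(note):
    """Return 1 if any high-energy phrase appears in the note, else 0."""
    lowered = (note or "").lower()
    return int(any(term in lowered for term in HIGH_ENERGY_TERMS))
```

For example, `age_in_months("2-3 years")` gives `30.0`, and `flag_high_energy("Walks not enough, needs play")` gives `1`.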
After preparing the data, we will analyze it using the various classification models that we have learned so far in class including, but not limited to, Naïve Bayes, K Nearest Neighbors, SVM, Logistic Regression, and Neural Networks. Because we are comparing multiple models, we will split the data into training, validation, and test sets. By testing each of these methods, we hope to identify which one is most successful in classifying the adopted dogs as returned vs. not returned.
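One way the split and model comparison might look with scikit-learn is sketched below. The data is synthetic, the column names (`age_months`, `weight_lbs`, `breed`) and the 60/20/20 split are assumptions, and only two of the listed model families are shown to keep the example short. It follows one common convention: models are compared on the validation set, and the test set is left untouched for a final estimate.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic placeholder data; names and values are illustrative only.
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "age_months": rng.uniform(2, 120, n),
    "weight_lbs": rng.uniform(5, 90, n),
    "breed": rng.choice(["lab mix", "terrier mix", "shepherd mix"], n),
})
data["returned"] = (rng.random(n) < 0.1).astype(int)  # roughly 10% returns

X = data.drop(columns="returned")
y = data["returned"]

# 60% train, 20% validation, 20% test, stratified so each split keeps a
# similar share of returned dogs.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)


def make_preprocessor():
    """Scale numeric columns and one-hot encode the categorical column."""
    return ColumnTransformer([
        ("num", StandardScaler(), ["age_months", "weight_lbs"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["breed"]),
    ])


candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

val_scores = {}
for name, model in candidates.items():
    pipeline = Pipeline([("preprocess", make_preprocessor()), ("model", model)])
    pipeline.fit(X_train, y_train)  # scaler and encoder are fit on training data only
    val_scores[name] = f1_score(y_val, pipeline.predict(X_val), zero_division=0)
```

Putting the scaler and encoder inside the pipeline means they are fit on the training split alone, so no information from the validation or test dogs leaks into preprocessing.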
To evaluate our models, we will look at their classification and misclassification rates using a confusion matrix, as well as their precision, recall, and F1 scores. These metrics are essential for judging a classification model, which is why we will use them to evaluate ours. We will compare results on the validation and test sets to determine which model is the most accurate. We’ll also take note of how efficiently each model runs.
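The sketch below shows how these metrics follow from the four confusion-matrix counts, and why accuracy alone can mislead when only about 10% of dogs are returned. The labels are made up for illustration, and the helper names `confusion_counts` and `summarize` are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = returned, 0 = stayed adopted (10 dogs, 1 returned).
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A "model" that never predicts a return.
baseline = summarize(y_true, [0] * 10)

# A model that flags two dogs, one of them correctly.
model = summarize(y_true, [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
```

Both sets of predictions reach 0.9 accuracy, but the baseline has a recall of 0.0 (it misses the returned dog), while the second model has a recall of 1.0 and a precision of 0.5. This is why precision, recall, and F1 are reported alongside the overall classification rate.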
If our models are accurate, they can be integrated into the nonprofit’s systems to classify new dogs that come into the nonprofit. If a dog is classified into the return group, the nonprofit can ensure that the dog is well matched with its new adopter, and reach out to the adopter to provide additional support to prevent the dog from being returned. We plan to construct this project so that the nonprofit’s data team can continue to use it, helping to place more rescued dogs in loving homes and keep them there.