Alphabet Soup's business team wants to predict where to make investments. The purpose of this project is to use machine learning and neural networks on the provided dataset to build a binary classifier that predicts whether applicants will be successful if funded by Alphabet Soup. The dataset contains 34,000 organizations that have received Alphabet Soup funding. First, for Deliverable 1, the data was loaded into a dataframe and the target and feature variables for the model were identified. The data was then preprocessed to remove unnecessary columns, and the unique values of each column were analyzed to determine which columns could benefit from "binning". The categorical variables were then encoded using one-hot encoding and placed in a new dataframe, which was merged with the original dataframe. For Deliverable 2, a deep learning neural network model was compiled, trained, and evaluated on the new dataframe. Lastly, for Deliverable 3, several optimization techniques were applied to the model to try to reach a higher accuracy. The results of the various techniques are discussed below.
- What variables are considered the target(s) for your model?
The target variable was IS_SUCCESSFUL. Ultimately, we want to measure how accurately the model predicts which applicants funded by Alphabet Soup will be successful.
- What variables are considered to be the features of your model?
The variables that were considered to be the features of our model were: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, USE_CASE, ORGANIZATION, INCOME_AMT, SPECIAL_CONSIDERATIONS.
- What variables are neither the targets nor features, and should be removed from the input data?
Initially, EIN and NAME were removed from the input data as variables that would neither be considered targets nor features.
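The preprocessing described above (dropping identifier columns, binning rare categories, and one-hot encoding) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows, the `bin_rare` helper, and the `min_count` cutoff are all made up for the example, and `pd.get_dummies` is used here as a shortcut in place of whichever encoder the notebook uses.

```python
import pandas as pd

# Tiny made-up sample that only mimics some of the column names above.
df = pd.DataFrame({
    "EIN": [1, 2, 3, 4, 5, 6],
    "NAME": ["A", "B", "C", "D", "E", "F"],
    "APPLICATION_TYPE": ["T3", "T3", "T3", "T4", "T4", "T9"],
    "CLASSIFICATION": ["C1000", "C1000", "C2000", "C2000", "C7000", "C1000"],
    "IS_SUCCESSFUL": [1, 0, 1, 1, 0, 0],
})

# Drop identifier columns that are neither targets nor features.
df = df.drop(columns=["EIN", "NAME"])


def bin_rare(series, min_count, other_label="Other"):
    """Hypothetical helper: replace categories seen fewer than
    min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)


# The cutoff of 2 is arbitrary and only suits this toy sample.
df["APPLICATION_TYPE"] = bin_rare(df["APPLICATION_TYPE"], min_count=2)
df["CLASSIFICATION"] = bin_rare(df["CLASSIFICATION"], min_count=2)

# One-hot encode the categorical columns.
encoded = pd.get_dummies(
    df, columns=["APPLICATION_TYPE", "CLASSIFICATION"], dtype=int
)

# Split into target and features.
y = encoded["IS_SUCCESSFUL"]
X = encoded.drop(columns=["IS_SUCCESSFUL"])
print(X.columns.tolist())
```

On the real data, the cutoff for each column would be chosen by inspecting its `value_counts()` output, as described in the overview.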
- How many neurons, layers, and activation functions did you select for your neural network model and why?
Initially, I included 2 hidden layers, with 10 neurons in the first layer and 5 neurons in the second. ReLU was used as the activation function for both hidden layers, and sigmoid was used as the activation function for the output layer. I chose these settings initially because they were similar to those used in the training module.
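As a rough sketch, the initial architecture described above could be defined in Keras like this. The number of input features (`n_features = 40`), the `binary_crossentropy` loss, and the `adam` optimizer are assumptions made for illustration; the write-up does not state them.

```python
import tensorflow as tf

# Hypothetical feature count; the real value depends on how many
# columns the one-hot encoding produces.
n_features = 40

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    # First hidden layer: 10 neurons, ReLU activation.
    tf.keras.layers.Dense(10, activation="relu"),
    # Second hidden layer: 5 neurons, ReLU activation.
    tf.keras.layers.Dense(5, activation="relu"),
    # Output layer: a single sigmoid unit for binary classification.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Loss and optimizer are assumed here, not taken from the notebook.
model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)
model.summary()
```

Adding neurons or layers, or changing the activation function, then amounts to editing the `Dense` entries in this list.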
- Were you able to achieve the target model performance?
No. The accuracy was around 72%-73%, which is below the target model performance of 75%.
- What steps did you take to try and increase the model performance?
I took several steps to try and increase the model performance. These included:
- Adjusting the input data to remove additional unnecessary variables.
- Adding more neurons.
- Adding more layers.
- Changing the activation function.
I tried many combinations of these, as well as some steps not listed, such as adjusting the number and sizes of some of the bins. This repository includes an example of the final code, with at least 3 attempted optimization techniques, in the file AlphabetSoupCharity_Optimization.ipynb.
The evaluation results for this notebook were similar to those of all the other attempts.
Overall, I was unable to reach the desired performance of 75%, even after attempting several optimization techniques. In summary, I do not think this model can reach higher than 72%-73%. It seems as though the model is overfitted. Perhaps a different supervised machine learning technique could provide better performance. We have compared the Random Forest technique to deep learning, with results favoring Random Forest, especially due to its speed. My suggestion would be to try a Random Forest model. Additionally, I have tried exploring scikit-learn's make_moons function and KerasTuner to find optimal parameters using dummy data, but have not been able to improve performance. I do not believe I am applying them correctly. This may be worth looking into further.
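A minimal sketch of the suggested Random Forest approach is shown below. It runs on synthetic data from scikit-learn's `make_classification`, not on the Alphabet Soup dataset, so the numbers it prints say nothing about the charity data. The sample size, feature count, and `n_estimators=128` are arbitrary choices for illustration. Printing both training and test accuracy is one simple way to check for the overfitting suspected above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the encoded features and target.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Fit a Random Forest classifier (hyperparameters chosen arbitrarily).
rf = RandomForestClassifier(n_estimators=128, random_state=42)
rf.fit(X_train, y_train)

# A large gap between these two scores would suggest overfitting.
train_acc = accuracy_score(y_train, rf.predict(X_train))
test_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```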