Source code: Use the three Java files, which predict heart disease on the Cleveland dataset. Each file implements a different algorithm:
- Logistic regression: it does not work well because the dataset is categorical.
- Naive Bayes: it can predict the chance of heart disease, but the prediction accuracy is not very good.
- Random Forest: it works pretty well. You should find out why.
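As a starting point for that investigation, here is a minimal sketch of one-hot encoding in plain Java. It rests on an assumption that is not taken from the three provided files: that a linear model such as logistic regression reads a category code like 1..4 as a magnitude, whereas a tree-based model can split on individual values. The class name OneHotSketch, the helper oneHot, and the example column coded 1..4 are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of the provided source files.
class OneHotSketch {

    // Encodes an integer category code in [minCode, maxCode] as a 0/1 vector
    // with exactly one entry set to 1.0.
    static double[] oneHot(int code, int minCode, int maxCode) {
        if (code < minCode || code > maxCode) {
            throw new IllegalArgumentException("code out of range: " + code);
        }
        double[] vector = new double[maxCode - minCode + 1];
        vector[code - minCode] = 1.0;
        return vector;
    }

    public static void main(String[] args) {
        // Assumption: a categorical column whose values are coded 1..4.
        System.out.println(Arrays.toString(oneHot(3, 1, 4)));
        // prints [0.0, 0.0, 1.0, 0.0]
    }
}
```

Whether encoding the categorical columns this way actually narrows the gap between logistic regression and Random Forest is something to check experimentally rather than assume.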
You can refer to my book "Large Scale Machine Learning with Spark" at https://www.packtpub.com/big-data-and-business-intelligence/large-scale-machine-learning-spark
Finally, you should reuse the attached Maven-friendly pom.xml file for your project setup. You can also change the Spark version or other dependency versions if you want.
Dataset: Use the dataset named "processed_cleveland.data".
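Before feeding the file to any of the three classifiers, each row has to be turned into a feature vector and a label. The sketch below shows one way to do that in plain Java, without Spark. It assumes the usual layout of the processed Cleveland file: 14 comma-separated columns, 13 features followed by a severity value in the last column, with "?" marking a missing value. It also assumes you want a binary label (severity above 0 means disease). The class ClevelandRecord and its parse method are hypothetical names, not taken from the provided Java files, and the sample row is only an illustration of the assumed format.

```java
import java.util.Arrays;

// Hypothetical helper; not part of the provided source files.
class ClevelandRecord {
    // Assumed layout: 13 feature columns followed by 1 label column.
    static final int NUM_COLUMNS = 14;

    final double[] features;
    final int label; // assumed binarisation: 0 = no disease, 1 = disease

    ClevelandRecord(double[] features, int label) {
        this.features = features;
        this.label = label;
    }

    // Parses one comma-separated row. Returns null when the row has the wrong
    // number of columns, contains a missing value ("?"), or is not numeric.
    static ClevelandRecord parse(String line) {
        String[] tokens = line.trim().split(",");
        if (tokens.length != NUM_COLUMNS) {
            return null;
        }
        try {
            double[] features = new double[NUM_COLUMNS - 1];
            for (int i = 0; i < features.length; i++) {
                String token = tokens[i].trim();
                if (token.equals("?")) {
                    return null; // skip rows with missing values
                }
                features[i] = Double.parseDouble(token);
            }
            String last = tokens[NUM_COLUMNS - 1].trim();
            if (last.equals("?")) {
                return null;
            }
            int severity = (int) Double.parseDouble(last);
            return new ClevelandRecord(features, severity > 0 ? 1 : 0);
        } catch (NumberFormatException e) {
            return null; // treat non-numeric rows as unusable
        }
    }

    public static void main(String[] args) {
        // Illustrative row in the assumed format.
        ClevelandRecord record = parse(
            "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0");
        System.out.println(Arrays.toString(record.features) + " -> " + record.label);
    }
}
```

Dropping rows with missing values is the simplest choice here; imputing them instead is equally valid and may matter for accuracy on a dataset this small.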