Disease Prediction from Symptoms

License: MIT

A data mining application that predicts disease from symptom data (i.e., prognosis). To develop this application, we used the Columbia University dataset and built models with both the Multinomial Naive Bayes and Decision Tree algorithms to predict the disease from the symptoms observed in a person.


Columbia University Dataset

  • This dataset is a knowledge database of disease-symptom associations generated by an automated method based on information in textual discharge summaries of patients at New York-Presbyterian Hospital admitted during 2004. The dataset can be found here.

  • The first column shows the disease, the second shows the number of discharge summaries containing a positive and current mention of the disease, and the third shows the associated symptom.

[Image: sample of the dataset]


Tasks performed

  • Data extraction and cleaning: Basic cleaning, segmentation of columns, and string formatting were performed in Excel.
  • Data preprocessing: The preprocessing tasks performed include:
    • Spelling mistakes in the names of diseases and symptoms, and in their codes, were corrected
    • The codes assigned to diseases and symptoms were removed, as they were irrelevant to our task
    • A cumulative list of all symptoms was compiled
    • Each symptom was assigned a value of 0 or 1 for each disease, according to whether the symptom occurs with that disease
  • Data visualization: Built correlation heatmaps showing the relationships among the symptoms and among the diseases
  • Model building: Trained two algorithms on this dataset, a Multinomial Naive Bayes classifier and a Decision Tree, and compared their results to see which performed better.
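
The preprocessing and model-building steps above can be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: the toy `pairs` table, the column names `disease` and `symptom`, and the `predict` helper are all hypothetical, and scikit-learn and pandas are assumed to be the libraries in use.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy pairs standing in for the cleaned disease-symptom table.
pairs = pd.DataFrame({
    "disease": ["flu", "flu", "flu", "migraine", "migraine", "asthma", "asthma"],
    "symptom": ["fever", "cough", "headache", "headache", "nausea", "cough", "wheezing"],
})

# One row per disease, one 0/1 column per symptom in the cumulative symptom list.
X = (pd.crosstab(pairs["disease"], pairs["symptom"]) > 0).astype(int)
y = X.index.to_numpy()

# Fit both classifiers on the same 0/1 matrix so they can be compared.
models = {
    "naive_bayes": MultinomialNB().fit(X, y),
    "decision_tree": DecisionTreeClassifier(random_state=0).fit(X, y),
}

def predict(observed):
    """Encode a set of observed symptoms as a 0/1 row and query each model."""
    row = pd.DataFrame([[int(s in observed) for s in X.columns]], columns=X.columns)
    return {name: model.predict(row)[0] for name, model in models.items()}

predictions = predict({"cough", "wheezing"})
print(predictions)
```

In this sketch each disease contributes a single training row, which mirrors the 0/1 encoding described in the list; a real comparison of the two models would also need an evaluation step that is not shown here.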

Find the detailed documentation here.


Results

The results of all the tasks can be viewed by running this code in Google Colab or in the detailed documentation above.

The full decision tree is too large to include here, so only part of it is shown. The entire decision tree can be found here.

[Image: part of the decision tree]
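
For readers who want to inspect a large tree without rendering all of it, one option is to print a depth-limited text view. The sketch below assumes scikit-learn's `export_text`; the symptom names, the 0/1 matrix, and the disease labels are hypothetical placeholders, and this is not necessarily how the image in this README was produced.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical 0/1 symptom matrix: one row per disease, one column per symptom.
symptoms = ["cough", "fever", "headache", "nausea", "wheezing"]
X = [
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
y = ["asthma", "flu", "migraine"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# max_depth limits how many levels of the tree are printed as text rules.
partial_view = export_text(tree, feature_names=symptoms, max_depth=1)
print(partial_view)
```

Raising `max_depth` shows more of the tree; for a graphical rendering, `sklearn.tree.plot_tree` accepts a similar `max_depth` argument.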


Contributors

Mihir Gandhi - mihir-m-gandhi

Jasdeep Singh Grover - jasdeep100

Hardik Chodvadiya - willyhardik

Amit Dave - amitdave1998


License

This project is licensed under the MIT License; see the LICENSE file for details.