- Background
- Objective
- Tools and Packages
- Data Visualization
- Results
- References
- Challenges and Future Work
Machine learning is a subset of artificial intelligence that uses mathematical and statistical methods to identify patterns in data automatically. Many aspects of clinical practice lend themselves to computational tools that assess disease pathology, identify anomalies, triage critical patients, and perform various other tasks. The scope of this article, however, is limited to supervised learning, both to constrain the discussion with concrete examples and because supervised learning represents the majority of clinical machine learning research.
In the context of supervised machine learning, models are fit to data, thereby learning relationships between input features and output targets. Input data represent digital encodings of, for example, X-rays, lab tests, electrocardiograms, or various other clinical data streams. The output could be a diagnostic label, a region of interest, length of stay, etc. For pedagogical ease, throughout this article, the classification of lung nodules will be used as a reference example.
The inputs to this nodule classifier are computed tomography (CT) images, but other modalities could have been used (e.g., X-ray or ultrasound). Each input image is associated with a two-class binary label (i.e., 0 or 1, indicating the absence or presence of calcified nodules, respectively). There is nothing special about the binary label; in other clinical applications, the label could represent several discrete classes (e.g., different types of lung nodules or disease stages) or be a continuous output as in regression (e.g., length of hospital stay, lab tests with continuous ranges).
Once CT images and associated labels are sourced and validated, a model is trained to learn relations between the image features (e.g., edges, contours) and their binary class (i.e., a positive or negative finding). However, this trained model may also have learned idiosyncratic features specific to the provided image and label pairs, features which do not hold for other data from the same modality (in this case, CT images). This generalization brittleness occurs for many reasons, including equipment with different noise sources (across different manufacturers), out-of-calibration effects, selection bias, population differences, and many others. Building generalizable models is paramount in clinical research: after all, the radiologist who developed the training data and labels can go to another hospital and provide the same expertise, whereas a model that works at one medical center can fail at another. It therefore becomes key to understand the issues that might arise during the model training, validation, and testing processes.
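A standard way to probe for this kind of brittleness is to hold out validation and test sets that the model never sees during training. A minimal sketch with scikit-learn is shown below; the `images` and `labels` arrays are synthetic stand-ins, not the project's actual data.

```python
# Hypothetical sketch: splitting image data into train/validation/test sets
# so generalization can be measured on data the model never saw during fitting.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
images = rng.random((100, 64, 64))   # stand-in for 100 CT slices
labels = rng.integers(0, 2, 100)     # binary nodule labels (0 or 1)

# 70% train, then split the remaining 30% evenly into validation and test,
# stratifying so the class balance is preserved in every split
X_train, X_tmp, y_train, y_tmp = train_test_split(
    images, labels, test_size=0.3, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

Performance reported on `X_test` (never touched during training or tuning) is a far better proxy for how the model will behave at another medical center than training accuracy alone, though it still cannot capture cross-site shifts such as different scanner manufacturers.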
- To build disease classification models using deep neural networks and a Random Forest classifier
- To preprocess images using OpenCV (cv2) and improve model performance
- To integrate the trained models into a web app using Flask
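The Flask integration step can be sketched as follows. The `/predict` route, the commented-out `model.h5` path, and the hard-coded response are illustrative placeholders, not the project's actual endpoints or files.

```python
# Minimal sketch of serving a trained classifier behind a Flask endpoint.
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real app a trained model would be loaded once at startup, e.g.:
# model = tensorflow.keras.models.load_model("model.h5")

@app.route("/predict", methods=["POST"])
def predict():
    # An uploaded image would be read from `request`, preprocessed, and
    # passed to model.predict(...); a fixed response stands in here.
    return jsonify({"disease": "pneumonia", "probability": 0.83})

if __name__ == "__main__":
    app.run(debug=True)
```

Loading the model once at startup (rather than per request) keeps response latency low, since deserializing a saved network is far slower than a single forward pass.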
Task | Technique | Tools/Packages Used |
---|---|---|
Data Pre-processing and EDA | Image normalization, noise removal, COVID data set creation | cv2, shutil, sklearn, pandas, numpy |
Model Development | Feature selection, model selection, model construction, optimization, neural network tuning, performance evaluation | TensorFlow, xgboost, sklearn |
Data Visualization | Multi-attribute plots, heatmaps, correlation plots | matplotlib, seaborn |
Environments & Platforms | — | MS Excel, Jupyter Notebook, TensorFlow, PyCharm |
Output
Disease | Classifier Type | Accuracy |
---|---|---|
Pneumonia | CNN | 83.17% |
Heart Disease | XGBoost | 86.96% |
Diabetes | Random Forest | 89.8% |
Alzheimer's | CNN | 83.54% |
Breast Cancer | Random Forest | 91.81% |
Brain Tumor | CNN, VGG16 | 96.5% |
COVID-19 | CNN | 93.5% |
Created seven disease classification models with TensorFlow, Random Forest, and XGBoost to analyse patients' medical records, achieving accuracies between 83% and 96.5%. Improved the accuracy of the deep neural networks by 30% with image data augmentation and transfer learning.
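The transfer-learning setup credited with those gains (a VGG16 base plus augmentation, as in the brain-tumor model) might look like the sketch below. The input size, layer widths, and `weights=None` (used here to avoid downloading ImageNet weights; `weights="imagenet"` would be used in practice) are assumptions.

```python
# Sketch of transfer learning with a frozen VGG16 base and data augmentation.
import tensorflow as tf

base = tf.keras.applications.VGG16(
    include_top=False, weights=None,  # weights="imagenet" in practice
    input_shape=(128, 128, 3))
base.trainable = False  # freeze the pretrained convolutional features

model = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # augmentation (train-time only)
    tf.keras.layers.RandomRotation(0.1),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary disease label
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

Freezing the base means only the small classification head is trained, which is what makes transfer learning effective on the modest data sets typical of medical imaging; augmentation layers are active only during training and pass inputs through unchanged at inference.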
Challenges: identifying a suitable package for tweet scraping and recognizing its extraction limits; long execution times and runtime errors caused by memory limitations during parts of the data modeling. Medical data are also difficult to come by; if more such databases were made public, researchers would have access to additional information for future work.