Early Prediction of Diabetes Using an Ensemble of Machine Learning Models

Diabetes is one of the fastest-growing diseases globally. It causes critical complications such as cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, which lead to increased morbidity and mortality. The severity of diabetes and its underlying risk factors can be reduced considerably if the disease is detected early. However, reliable and efficient diabetes prediction is a complicated task because labeled data are scarce and clinical datasets contain outliers and missing values. Therefore, in this article, we propose a new labeled diabetes dataset from a South Asian country (Bangladesh). Additionally, we propose an automated classification pipeline that introduces a weighted ensemble of several Machine Learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). The critical hyperparameters of these ML models are tuned using grid search. Missing value imputation, feature selection, and K-fold cross-validation are also incorporated into the framework. A statistical ANOVA test demonstrated that diabetes prediction performance increased considerably when the proposed weighted ensemble (DT+RF+XGB+LGB) was used with the introduced preprocessing, yielding the highest accuracy and Area Under the ROC Curve (AUC) of 0.735 and 0.832, respectively. Statistical imputation and RF-based feature selection, combined with the suggested ensemble model, yielded the best result for early diabetes prediction. Moreover, the new dataset will contribute to developing and implementing robust ML models for diabetes prediction using population-level data.

The overall workflow of this article is illustrated in the following figure (see below). It combines a preprocessing method with an ensemble ML classifier and hyperparameter optimization. The preprocessing includes Missing Value Imputation (MVI) and Feature Selection (FS). Additionally, K-fold cross-validation is applied to validate the proposed system's robustness by analyzing the inter-fold variations.
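
To make the idea of a weighted ensemble concrete, the sketch below combines per-model probabilities with a weighted average (often called weighted soft voting). This is only an illustration under stated assumptions: the text above does not specify how the ensemble weights are chosen or how the outputs are combined, so the function name `weighted_soft_vote`, the weights, the probabilities, and the 0.5 decision threshold are all hypothetical.

```python
from typing import Dict, List


def weighted_soft_vote(
    probs: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Return the weighted average of per-model 'Yes' probabilities per sample."""
    total_weight = sum(weights[name] for name in probs)
    n_samples = len(next(iter(probs.values())))
    combined = []
    for i in range(n_samples):
        score = sum(weights[name] * probs[name][i] for name in probs)
        combined.append(score / total_weight)
    return combined


# Hypothetical 'Yes' probabilities from four models for three respondents.
model_probs = {
    "DT": [0.20, 0.60, 0.90],
    "RF": [0.30, 0.55, 0.80],
    "XGB": [0.25, 0.70, 0.85],
    "LGB": [0.35, 0.65, 0.95],
}
# Assumed weights; in practice they could come from a validation metric.
model_weights = {"DT": 1.0, "RF": 2.0, "XGB": 3.0, "LGB": 2.0}

ensemble_probs = weighted_soft_vote(model_probs, model_weights)
# Assumed decision threshold of 0.5 for the 'Yes' class.
labels = ["Yes" if p >= 0.5 else "No" for p in ensemble_probs]
print(ensemble_probs, labels)
```

With fitted models, scikit-learn's `VotingClassifier` with `voting="soft"` and a `weights` argument offers a comparable mechanism; the pure-Python version here only shows the arithmetic.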

This study used the Bangladesh Demographic and Health Survey (BDHS) datasets from 2011 and 2017-18 (see details in the table below). The BDHS records data nationally on people's socioeconomic characteristics, demographics, and numerous health factors. Two-stage stratified cluster sampling was employed to select households, which were surveyed through face-to-face interviews by trained staff. We used a total of 5,223 respondents aged 35 years and above whose blood pressure and glucose level were tested in BDHS-2011. Furthermore, 12,119 respondents aged 18 years and above were used from the 2017-18 BDHS survey. We consolidated the two BDHS datasets to create a substantially larger sample to identify the risk factors for Diabetes Mellitus accurately.

Features	Different features with short descriptions	Categorical?	Continuous?	DDC-2011: χ2-test (p-value) or Mean ± std	DDC-2017: χ2-test (p-value) or Mean ± std
F1	Division (The respondents' residence place)	Yes	No	144.689 (0.000)	383.774 (0.000)
F2	Location of respondents' residence area (Urban/Rural)	Yes	No	463.00 (0.496)	93.958 (0.000)
F3	Wealth index (Respondent's financial situation)	Yes	No	16.104 (0.003)	482.139 (0.000)
F4	Household head's sex (Gender of the household head)	Yes	No	5.858 (0.016)	4.298 (0.117)
F5	Age of household members	No	Yes	54.87±12.94	39.53±16.21
F6	Respondent's current educational status	Yes	No	6.041 (0.110)	6.960 (0.541)
F7	Occupation type of the respondent	Yes	No	30.430 (0.063)	185.659 (0.000)
F8	Eaten anything	Yes	No	0.663 (0.416)	3.065 (0.216)
F9	Had caffeinated drink	Yes	No	1.590 (0.207)	20.738 (0.000)
F10	Smoked	Yes	No	0.001 (0.985)	7.781 (0.020)
F11	Average of systolic	No	Yes	77.59±12.05	122.63±21.95
F12	Average of diastolic	No	Yes	119.93±21.93	80.52±13.67
F13	Body Mass Index (BMI) for respondent	No	Yes	2065.63±369.25	2239.43±416.47
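
The χ2 entries in the table above pair a test statistic with a p-value in parentheses. The sketch below shows how such a pair could be computed for a categorical feature against the diabetes label. It assumes a Pearson χ2 test without continuity correction, which the table does not state, and the 2x2 counts are made up for illustration. The p-value helper covers only the one-degree-of-freedom case, where the upper tail equals `erfc(sqrt(stat / 2))`.

```python
import math
from typing import List


def chi2_statistic(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat: float) -> float:
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows = feature category, columns = diabetes 'No' / 'Yes'.
counts = [[10, 20], [20, 10]]
stat = chi2_statistic(counts)
print(round(stat, 3), round(p_value_df1(stat), 4))

# Statistic reported for F8 (DDC-2011) in the table, treated as 1 degree of freedom.
print(round(p_value_df1(0.663), 3))
```

Under the one-degree-of-freedom assumption, the statistic 0.663 reported for F8 in DDC-2011 gives a p-value of about 0.416, which agrees with the value in parentheses. Features with more than two categories have more degrees of freedom and need the general χ2 distribution, for example through `scipy.stats.chi2_contingency`.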

The BDHS program provided a biomarker questionnaire to collect information on the diagnosis and treatment of hypertension (HTN) and Diabetes Mellitus (DM). Following the measurement procedure recommended by the World Health Organization (WHO), these surveys recorded plasma glucose levels. Trained health technicians recorded DM data using the HemoCue Glucose 201 Analyzer. To categorize blood glucose levels, BDHS applied the WHO cut-off levels. A fasting blood glucose level >= 7.0 mmol/L indicated the presence of DM and was categorized as ‘Yes’. Prediabetes (PBG: 6.0-6.9 mmol/L with no medical care) and diabetes-free (PBG: <6.0 mmol/L) cases were identified according to the BDHS classification procedure and categorized as ‘No’. The categorical and continuous independent variables are listed in the table above. The covariates included in the study are the age of the respondent (continuous), sex (male or female), educational level (no formal education, up to primary, up to secondary, up to higher secondary), economic status (poorer, poor, middle, rich, richer), body mass index (continuous), occupation type (factory workers, beggars, boatmen, domestic servants, construction workers, brick breakers, road builders, rickshaw drivers, poultry raisers, cattle raisers, fishers, farmers and agricultural workers, retired person, religious leader, housewife, businessman, family welfare visitor, teacher, accountant, lawyer, dentist, nurse, doctor, tailor, carpenter, unemployed/student, and landowner), eating habit (specified, anything), drinking coffee (no or yes), place of residence (urban or rural), division (Barisal, Chittagong, Dhaka, Khulna, Rajshahi, Rangpur, Sylhet, Mymensingh), average diastolic blood pressure (continuous), and average systolic blood pressure (continuous).
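
The labeling rule described above can be sketched in a few lines. The function and constant names are hypothetical, and the sketch assumes a single fasting glucose reading in mmol/L per respondent; only the 7.0 mmol/L cut-off and the ‘Yes’/‘No’ categories come from the text.

```python
DM_THRESHOLD_MMOL_L = 7.0  # cut-off stated in the text above


def diabetes_label(fasting_glucose_mmol_l: float) -> str:
    """Map a fasting glucose reading (mmol/L) to the binary target label."""
    # Prediabetes (6.0-6.9) and diabetes-free (<6.0) readings both map to 'No'.
    return "Yes" if fasting_glucose_mmol_l >= DM_THRESHOLD_MMOL_L else "No"


# Hypothetical readings for three respondents.
readings = [5.2, 6.9, 7.0]
print([diabetes_label(r) for r in readings])
```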

Written by-

Md. Kamrul Hasan
Erasmus Scholar on Medical Imaging and Application (MAIA) [2017-2019] [http://maiamaster.udg.edu/]
Assistant Professor
Department of EEE, KUET, Khulna-9203, Bangladesh
For more details, write to me at kamruleeekuet@gmail.com
Google Scholar: https://scholar.google.com/citations?user=36WXELIAAAAJ&hl=en

kamruleee51/Diabetes-classification-dataset
