/Diabetes-classification-dataset


Early Prediction of Diabetes Using an Ensemble of Machine Learning Models

Diabetes is one of the fastest-growing diseases globally. It causes critical complications such as cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, which lead to increased morbidity and mortality. The severity and underlying risk factors of diabetes can be reduced considerably if the disease is detected early. However, reliable and efficient diabetes prediction remains difficult because labeled data are scarce and clinical datasets often contain outliers and missing values. Therefore, in this article, we propose a new labeled diabetes dataset from a South Asian country (Bangladesh). Additionally, we recommend an automated classification pipeline that introduces a weighted ensemble of several Machine Learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). The critical hyperparameters of these ML models are tuned using grid search. Missing value imputation, feature selection, and K-fold cross-validation are also incorporated into the framework. A statistical ANOVA test demonstrated that diabetes prediction performance increased considerably when the proposed weighted ensemble (DT+RF+XGB+LGB) was used with the introduced preprocessing, yielding the highest accuracy and Area Under the ROC Curve (AUC) of 0.735 and 0.832, respectively. Our statistical imputation and RF-based feature selection methods, combined with the suggested ensemble model, yielded the best result for early diabetes prediction. Moreover, the new dataset will contribute to developing and implementing robust ML models for diabetes prediction using population-level data.
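
To make the idea of a weighted ensemble concrete, here is a minimal sketch of weighted soft voting over the four models named above (DT, RF, XGB, LGB). It is an illustration under stated assumptions, not the article's implementation: the per-model probabilities, the weights, the 0.5 decision threshold, and the function name `weighted_soft_vote` are all hypothetical, and the article may derive its weights differently.

```python
from typing import Dict, List


def weighted_soft_vote(
    probas: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Combine per-model P(diabetes) into one score per respondent.

    Weights are normalized to sum to 1, so the result stays in [0, 1].
    """
    total = sum(weights[name] for name in probas)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(next(iter(probas.values())))
    combined = [0.0] * n_samples
    for name, model_probas in probas.items():
        if len(model_probas) != n_samples:
            raise ValueError(f"model {name!r} has a different number of samples")
        share = weights[name] / total
        for i, p in enumerate(model_probas):
            combined[i] += share * p
    return combined


# Hypothetical predicted probabilities for three respondents (not real results).
probas = {
    "DT": [0.20, 0.60, 0.90],
    "RF": [0.10, 0.70, 0.80],
    "XGB": [0.30, 0.50, 0.70],
    "LGB": [0.20, 0.60, 0.80],
}
# Illustrative weights only; how the article sets them is not assumed here.
weights = {"DT": 1.0, "RF": 2.0, "XGB": 2.0, "LGB": 1.0}

scores = weighted_soft_vote(probas, weights)
labels = ["Yes" if s >= 0.5 else "No" for s in scores]  # assumed threshold
```

Normalizing the weights keeps the combined score interpretable as a probability, so the same threshold can be applied regardless of how the raw weights are scaled.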

The overall workflow of this article is illustrated in the figure below. It combines a preprocessing method with an ensemble ML classifier and hyperparameter optimization. The preprocessing includes Missing Value Imputation (MVI) and Feature Selection (FS) schemes. Additionally, K-fold cross-validation is applied to validate the robustness of the proposed system by analyzing the inter-fold variations.
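
The validation part of this workflow can be sketched in plain Python as follows. This is a simplified illustration, not the article's code: the helper names are hypothetical, the median is only one possible "statistical" imputation, and the BMI values are made up. The sketch shows three ideas: splitting respondents into K folds, fitting the imputation value on the training fold only, and summarizing inter-fold variation as mean and standard deviation.

```python
import random
import statistics
from typing import List, Optional, Sequence, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices once and return k (train, validation) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def impute_with_train_median(
    values: Sequence[Optional[float]], train_idx: Sequence[int]
) -> List[float]:
    """Fill missing entries (None) with the median of observed training values."""
    observed = [values[i] for i in train_idx if values[i] is not None]
    if not observed:
        raise ValueError("no observed training values to impute from")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def inter_fold_summary(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across folds."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical continuous feature with missing entries (not BDHS data).
bmi = [21.5, None, 24.0, 27.3, None, 19.8, 23.1, 30.2, 22.4, 25.0]
filled_per_fold = [
    impute_with_train_median(bmi, train_idx)
    for train_idx, _ in k_fold_indices(len(bmi), k=5)
]
```

Computing the fill value from the training fold alone avoids leaking information from the validation fold into preprocessing; a small standard deviation across folds suggests the pipeline is not sensitive to the particular split.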

[Figure: Block diagram of the diabetes prediction workflow]

This study used the Bangladesh Demographic and Health Survey (BDHS) datasets from 2011 and 2017-18 (see the table below for details). The BDHS records nationwide data on people's socioeconomic characteristics, demographics, and numerous health factors. Data were collected from selected households using two-stage stratified cluster sampling, and respondents were surveyed through face-to-face interviews by trained staff. From BDHS-2011, we used a total of 5,223 respondents aged 35 years and above whose blood pressure and glucose level were tested. From the 2017-18 BDHS, we used 12,119 respondents aged 18 years and above. We consolidated the two BDHS datasets to create a substantially larger sample so that the risk factors for Diabetes Mellitus could be identified accurately.

| Feature | Description | Categorical? | Continuous? | DDC-2011: χ²-test (p-value) or mean ± std | DDC-2017: χ²-test (p-value) or mean ± std |
|---|---|---|---|---|---|
| F1 | Division (the respondent's place of residence) | Yes | No | 144.689 (0.000) | 383.774 (0.000) |
| F2 | Location of the respondent's residence area (urban/rural) | Yes | No | 463.00 (0.496) | 93.958 (0.000) |
| F3 | Wealth index (the respondent's financial situation) | Yes | No | 16.104 (0.003) | 482.139 (0.000) |
| F4 | Sex of the household head | Yes | No | 5.858 (0.016) | 4.298 (0.117) |
| F5 | Age of household members | No | Yes | 54.87±12.94 | 39.53±16.21 |
| F6 | Respondent's current educational status | Yes | No | 6.041 (0.110) | 6.960 (0.541) |
| F7 | Occupation type of the respondent | Yes | No | 30.430 (0.063) | 185.659 (0.000) |
| F8 | Eaten anything | Yes | No | 0.663 (0.416) | 3.065 (0.216) |
| F9 | Had caffeinated drink | Yes | No | 1.590 (0.207) | 20.738 (0.000) |
| F10 | Smoked | Yes | No | 0.001 (0.985) | 7.781 (0.020) |
| F11 | Average of systolic | No | Yes | 77.59±12.05 | 122.63±21.95 |
| F12 | Average of diastolic | No | Yes | 119.93±21.93 | 80.52±13.67 |
| F13 | Body Mass Index (BMI) of the respondent | No | Yes | 2065.63±369.25 | 2239.43±416.47 |
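
For readers unfamiliar with the χ² values listed for the categorical features, the sketch below computes a Pearson χ² statistic from a contingency table of one feature against the diabetes label. It assumes the reported values are plain Pearson statistics without continuity correction, which the source does not state; the function name and the counts are hypothetical and are not taken from the BDHS data.

```python
from typing import Sequence, Tuple


def chi_square_statistic(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Rows are feature categories, columns are outcome classes (e.g. No / Yes).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature and outcome.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                raise ValueError("expected count of zero; drop empty rows/columns")
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = urban / rural, columns = diabetic / non-diabetic.
example_table = [[30, 70], [20, 80]]
example_stat, example_dof = chi_square_statistic(example_table)
```

The p-value then follows from the χ² distribution with the returned degrees of freedom; a statistics library would normally supply that step.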

The BDHS program provided a biomarker questionnaire to collect information on the diagnosis and treatment of hypertension (HTN) and diabetes mellitus (DM). Following the measurement procedure recommended by the World Health Organization (WHO), these surveys recorded plasma glucose levels. Trained health technicians measured glucose with the HemoCue Glucose 201 Analyzer. BDHS applied WHO cut-off levels to categorize blood glucose: a fasting blood glucose level of >= 7.0 mmol/L indicates the existence of DM and is categorized as ‘Yes’. The prediabetes (PBG: 6.0-6.9 mmol/L with no medical care) and diabetes-free (PBG: <6.0 mmol/L) groups were defined according to the BDHS classification procedure and categorized as ‘No’.

The categorical and continuous independent variables are listed in the table above. The covariates included in the study are the age of the respondent (continuous), sex (male or female), educational level (no formal education, up to primary, up to secondary, up to higher secondary), economic status (poorer, poor, middle, rich, richer), body mass index (continuous), occupation type (factory workers, beggars, boatmen, domestic servants, construction workers, brick breakers, road builders, rickshaw drivers, poultry raisers, cattle raisers, fishers, farmers and agricultural workers, retired person, religious leader, housewife, businessman, family welfare visitor, teacher, accountant, lawyer, dentist, nurse, doctor, tailor, carpenter, unemployed/student, and landowner), eating habit (specified, anything), drinking coffee (no or yes), place of residence (urban or rural), division (Barisal, Chittagong, Dhaka, Khulna, Rajshahi, Rangpur, Sylhet, Mymensingh), average diastolic pressure (continuous), and average systolic pressure (continuous).
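
The labeling rule described above can be sketched as a small function. This is an illustration of the stated cut-offs only: the function names are hypothetical, readings between 6.9 and 7.0 mmol/L are assumed to fall in the prediabetes band, and the "no medical care" condition is not modeled because the source does not say how treatment status is encoded.

```python
def glucose_category(fasting_glucose_mmol_l: float) -> str:
    """Map a fasting blood glucose reading (mmol/L) to a glucose category."""
    if fasting_glucose_mmol_l <= 0:
        raise ValueError("glucose level must be positive")
    if fasting_glucose_mmol_l >= 7.0:
        return "diabetes"
    if fasting_glucose_mmol_l >= 6.0:
        return "prediabetes"  # assumed to cover everything from 6.0 up to 7.0
    return "diabetes-free"


def diabetes_label(fasting_glucose_mmol_l: float) -> str:
    """Binary target: 'Yes' for diabetes, 'No' for prediabetes or diabetes-free."""
    return "Yes" if glucose_category(fasting_glucose_mmol_l) == "diabetes" else "No"


# Hypothetical readings, not BDHS records.
readings = [5.4, 6.5, 7.0, 9.2]
targets = [diabetes_label(r) for r in readings]
```

Keeping the three-way category separate from the binary label makes it easy to revisit how the prediabetes group is handled without changing the threshold logic.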

Written by-

Md. Kamrul Hasan
Erasmus Scholar on Medical Imaging and Application (MAIA) [2017-2019] [http://maiamaster.udg.edu/]
Assistant Professor
Department of EEE, KUET, Khulna-9203, Bangladesh
For more details, write to me at kamruleeekuet@gmail.com
Google Scholar: https://scholar.google.com/citations?user=36WXELIAAAAJ&hl=en