Bridging the Data Gap: An Experimental Investigation into Demographic Representation in Healthcare AI
This project investigates the integration of data quality and demographic inclusivity in the development of healthcare AI models. It centers on a systematic review of research presented at the 2023 MICCAI conference, a significant venue for advancements in medical imaging and AI.
- Examine and Annotate Datasets: Analyze the datasets used in selected studies from the conference to understand the extent of demographic data incorporation during model training.
- Identify Gaps and Biases: Explore how these models account for diverse datasets and demographic details, identify any prevalent gaps, and assess potential biases.
The ultimate goal of this research is to contribute to the development of equitable healthcare AI by ensuring that AI models are trained on data that adequately reflects the diversity of patient populations. By highlighting and addressing gaps in data inclusivity and quality, this thesis advocates for more inclusive and accurate healthcare AI applications.
This GitHub repository hosts the notebooks and tools developed as part of this thesis to automate the extraction, processing, and analysis of data from the MICCAI 2023 conference, aiding in the systematic review and providing a structured foundation for further research in this crucial area.
Details about the directory structure can be found in the following link:
To install the necessary Python packages, use the requirements.txt
file. Below is the content of requirements.txt
which lists all the packages used in the project:
# requirements.txt with the packages you used
pandas==1.5.3
numpy==1.26.4
et-xmlfile==1.1.0
grobid-client-python==0.0.8
lxml==5.2.1
openpyxl==3.1.2
plotly==5.20.0
PyPDF2==3.0.1
PyMuPDF==1.24.2
PyMuPDFb==1.24.1
regex==2023.12.25
seaborn==0.13.2
squarify==0.4.3
zipp==3.17.0
kaleido==0.2.1
To install these packages, run:
pip install -r requirements.txt
To use GROBID, follow these steps:
-
Download GROBID:
wget https://github.com/kermitt2/grobid/archive/0.8.0.zip
-
Unzip the downloaded file:
unzip 0.8.0.zip
-
Change to the GROBID directory:
cd grobid-0.8.0
-
Run GROBID:
./gradlew run
Below are the import statements used in the notebooks:
import fitz # PyMuPDF for PDF rendering and manipulation
import PyPDF2 # PyPDF2 for PDF file reading and manipulation
import zipfile
import requests
from bs4 import BeautifulSoup
import io
from io import BytesIO
import urllib.request
from urllib.request import urlopen
import os
import re
import sys
import textwrap
import subprocess
import regex as re
import numpy as np
import pandas as pd
from xml.etree import ElementTree as et
from lxml import etree
from collections import Counter
from grobid_client.grobid_client import GrobidClient
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import colormaps
import matplotlib.patches as mpatches
import matplotlib.ticker as ticker
import squarify
import seaborn as sns
import plotly.graph_objects as go
import plotly.io as pio
Uncomment the necessary installation commands if required:
# pip install PyPDF2
# pip install xhtml2pdf requests lxml
# pip install xhtml2pdf requests
# pip install lxml
# pip install beautifulsoup4
# pip install pandas
# pip install numpy
# pip install regex
# pip install zipfile
# pip install lxml
# pip install pandas openpyxl
# pip install beautifulsoup4 requests lxml regex
# pip install pandas openpyxl
# !pip install kaleido
# !pip install -U kaleido
This section outlines the objectives and key functionalities of each notebook included in this repository, specifically designed for processing data from the MICCAI 2023 conference.
Objective: Automate the extraction and processing of data from HTML documents.
- Features:
- Extract titles, authors, DOIs, and page numbers from HTML content.
- Organizes information into structured formats for further analysis.
- Links:
Objective: Divide multi-article PDF volumes into individual PDFs for each article.
- Features:
- Manually remove non-article pages before automated splitting based on specified page numbers.
- Link: Springer Link - MICCAI 2023 Volumes
Objective: Preprocess and analyze XML documents from the conference.
- Features:
- Extract structured information from raw XML data.
- Identify and extract key content such as headers and titles.
- Focus on cancer-related research.
- Aggregate data into structured dataframes for advanced analysis.
Objective: Extract relevant sentences from MICCAI 2023 research papers based on specific keywords.
- Features: Categorize sentences by paper titles, focusing on crucial information like demographics.
Objective: Conduct detailed analyses and annotations of medical imaging articles.
- Features:
- Extract and categorize data about organs, image types, and datasets.
- Analyze distribution across various demographic parameters.
- Evaluate geographical locations and disclosure status of datasets.
Details and data formats used for annotation in this project:
Guidelines and schemes applied for data annotation within the project: