This project aims to cluster de-identified radiology reports based on their similarity. Report similarity is defined in terms of their clinical findings. The main objective is to group similar reports together for further analytics, such as understanding the mix of findings for an individual radiologist or a practice. The project utilizes Natural Language Processing (NLP) techniques to preprocess the reports, compute similarities, and apply clustering algorithms.
-
Clone the Repository:
git clone https://github.com/emelalkim/radiology-report-clustering.git cd radiology-report-clustering
-
Install Dependencies:
pip install pandas scikit-learn nltk matplotlib pandas
or (for MAC python 3)
pip3 install pandas scikit-learn nltk matplotlib pandas
-
Ensure RadLex Terms CSV is Available: Make sure the
RADLEX_termsonly.csv
file is in the project directory.
-
Prepare the Data: Ensure you have the de-identified radiology reports in a CSV file (e.g.,
reports.csv
). Each report should have aclean_report
column containing the report text. -
Run the Script:
python similarity_analysis.py
-
Outputs:
- Clustering results:
clustering_results_dbscan.csv
- Cluster summary:
cluster_summary_dbscan.txt
- Clustering results:
similarity_analysis.py
: Main script to preprocess reports, compute similarities, find optimal DBSCAN parameters, and perform clustering.RADLEX_termsonly.csv
: CSV file containing RadLex terms and synonyms for medical term normalization.
-
Data Preparation:
- Extract the FINDINGS section from each radiology report.
- Normalize medical terms using RadLex dictionary.
- Tag measurements and preprocess text to remove stopwords and special characters.
-
Feature Extraction:
- Compute TF-IDF vectors for the preprocessed findings sections.
-
Similarity Calculation:
- Calculate the cosine distance matrix between TF-IDF vectors.
-
Parameter Optimization:
- Plot the k-distance graph to find the optimal epsilon value for DBSCAN. This method can be used to plot the graph to see where thw knee is.
-
Clustering:
- Apply DBSCAN using the optimal epsilon value and cosine distance matrix.
-
Results Analysis:
- Generate and save clustering results and summaries.
- Display samples of the most similar and most dissimilar reports.
- The script outputs the clustering results in
clustering_results_dbscan.csv
, showing which cluster each report belongs to. - A summary of each cluster's top keywords is saved in
cluster_summary_dbscan.txt
. - The script also prints the most similar and most dissimilar reports based on cosine similarity.
Contributions are welcome! Please fork the repository and submit a pull request with your improvements.
This project is licensed under the MIT License. See the LICENSE
file for details.