This repository contains the experiments for the paper "Investigating the Impact of Query Representation on Medical Information Retrieval", published at the European Conference on Information Retrieval (ECIR) 2023.
- [Folder Structure] Information related to the available code, topics, collections and query reformulations.
- [Describes Code] Medical Entities - Information Extraction Approaches.
- [Describes Code] Information Extraction & Query formulation.
- [Describes Code] Information Retrieval on TREC Clinical, Clinical Collection, TREC CDS Collections.
- [Information]
- experiments
- indices
- Add the indices of the TREC 2021 clinical collection (trec_clc folder) and of the other clinical collection (clinical folder). For the latter, refer to the paper for download details.
- The CDS collections are obtained through ir_datasets.
- [sub-folder] trec_clc
- [sub-folder] clinical
- medical_entity_extraction
- Contains the notebooks that extract the medical entities, using the various libraries.
- qrels
- Add the TREC 2021 clinical qrels (refer to the TREC 2021 Clinical Trials task).
- Add the qrels related to the other clinical collection (refer to the paper for download details).
- For the CDS collections, the qrels are obtained through ir_datasets.
- topics
- trec_clc
- topics2021.txt : Contains the TREC 2021 topics.
- extracted_med_entities
- The ''method.csv'' files with the extracted medical entities are saved here by code 2. The entities have already been extracted.
- chem_med7.csv : Entities extracted using Med7.
- dis_chem_bio_bert.csv : Entities extracted using BioBERT.
- dis_chem_scispacy.csv : Entities extracted using scispaCy.
- dis_chem_stanza.csv : Entities extracted using Stanza.
- reformulated_topics
- The experiment_name.csv files with the reformulated topics are saved here by code 3.
- Dataframe format: ['qid','query','experiment_name']
- experiment_name = {Q10bert_All_not_negated_NERs_keep_Sentences,...,bert_problems_treat_test_not_negated_Neg_BERT}
- cds_clinical
- topics-2014_2015-description.topics : Contains the topics associated with the TREC CDS collections and the other clinical collection.
- extracted_med_entities
- The ''method.csv'' files with the extracted medical entities are saved here by code 2. The entities have already been extracted.
- chem_med7.csv : Entities extracted using Med7.
- dis_chem_bio_bert.csv : Entities extracted using BioBERT.
- dis_chem_scispacy.csv : Entities extracted using scispaCy.
- dis_chem_stanza.csv : Entities extracted using Stanza.
- reformulated_topics
- The experiment_name.csv files with the reformulated topics are saved here by code 3.
- Dataframe format: ['qid','query','experiment_name']
- experiment_name = {Q10bert_All_not_negated_NERs_keep_Sentences,...,bert_problems_treat_test_not_negated_Neg_BERT}
- trec_clc
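The reformulated topic files described above can be inspected with a few lines of Python. The sketch below is only an illustration: it assumes each CSV starts with a header row whose first two columns are qid and query, and it does not assume anything about what the third column holds. The function name read_reformulated_topics and the sample rows are hypothetical and not part of this repository.

```python
import csv
import io


def read_reformulated_topics(handle):
    """Read a reformulated-topics CSV and return (column names, rows).

    Assumption: the file has a header row and its first two columns are
    'qid' and 'query', following the format listed in this README.
    """
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    if columns[:2] != ["qid", "query"]:
        raise ValueError(f"unexpected header: {columns}")
    return columns, list(reader)


# Hypothetical in-memory sample; real files live in the reformulated_topics folders.
sample = io.StringIO(
    "qid,query,experiment_name\n"
    "1,diabetes metformin,demo_experiment\n"
    "2,fever cough,demo_experiment\n"
)
columns, rows = read_reformulated_topics(sample)
print(columns)
print(rows[0]["qid"], rows[0]["query"])
```

To read a file from disk, pass an open file object instead of the in-memory sample, e.g. open(path, newline="", encoding="utf-8").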
To extract the medical entities for each topic, run the four notebooks in the medical_entity_extraction folder:
- BioBert_disease.ipynb
- medical_extraction_m7.ipynb
- medical_extraction_stanza.ipynb
- scispacy.ipynb
In general, the notebooks have the following inputs and output:
- A selected collection, among the four employed in this work.
- The original versions of the queries, i.e., topics2021.txt or topics-2014_2015-description.topics.
- Output: the extracted entities, saved in the extracted_med_entities folder.
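As a rough illustration of the input side, the sketch below loads topics into a dictionary before any entity extraction takes place. It assumes an XML layout with one topic element per query and a number attribute, which is an assumption about the topic files and may not match topics2021.txt or topics-2014_2015-description.topics exactly. The function name parse_topics and the sample topics are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_topics(xml_text):
    """Map each topic number to its whitespace-normalised text.

    Assumption: topics are stored as <topic number="..."> elements.
    """
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.iter("topic"):
        qid = topic.get("number")
        # Join all nested text and collapse line breaks and repeated spaces.
        topics[qid] = " ".join("".join(topic.itertext()).split())
    return topics


# Hypothetical sample, not taken from the real topic files.
sample = """<topics>
  <topic number="1">Patient is a 45-year-old man with a history of diabetes.</topic>
  <topic number="2">A 30-year-old woman
  presents with fever.</topic>
</topics>"""
topics = parse_topics(sample)
print(topics)
```

The resulting dictionary can then be passed topic by topic to whichever extraction library a notebook uses.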
The following notebook implements the information extraction approaches introduced in the paper. In addition, it allows all possible combinations of the methods to be applied.
- 1. Query_Reformulation_Techniques.ipynb
In general, it has the following inputs and output:
- A collection, among the four employed in this work.
- The original versions of the queries, i.e., topics2021.txt or topics-2014_2015-description.topics.
- Output: The reformulated_topics in csv format.
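To give a feel for what a reformulation step can look like, the sketch below keeps only the sentences of a topic that mention at least one extracted entity. This is a simplified, hypothetical illustration and not the method implemented in Query_Reformulation_Techniques.ipynb: it uses naive sentence splitting and substring matching, and it ignores negation. The function name and the sample data are assumptions.

```python
import re


def keep_sentences_with_entities(topic_text, entities):
    """Return the sentences of topic_text that mention any of the entities.

    Falls back to the full topic when no sentence matches, so the query
    is never empty.
    """
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", topic_text.strip())
    lowered = [entity.lower() for entity in entities if entity.strip()]
    kept = [s for s in sentences if any(e in s.lower() for e in lowered)]
    return " ".join(kept) if kept else topic_text.strip()


# Hypothetical topic and entity list.
topic = (
    "A 58-year-old woman presents with chest pain. "
    "She enjoys gardening. She takes aspirin daily."
)
entities = ["chest pain", "aspirin"]
reformulated = keep_sentences_with_entities(topic, entities)
print(reformulated)
```

In this sample, the sentence about gardening is dropped because it contains none of the listed entities.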
To run retrieval with the original topics or any reformulated topics, run the following notebooks after you have indexed the required document collections and obtained the qrels.
- TREC2021_Experiments.ipynb
- Clinical_Experiments.ipynb
- Clinical_Decision_Support_Track_2014_2015.ipynb
In general, the notebooks have the following inputs and output:
- A collection, among the four employed in this work.
- The original versions of the queries, one or more selected query variations, and the index.
- Performs information retrieval using PyTerrier.
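The retrieval step can be sketched roughly as follows. This is a minimal sketch under several assumptions: BM25 is used only as an example weighting model, the index path is supplied by the caller, and the helper names sanitize_query and retrieve are hypothetical. Queries are stripped of punctuation first, since characters such as colons or parentheses can be misread by Terrier's query parser. PyTerrier and pandas are imported inside retrieve so that the cleaning helper can be tried without them; the start-up and retriever calls may differ depending on the installed PyTerrier version.

```python
import re


def sanitize_query(text):
    """Replace every non-alphanumeric run with a single space."""
    return " ".join(re.sub(r"[^A-Za-z0-9]+", " ", text).split())


def retrieve(index_path, topics, num_results=1000):
    """Run BM25 over an existing Terrier index for (qid, query) pairs.

    Assumption: index_path points to an index built beforehand.
    """
    import pandas as pd
    import pyterrier as pt

    if not pt.started():
        pt.init()
    frame = pd.DataFrame(
        [{"qid": str(qid), "query": sanitize_query(query)} for qid, query in topics]
    )
    bm25 = pt.BatchRetrieve(index_path, wmodel="BM25", num_results=num_results)
    return bm25.transform(frame)


# The cleaning step alone, on a hypothetical query.
cleaned = sanitize_query("Chest pain (acute), age: 58?")
print(cleaned)
```

A call such as retrieve(index_path, [("1", "chest pain aspirin")]) would return a results dataframe, which can then be evaluated against the qrels.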
For any further information, send an email to georgios.peikos@unimib.it.
Please cite:
@inproceedings{peikos2023investigating,
  title={Investigating the Impact of Query Representation on Medical Information Retrieval},
  author={Peikos, Georgios and Alexander, Daria and Pasi, Gabriella and de Vries, Arjen P},
  booktitle={European Conference on Information Retrieval},
  pages={512--521},
  year={2023},
  organization={Springer}
}