Investigating the Impact of Query Representation on Medical Information Retrieval

This repository contains the experiments for the paper "Investigating the Impact of Query Representation on Medical Information Retrieval", published at the European Conference on Information Retrieval (ECIR) 2023.

[Folder Structure] Information related to the available code, topics, collections and query reformulations.
[Describes Code] Medical Entities - Information Extraction Approaches.
[Describes Code] Information Extraction & Query formulation.
[Describes Code] Information Retrieval on TREC Clinical, Clinical Collection, TREC CDS Collections.
[Information] Further Information and Citation.

1. Folder Structure

experiments
- indices
- medical_entity_extraction
- qrels
- topics
  - trec_clc
    - topics2021.txt : Contains the TREC 2021 topics.
    - extracted_med_entities
      - The method.csv files with the extracted medical entities are saved here by the notebooks described in section 2. The entities have already been extracted.
      - chem_med7.csv : Entities extracted using Med7.
      - dis_chem_bio_bert.csv : Entities extracted using BioBERT.
      - dis_chem_scispacy.csv : Entities extracted using scispaCy.
      - dis_chem_stanza.csv : Entities extracted using Stanza.
    - reformulated_topics
      - The experiment_name.csv files with the reformulated topics are saved here by the notebook described in section 3.
      - Dataframe format: ['qid','query','experiment_name']
      - experiment_name = {Q10bert_All_not_negated_NERs_keep_Sentences,...,bert_problems_treat_test_not_negated_Neg_BERT}
  - cds_clinical
    - topics-2014_2015-description.topics : Contains the topics associated with the TREC CDS collections and the other clinical collection.
    - extracted_med_entities
      - The method.csv files with the extracted medical entities are saved here by the notebooks described in section 2. The entities have already been extracted.
      - chem_med7.csv : Entities extracted using Med7.
      - dis_chem_bio_bert.csv : Entities extracted using BioBERT.
      - dis_chem_scispacy.csv : Entities extracted using scispaCy.
      - dis_chem_stanza.csv : Entities extracted using Stanza.
    - reformulated_topics
      - The experiment_name.csv files with the reformulated topics are saved here by the notebook described in section 3.
      - Dataframe format: ['qid','query','experiment_name']
      - experiment_name = {Q10bert_All_not_negated_NERs_keep_Sentences,...,bert_problems_treat_test_not_negated_Neg_BERT}
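
The layout above maps directly to file paths. A minimal sketch, assuming the paths are taken relative to the repository root and that the experiments folder is laid out as listed; the helper names `entity_file` and `reformulated_file` are hypothetical and do not come from the notebooks:

```python
from pathlib import Path

# Assumption: paths are relative to the repository root.
TOPICS_ROOT = Path("experiments") / "topics"


def entity_file(collection: str, method: str) -> Path:
    """Path of the CSV holding the entities extracted by one method."""
    return TOPICS_ROOT / collection / "extracted_med_entities" / f"{method}.csv"


def reformulated_file(collection: str, experiment_name: str) -> Path:
    """Path of the CSV holding one set of reformulated topics."""
    return TOPICS_ROOT / collection / "reformulated_topics" / f"{experiment_name}.csv"


print(entity_file("trec_clc", "chem_med7").as_posix())
print(reformulated_file("cds_clinical", "Q10bert_All_not_negated_NERs_keep_Sentences").as_posix())
```

Here `collection` is one of the two topic folders listed above (trec_clc or cds_clinical).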

2. Medical Entities - Information Extraction Approaches.

To extract the medical entities for each topic, run the four notebooks in the medical_entity_extraction folder:

BioBert_disease.ipynb
medical_extraction_m7.ipynb
medical_extraction_stanza.ipynb
scispacy.ipynb

In general, the notebooks have the following inputs and output:

Input: a selected collection, among the four employed in this work.
Input: the original versions of the queries, i.e., topics2021.txt or topics-2014_2015-description.topics.
Output: the extracted entities, saved in the extracted_med_entities folder.
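
This input/output contract can be sketched as follows. It is only an illustration under stated assumptions: the dictionary lookup is a toy stand-in for the Med7, BioBERT, scispaCy, and Stanza models used in the notebooks, and the lexicon, the column names `qid`, `entity`, `label`, and the function names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Toy lexicon standing in for a real NER model (assumption, not the paper's method).
TOY_LEXICON = {"aspirin": "CHEMICAL", "diabetes": "DISEASE"}


def toy_extract(text):
    """Return (entity, label) pairs for lexicon terms found in the text."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [(tok, TOY_LEXICON[tok]) for tok in tokens if tok in TOY_LEXICON]


def save_entities(topics, out_dir, method):
    """Write one row per extracted entity to <out_dir>/<method>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{method}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["qid", "entity", "label"])  # hypothetical columns
        for qid, text in topics.items():
            for entity, label in toy_extract(text):
                writer.writerow([qid, entity, label])
    return out_path


topics = {"1": "Patient with diabetes, currently taking aspirin."}
with tempfile.TemporaryDirectory() as tmp:
    path = save_entities(topics, Path(tmp) / "extracted_med_entities", "toy_method")
    print(path.read_text(encoding="utf-8"))
```

Swapping `toy_extract` for a call into a real model would leave the rest of the flow unchanged; the actual column layout of the shipped CSV files should be checked in the notebooks.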

Detailed instructions regarding their implementation are provided in each notebook.

3. Information Extraction & Query formulation.

The following notebook implements the information extraction approaches introduced in the paper. In addition, it allows all possible combinations of the methods to be applied.

1. Query_Reformulation_Techniques.ipynb

In general, it has the following inputs and output:

A collection, among the four employed in this work.
The original versions of the queries, i.e., topics2021.txt or topics-2014_2015-description.topics.
Output: The reformulated_topics in csv format.
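
The output format can be illustrated with a small sketch. It assumes one simple reformulation, joining the entities that are not negated into a new query, which the experiment names suggest but which is not necessarily how the notebook implements it. The `Entity` type, the `negated` flag, and the experiment name `toy_not_negated_NERs` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    negated: bool  # assumed to come from a separate negation-detection step


def reformulate(qid, entities, experiment_name):
    """Build one output row from the non-negated entities of a topic."""
    kept = [e.text for e in entities if not e.negated]
    return {"qid": qid, "query": " ".join(kept), "experiment_name": experiment_name}


def to_csv(rows):
    """Serialise rows using the ['qid','query','experiment_name'] format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["qid", "query", "experiment_name"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


entities = [Entity("chest pain", False), Entity("fever", True), Entity("aspirin", False)]
row = reformulate("1", entities, "toy_not_negated_NERs")
print(to_csv([row]))
```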

Detailed instructions regarding its implementation are provided in the notebook.

4. Information Retrieval on TREC Clinical, Clinical Collection, TREC CDS Collections.

To run retrieval with the original topics or any reformulated topics, run the following notebooks after you have indexed the required document collections and obtained the qrels.

TREC2021_Experiments.ipynb
Clinical_Experiments.ipynb
Clinical_Decision_Support_Track_2014_2015.ipynb

In general, the notebooks have the following inputs and output:

Input: a collection, among the four employed in this work.
Input: the original versions of the queries, one or more selected query variations, and the index.
Output: the results of information retrieval performed with PyTerrier.
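
A retrieval run of this kind can be sketched with PyTerrier as below. This is a sketch under assumptions rather than the notebooks' code: BM25, the metric list, and the function names are illustrative choices, and `run_bm25` needs PyTerrier, pandas, a Java runtime, an existing Terrier index, and a qrels dataframe with `qid`, `docno`, and `label` columns. The calls follow the long-standing PyTerrier API (`pt.init`, `pt.BatchRetrieve`), which newer releases may mark as deprecated. Stripping non-alphanumeric characters is a common precaution because Terrier's query parser can fail on punctuation.

```python
import re


def clean_query(text):
    """Replace non-alphanumeric characters with spaces and collapse whitespace."""
    return " ".join(re.sub(r"[^A-Za-z0-9]+", " ", text).split())


def prepare_topics(rows):
    """Turn reformulated-topic rows into the qid/query records PyTerrier expects."""
    return [{"qid": str(r["qid"]), "query": clean_query(r["query"])} for r in rows]


def run_bm25(index_path, rows, qrels):
    """Evaluate BM25 on the given topics (imports deferred: heavy dependencies)."""
    import pandas as pd
    import pyterrier as pt

    if not pt.started():
        pt.init()
    index = pt.IndexFactory.of(index_path)
    bm25 = pt.BatchRetrieve(index, wmodel="BM25")  # BM25 is an assumed choice
    topics = pd.DataFrame(prepare_topics(rows))
    return pt.Experiment([bm25], topics, qrels, eval_metrics=["map", "ndcg_cut_10", "P_10"])


rows = [{"qid": 1, "query": "chest pain, aspirin (daily)"}]
print(prepare_topics(rows))
```

Only `prepare_topics` runs without the external dependencies; `run_bm25` is meant to be called once the index and qrels described above are available.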

5. Further Information and Citation

For any further information, send an email to georgios.peikos@unimib.it.

Please cite:

@inproceedings{peikos2023investigating,
  title={Investigating the Impact of Query Representation on Medical Information Retrieval},
  author={Peikos, Georgios and Alexander, Daria and Pasi, Gabriella and de Vries, Arjen P},
  booktitle={European Conference on Information Retrieval},
  pages={512--521},
  year={2023},
  organization={Springer}
}