/inf_extraction_med_ir



Investigating the Impact of Query Representation on Medical Information Retrieval

This repository contains the experiments related to the paper entitled "Investigating the Impact of Query Representation on Medical Information Retrieval", published in European Conference on Information Retrieval 2023.

Contents

  1. [Folder Structure] Information related to the available code, topics, collections and query reformulations.
  2. [Describes Code] Medical Entities - Information Extraction Approaches.
  3. [Describes Code] Information Extraction & Query formulation.
  4. [Describes Code] Information Retrieval on TREC Clinical, Clinical Collection, TREC CDS Collections.
  5. [Information] Further information and citation.

1. Folder Structure

  • experiments
    • indices
      • Add the indices of the TREC 2021 clinical collection (trec_clc folder) and of the clinical collection (clinical folder). For the latter, refer to the paper for download details.
      • The CDS collections are obtained via ir_datasets.
      • [sub-folder] trec_clc
      • [sub-folder] clinical
    • medical_entity_extraction
      • Contains the notebooks that extract the medical entities, using the various libraries.
    • qrels
      • Add the TREC 2021 clinical qrels (refer to the TREC 2021 Clinical Trials task).
      • Add the qrels related to the other clinical collection (refer to the paper for download details).
      • For the CDS collections, the qrels are obtained via ir_datasets.
    • topics
      • trec_clc
        • topics2021.txt : Contains the TREC 2021 topics.
        • extracted_med_entities
          • The method.csv files with the extracted medical entities are saved here by the code described in Section 2. The entities have already been extracted.
          • chem_med7.csv : Entities extracted using Med7.
          • dis_chem_bio_bert.csv : Entities extracted using BioBERT.
          • dis_chem_scispacy.csv : Entities extracted using scispaCy.
          • dis_chem_stanza.csv : Entities extracted using Stanza.
        • reformulated_topics
          • The experiment_name.csv files with the reformulated topics are saved here by the code described in Section 3.
          • Format: ['qid','query','experiment_name']
          • experiment_name = {Q10bert_All_not_negated_NERs_keep_Sentences,...,bert_problems_treat_test_not_negated_Neg_BERT}
      • cds_clinical
        • topics-2014_2015-description.topics : Contains the topics associated with the TREC CDS collections and the other clinical collection.
        • extracted_med_entities
          • The method.csv files with the extracted medical entities are saved here by the code described in Section 2. The entities have already been extracted.
          • chem_med7.csv : Entities extracted using Med7.
          • dis_chem_bio_bert.csv : Entities extracted using BioBERT.
          • dis_chem_scispacy.csv : Entities extracted using scispaCy.
          • dis_chem_stanza.csv : Entities extracted using Stanza.
        • reformulated_topics
          • The experiment_name.csv files with the reformulated topics are saved here by the code described in Section 3.
          • Dataframe Format: ['qid','query','experiment_name']
          • experiment_name = {Q10bert_All_not_negated_NERs_keep_Sentences,...,bert_problems_treat_test_not_negated_Neg_BERT}
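The layout above can be checked, or created, before running the notebooks. The following is a minimal sketch, assuming the path of the experiments folder is passed in; the names EXPECTED_DIRS, missing_dirs and create_layout are hypothetical helpers and are not part of the repository.

```python
from pathlib import Path

# Sub-folders listed in the folder structure above, relative to experiments/.
EXPECTED_DIRS = [
    "indices/trec_clc",
    "indices/clinical",
    "medical_entity_extraction",
    "qrels",
    "topics/trec_clc/extracted_med_entities",
    "topics/trec_clc/reformulated_topics",
    "topics/cds_clinical/extracted_med_entities",
    "topics/cds_clinical/reformulated_topics",
]


def missing_dirs(root):
    """Return the expected sub-folders that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


def create_layout(root):
    """Create any missing sub-folders; existing ones are left untouched."""
    for d in missing_dirs(root):
        (Path(root) / d).mkdir(parents=True, exist_ok=True)
```

Calling missing_dirs("experiments") before a run gives a quick list of what still has to be added, for example the indices and qrels that must be downloaded separately.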

2. Medical Entities - Information Extraction Approaches.

To extract the medical entities for each topic, run the four notebooks in the medical_entity_extraction folder.

  • BioBert_disease.ipynb
  • medical_extraction_m7.ipynb
  • medical_extraction_stanza.ipynb
  • scispacy.ipynb

In general, the notebooks have the following inputs and output:

  • A selected collection, among the four employed in this work.
  • The original versions of the queries, i.e., topics2021.txt or topics-2014_2015-description.topics.
  • Output: the extracted entities, saved in the extracted_med_entities folder.
Detailed instructions regarding their implementation are provided in each notebook.
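As a rough illustration of what such a notebook does, the sketch below runs a spaCy-style NER pipeline over (qid, text) pairs and writes one row per extracted entity. It assumes a pipeline whose output exposes .ents with .text and .label_ attributes, as spaCy and scispaCy pipelines do. The function names and the column names qid, entity and label are assumptions made for this example and may differ from the schema of the CSV files in the repository.

```python
import csv


def extract_entities(nlp, topics):
    """Run an NER pipeline over (qid, text) pairs; return one dict per entity."""
    rows = []
    for qid, text in topics:
        doc = nlp(text)
        for ent in doc.ents:
            rows.append({"qid": qid, "entity": ent.text, "label": ent.label_})
    return rows


def save_entities(rows, path):
    """Write the entities to a CSV file (hypothetical column layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["qid", "entity", "label"])
        writer.writeheader()
        writer.writerows(rows)
```

With scispaCy and one of its NER models installed, nlp could be obtained with spacy.load("en_ner_bc5cdr_md"), which tags DISEASE and CHEMICAL spans. Whether the notebooks use this particular model is an assumption; each notebook documents its own setup.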

3. Information Extraction & Query formulation.

The following notebook implements the information extraction approaches introduced in the paper. In addition, it allows all possible combinations of these methods to be applied.

  • 1. Query_Reformulation_Techniques.ipynb

In general, it has the following inputs and output:

  • A collection, among the four employed in this work.
  • The original versions of the queries, i.e., topics2021.txt or topics-2014_2015-description.topics.
  • Output: The reformulated_topics in csv format.
Detailed instructions regarding its implementation are provided in the notebook.
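To make the output format concrete, the sketch below writes and reads a file with the ['qid','query','experiment_name'] columns described in the folder structure. The reformulation shown (joining the extracted entities of a topic into a single query) is only an illustrative stand-in for the techniques in the paper, and the function names and the experiment name entities_only_example are hypothetical.

```python
import csv

FIELDS = ["qid", "query", "experiment_name"]


def entities_only_queries(entities_by_qid, experiment_name):
    """Build one query per topic by joining its extracted entities with spaces."""
    return [
        {"qid": qid, "query": " ".join(ents), "experiment_name": experiment_name}
        for qid, ents in entities_by_qid.items()
    ]


def write_reformulated(rows, path):
    """Save reformulated topics as a CSV file with the three expected columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_reformulated(path):
    """Load reformulated topics back as a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Following the naming convention above, each set of reformulated topics would be written to a file named after its experiment, inside the reformulated_topics folder of the corresponding collection.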

4. Information Retrieval on TREC Clinical, Clinical Collection, TREC CDS Collections.

To use the original topics and any reformulated topic, run the following notebooks, after you have indexed the required document collections and obtained the qrels.

  • TREC2021_Experiments.ipynb
  • Clinical_Experiments.ipynb
  • Clinical_Decision_Support_Track_2014_2015.ipynb
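For the CDS collections, the qrels come from ir_datasets rather than from local files. The sketch below shows one way they might be loaded and converted to the qid, docno and label columns that PyTerrier expects. The dataset identifier pmc/v1/trec-cds-2014 is an ir_datasets identifier for the 2014 CDS track; that the notebooks use exactly this identifier is an assumption, and the function names are hypothetical.

```python
def qrels_rows(qrels_iter):
    """Convert ir_datasets-style qrels into dicts with PyTerrier column names."""
    return [
        {"qid": str(q.query_id), "docno": str(q.doc_id), "label": int(q.relevance)}
        for q in qrels_iter
    ]


def load_cds_qrels(dataset_id="pmc/v1/trec-cds-2014"):
    """Fetch the qrels of a CDS collection via ir_datasets (downloads on first use)."""
    import ir_datasets  # imported lazily so the helper above works without it

    dataset = ir_datasets.load(dataset_id)
    return qrels_rows(dataset.qrels_iter())
```

The 2015 track would be loaded the same way by passing pmc/v1/trec-cds-2015 as dataset_id.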

In general, the notebooks have the following inputs and output:

  • A collection, among the four employed in this work.
  • The original versions of the queries, one or more selected query variations, and the index.
  • Output: the retrieval results, obtained using PyTerrier.
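A retrieval run of this kind can be sketched as follows. This is not the code of the notebooks: BM25 and the listed metrics are illustrative choices, the function names are hypothetical, and the calls follow the classic PyTerrier API (pt.init, pt.BatchRetrieve, pt.Experiment), which newer releases may rename. The clean_query helper is included because Terrier's query parser can fail on punctuation; whether the notebooks apply the same cleaning is an assumption.

```python
import re


def clean_query(text):
    """Replace everything except ASCII letters, digits and whitespace with
    spaces, then collapse runs of whitespace."""
    return " ".join(re.sub(r"[^A-Za-z0-9\s]", " ", text).split())


def run_bm25(index_path, topics, qrels):
    """Evaluate BM25 on an existing Terrier index.

    topics: list of dicts with keys qid and query.
    qrels:  list of dicts with keys qid, docno and label.
    """
    import pandas as pd
    import pyterrier as pt  # lazy import: requires Java and python-terrier

    if not pt.started():
        pt.init()
    topics_df = pd.DataFrame(
        [{"qid": str(t["qid"]), "query": clean_query(t["query"])} for t in topics]
    )
    qrels_df = pd.DataFrame(qrels)
    bm25 = pt.BatchRetrieve(pt.IndexFactory.of(index_path), wmodel="BM25")
    return pt.Experiment(
        [bm25],
        topics_df,
        qrels_df,
        eval_metrics=["map", "ndcg_cut_10", "P_10"],
        names=["BM25"],
    )
```

Running the same function once with the original topics and once with each reformulated topic file gives directly comparable rows of evaluation scores.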

5. Further Information and Citation

For any further information, please send an email to georgios.peikos@unimib.it.

Please cite:

@inproceedings{peikos2023investigating,
  title={Investigating the Impact of Query Representation on Medical Information Retrieval},
  author={Peikos, Georgios and Alexander, Daria and Pasi, Gabriella and de Vries, Arjen P},
  booktitle={European Conference on Information Retrieval},
  pages={512--521},
  year={2023},
  organization={Springer}
}