This repository contains our code for a competition organised by CentraleSupélec and Illuin Technology. You can learn more about the tasks in the Explication dataset folder or in the final presentation in presentation.
The competition had two parts: the first consisted of three NLP tasks (NER, NLI, and text classification), and the second was the creation of a search engine capable of finding patients based on filters and a search query.
- Transformers library by Hugging Face
- SciBERT
- BioBERT
- ELECTRAMed
- MiniLM-L6
- Streamlit
- Flask
- Annoy
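For reference, the pretrained encoders can all be loaded through the Transformers library. A minimal sketch using the public SciBERT checkpoint (the actual fine-tuning code in this repository may differ):

```python
# Minimal sketch: load one of the pretrained encoders via Hugging Face
# Transformers. "allenai/scibert_scivocab_uncased" is SciBERT's public
# Hub identifier; BioBERT, ELECTRAMed, etc. can be swapped in the same way.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

inputs = tokenizer("Patient denies chest pain.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```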
First, we need to download the submodule used for evaluation:
$ git submodule init
$ git submodule update
You can find the data here: https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
The initial data must be organised as follows:
medical_txt_parser
├── Explication dataset/
├── train_data/
│   ├── beth/
│   │   ├── ast/
│   │   │   ...
│   │   │   └── record-13.ast
│   │   ├── concept/
│   │   │   ...
│   │   │   └── record-13.con
│   │   ├── rel/
│   │   │   ...
│   │   │   └── record-13.rel
│   │   └── txt/
│   │       ...
│   │       └── record-13.txt
│   └── partners/
│       ├── ast/
│       │   ...
│       │   └── record-10.ast
│       ├── concept/
│       │   ...
│       │   └── record-10.con
│       ├── rel/
│       │   ...
│       │   └── record-10.rel
│       └── txt/
│           ...
│           └── record-10.txt
└── src/
Then, from the root of the project, execute the following command to build the dataset:
$ ./src/data_merger.sh
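The .con files shown in the tree above follow the i2b2/n2c2 2010 concept annotation format, where each line ties a text span to a concept type. A hedged parsing sketch (illustrative only; not the parser used by data_merger.sh):

```python
# Illustrative parser for i2b2/n2c2 2010 .con files. Each line looks like:
#   c="concept text" start_line:start_token end_line:end_token||t="type"
# Line numbers are 1-based and token offsets 0-based in the original data.
import re

CON_PATTERN = re.compile(
    r'c="(?P<text>.*)" (?P<sl>\d+):(?P<st>\d+) (?P<el>\d+):(?P<et>\d+)'
    r'\|\|t="(?P<type>.*)"'
)

def parse_con_file(path):
    """Return (text, (start_line, start_token), (end_line, end_token), type) tuples."""
    concepts = []
    with open(path) as f:
        for line in f:
            match = CON_PATTERN.match(line.strip())
            if match:
                concepts.append((
                    match.group("text"),
                    (int(match.group("sl")), int(match.group("st"))),
                    (int(match.group("el")), int(match.group("et"))),
                    match.group("type"),
                ))
    return concepts
```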
To prepare the embeddings and clusters for the search API:
$ cd src
$ python -m clustering.prepare_embeddings
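This step is where Annoy comes in: it builds an approximate-nearest-neighbour index over the embeddings so the search API can retrieve the closest patients for a query vector. A sketch of that idea, with the dimension and file name as placeholder assumptions rather than the repository's actual configuration:

```python
# Sketch: build an Annoy index over patient embeddings for fast
# nearest-neighbour search. The 384-dim size matches MiniLM-L6
# sentence embeddings; the vectors here are random stand-in data.
import numpy as np
from annoy import AnnoyIndex

dim = 384
index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity

embeddings = np.random.rand(100, dim).astype("float32")
for i, vector in enumerate(embeddings):
    index.add_item(i, vector)

index.build(10)  # 10 trees: more trees -> better recall, bigger index
index.save("patients.ann")

# Query time: embed the query text, then fetch the nearest patient ids.
print(index.get_nns_by_vector(embeddings[0], 5))
```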
To launch the app, execute the following from the root directory of the project:
$ python src/api.py
$ streamlit run app/search_engine.py
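The Flask API and the Streamlit front end run as two separate processes, so start them in separate terminals. Conceptually, the front end forwards the query to the API; a hedged sketch of that interaction (the endpoint path and response shape are assumptions, not the actual contract in src/api.py):

```python
# Sketch of the front-end/API interaction: Streamlit collects a query
# and forwards it to the Flask API. The /search route and JSON response
# shape below are assumed for illustration; see src/api.py for the
# real endpoints.
import requests
import streamlit as st

st.title("Patient search")
query = st.text_input("Search query")

if query:
    response = requests.get("http://localhost:5000/search", params={"q": query})
    for patient in response.json():
        st.write(patient)
```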