[Question] Language availability
Hello,
Does MedCAT have models, or use datasets, in a language other than English, such as French or Spanish?
Hi, MedCAT doesn't rely on English-specific language features, so there is no reason why French, Spanish, or any other language could not be supported.
@sandertan has already added support for Dutch diacritics and has trained Dutch MedCAT models, for example. However, like our own core-team models, they are trained on sensitive hospital data, so sharing these models directly with the wider community is difficult.
The only models we currently provide to the wider community are English test models trained on MT Samples or MedMentions, i.e. openly available datasets.
Hi @IKetchup, indeed we made models for the Dutch language from public UMLS and SNOMED data. We documented our methods in an internal repo and are looking into open-sourcing this. With a few changes, it might be fairly easy to generate something similar for other languages.
In the meantime, you could look into downloading the UMLS database, loading it into a MySQL database, filtering the concepts in the MRCONSO table by your language and source vocabularies, and putting them in the MedCAT format described at https://github.com/CogStack/MedCAT/tree/master/examples.
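As a rough sketch of that filtering step, the Python below reads the pipe-delimited MRCONSO.RRF file directly rather than going through MySQL, which is a swapped-in shortcut and not the method described above. The function names `filter_mrconso` and `write_concept_csv` are hypothetical, and the `cui,name` CSV header is an assumption about the MedCAT input format, so please check it against the examples linked above for your MedCAT version.

```python
import csv

# Column order of MRCONSO.RRF (each line also ends with a trailing "|").
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def filter_mrconso(lines, language, sources):
    """Yield (cui, name) pairs for one language (LAT) and a set of
    source vocabularies (SAB). Hypothetical helper, not part of MedCAT."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < len(MRCONSO_COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(MRCONSO_COLUMNS, fields))
        if row["LAT"] == language and row["SAB"] in sources:
            yield row["CUI"], row["STR"]


def write_concept_csv(pairs, out):
    """Write de-duplicated (cui, name) pairs as CSV to a file-like object.
    The header names are an assumption; adjust them to the MedCAT examples."""
    writer = csv.writer(out)
    writer.writerow(["cui", "name"])
    seen = set()
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            writer.writerow(pair)
```

For French, for instance, you would open MRCONSO.RRF with UTF-8 encoding, pass the file object to `filter_mrconso` with `language="FRE"` and the source vocabularies you want (for example `{"MSHFRE"}`), and hand the result to `write_concept_csv` together with an output file opened with `newline=""`.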