How to track which diagnoses become which tokens when clmbr batches are created?
Hello, I was wondering what the simplest way would be to check which token corresponds to which code, e.g. a given SNOMED code.
I tried to infer this from the dictionary object, but that did not seem directly possible.
Sorry for the confusion @ulzee and thanks for the comment!
Please run this script to view this data: https://github.com/som-shahlab/ehrshot-benchmark/blob/033715c3d5ed873c3fd2ab3cbc408d0efaf733ee/ehrshot/convert_dictionary_to_json.py
It will generate three files in EHRSHOT_ASSETS/models/clmbr:
- dictionary.json => Raw representation of the exact contents in dictionary (which is a .msgpack file)
- token_2_code.json => Dictionary where [key] = token ID (e.g. '0'), [value] = code (e.g. 'SNOMED/3950001')
- token_2_description.json => Dictionary where [key] = token ID (e.g. '0'), [value] = description (e.g. 'Birth')
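Once those files exist, looking up a token could be sketched roughly as follows. This is a minimal sketch, not part of the EHRSHOT or FEMR codebase: it assumes only the layout described above (a flat JSON object whose keys are token IDs stored as strings and whose values are code strings), and the helper names `load_token_map`, `invert_token_map`, and `decode_tokens` are hypothetical.

```python
import json


def load_token_map(path):
    """Load a {token_id_as_string: value} JSON file and return it with int keys.

    Assumes the flat layout described above, e.g. {"0": "SNOMED/3950001"}.
    """
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return {int(token): value for token, value in raw.items()}


def invert_token_map(token_to_code):
    """Build {code: [token_ids]} for reverse lookups.

    A list is kept per code in case one code maps to more than one token.
    """
    code_to_tokens = {}
    for token, code in token_to_code.items():
        code_to_tokens.setdefault(code, []).append(token)
    return code_to_tokens


def decode_tokens(tokens, token_to_code, unknown="<unmapped>"):
    """Translate a sequence of token IDs into codes, marking missing IDs."""
    return [token_to_code.get(int(t), unknown) for t in tokens]
```

For example, `load_token_map("EHRSHOT_ASSETS/models/clmbr/token_2_code.json")` would give the forward mapping, `invert_token_map(...)` answers "which token is this SNOMED code?", and `decode_tokens(...)` can be applied to a flat list of token IDs pulled from a batch. The same loader should work for token_2_description.json if it follows the same layout.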
We will update EHRSHOT_ASSETS in our next version of the dataset release to include these files by default.
Thank you for the clarifications. I think I'm still a bit lost on the tokens produced by femr.models.dataloader.BatchLoader, because they are in the range 0-65535, but dictionary.json contains 1729229 medical concepts. I assumed that either concepts 65536-1729229 are not used, or there is a many-to-few reduction somewhere. Or I'm missing something about the tokenization.