How to track which diagnoses become which tokens when clmbr batches are created?

Question

Hello, I was wondering what the simplest way would be to check which token corresponds to which code, e.g. a given SNOMED code.
I tried to infer this from the dictionary object, but that did not seem directly possible.

Answer 1 · 2024-06-28T08:43:25.000Z

Sorry for the confusion @ulzee and thanks for the comment!

Please run this script to view this data: https://github.com/som-shahlab/ehrshot-benchmark/blob/033715c3d5ed873c3fd2ab3cbc408d0efaf733ee/ehrshot/convert_dictionary_to_json.py

It will generate three files in EHRSHOT_ASSETS/models/clmbr:

dictionary.json => Raw representation of the exact contents in dictionary (which is a .msgpack file)
token_2_code.json => Dictionary where [key] = token ID (e.g. '0'), [value] = code (e.g. 'SNOMED/3950001')
token_2_description.json => Dictionary where [key] = token ID (e.g. '0'), [value] = description (e.g. 'Birth')
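
Once the script has generated these files, a token lookup could be sketched roughly as follows. This assumes the JSON layout described above (token IDs stored as string keys, with plain string values). The helper names `load_token_maps`, `describe_token`, and `tokens_for_code` are hypothetical and not part of femr or EHRSHOT.

```python
import json
from pathlib import Path


def load_token_maps(clmbr_dir):
    """Load the token ID -> code and token ID -> description mappings.

    `clmbr_dir` is assumed to be the folder that holds the generated
    JSON files (e.g. EHRSHOT_ASSETS/models/clmbr).
    """
    clmbr_dir = Path(clmbr_dir)
    with open(clmbr_dir / "token_2_code.json") as f:
        token_2_code = json.load(f)
    with open(clmbr_dir / "token_2_description.json") as f:
        token_2_description = json.load(f)
    return token_2_code, token_2_description


def describe_token(token_id, token_2_code, token_2_description):
    """Return (code, description) for a token ID, or (None, None) if unknown.

    JSON object keys are always strings, so the integer ID is cast first.
    """
    key = str(token_id)
    return token_2_code.get(key), token_2_description.get(key)


def tokens_for_code(code, token_2_code):
    """Reverse lookup: all token IDs whose code equals `code`."""
    return [int(tok) for tok, c in token_2_code.items() if c == code]
```

For example, `describe_token(0, *load_token_maps("EHRSHOT_ASSETS/models/clmbr"))` would return the code and description stored under key `'0'`, if that key is present.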

We will update EHRSHOT_ASSETS in our next version of the dataset release to include these files by default.

Answer 2 · 2024-06-28T16:56:59.000Z

Thank you for the clarifications. I think I'm still a bit lost on the tokens produced by femr.models.dataloader.BatchLoader: they are in the range 0-65535, but dictionary.json contains 1729229 medical concepts. I assumed, then, that concepts 65536-1729229 are not used, or that there is a many-to-few reduction somewhere. Or am I missing something about the tokenization?
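
One way to investigate this could be to summarize the token IDs that actually appear in token_2_code.json and compare them with the 0-65535 range. This is only a sketch, assuming that file is keyed by numeric token IDs as described in Answer 1. The function name `summarize_token_ids` and the `vocab_size` default are illustrative, not part of femr.

```python
import json


def summarize_token_ids(token_2_code, vocab_size=65536):
    """Summarize which token IDs appear as keys in a token -> code mapping.

    Keys are assumed to be numeric strings (JSON object keys are strings).
    `vocab_size` is the size of the token range seen in the batches.
    """
    ids = sorted(int(k) for k in token_2_code)
    in_range = [i for i in ids if 0 <= i < vocab_size]
    return {
        "num_tokens": len(ids),
        "min_id": ids[0] if ids else None,
        "max_id": ids[-1] if ids else None,
        "num_in_vocab_range": len(in_range),
    }


# Usage, assuming the generated file exists at this path:
# with open("EHRSHOT_ASSETS/models/clmbr/token_2_code.json") as f:
#     print(summarize_token_ids(json.load(f)))
```

If `max_id` stays below 65536, the token-to-code mapping covers only the range seen in the batches, which would suggest the remaining dictionary entries are never emitted as tokens. The sketch alone does not confirm why.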