chanzuckerberg/cellxgene-census

Running the geneformer example results in KeyError


Describe the bug

I am trying to run the Geneformer example on the provided test data, as explained in the tutorials. After installing the latest version of geneformer from the Hugging Face repository and doing some plumbing to get everything to work, I run into the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[92], line 7
      3 # create the trainer
      5 kwargs = {"token_dictionary": tokenizer.gene_token_dict};
      6 trainer = Trainer(model=model,
----> 7                 data_collator=DataCollatorForCellClassification())
      8 # use trainer

File ~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/geneformer/collator_for_classification.py:611, in DataCollatorForGeneClassification.__init__(self, *args, **kwargs)
    610 def __init__(self, *args, **kwargs) -> None:
--> 611     self.token_dictionary = kwargs.pop("token_dictionary")
    612     super().__init__(
    613         tokenizer=PrecollatorForGeneAndCellClassification(
    614             token_dictionary=self.token_dictionary
   (...)
    621         **kwargs,
    622     )

KeyError: 'token_dictionary'

when I try to execute

# reload pretrained model
model = BertForSequenceClassification.from_pretrained(model_dir)
# create the trainer
trainer = Trainer(model=model,
                data_collator=DataCollatorForCellClassification())

All data being used is from the example data and none from external sources.
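
For reference, the failing line in the traceback is `kwargs.pop("token_dictionary")`, and `dict.pop` without a default raises `KeyError` when the key is absent. The snippet below is a minimal sketch of that behaviour only; `StrictCollator` and `try_build` are hypothetical stand-ins written for illustration and are not part of geneformer.

```python
class StrictCollator:
    """Hypothetical stand-in that mirrors the __init__ shown in the traceback."""

    def __init__(self, *args, **kwargs) -> None:
        # pop() without a default raises KeyError when the key is missing
        self.token_dictionary = kwargs.pop("token_dictionary")


def try_build(**kwargs):
    """Return the collator, or a message naming the missing keyword argument."""
    try:
        return StrictCollator(**kwargs)
    except KeyError as err:
        return f"missing keyword argument: {err.args[0]}"


print(try_build())  # no token_dictionary passed, as in the failing call
print(type(try_build(token_dictionary={"<pad>": 0, "<mask>": 1})).__name__)
```

The first call reports the missing `token_dictionary` argument; the second constructs the object because the key is supplied.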

I tried working around this by passing token_dictionary explicitly:

# reload pretrained model
model = BertForSequenceClassification.from_pretrained(model_dir)
# create the trainer

kwargs = {"token_dictionary": tokenizer.gene_token_dict};
trainer = Trainer(model=model,
                data_collator=DataCollatorForCellClassification(**kwargs))

But this results in the following error during training:

File ~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1073, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1071 if hasattr(self.embeddings, "token_type_ids"):
   1072     buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
-> 1073     buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
   1074     token_type_ids = buffered_token_type_ids_expanded
   1075 else:

RuntimeError: The expanded size of the tensor (2377) must match the existing size (2048) at non-singleton dimension 1.  Target sizes: [8, 2377].  Tensor sizes: [1, 2048]
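
The sizes in that message suggest that a batch of 8 cells was padded to 2377 tokens while the model's buffered token_type_ids cover only 2048 positions. The snippet below is a rough diagnostic sketch for checking tokenized inputs against that limit. `MODEL_MAX_LEN`, `find_overlong`, and `truncate` are hypothetical names, and the limit of 2048 is an assumption read off the error message. Truncating inputs changes what the model sees, so this only helps confirm the mismatch and is not a recommended fix.

```python
MODEL_MAX_LEN = 2048  # assumption: taken from "Tensor sizes: [1, 2048]" in the error


def find_overlong(batch, max_len=MODEL_MAX_LEN):
    """Return (index, length) for every token sequence longer than max_len."""
    return [(i, len(ids)) for i, ids in enumerate(batch) if len(ids) > max_len]


def truncate(batch, max_len=MODEL_MAX_LEN):
    """Cut every sequence down to at most max_len tokens (diagnostic only)."""
    return [ids[:max_len] for ids in batch]


# Toy batch: one sequence as long as the one in the error, one shorter.
batch = [list(range(2377)), list(range(1500))]
print(find_overlong(batch))
print([len(ids) for ids in truncate(batch)])
```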

To Reproduce

Run the Geneformer example notebook with the latest version of geneformer installed.

Environment


Mac M1 Pro, running Python 3.10, with these versions of the most relevant libraries:

cellxgene-census 1.16.2
geneformer 0.1.0
tiledb 0.32.5
tiledbsoma 1.14.5
torch 2.5.1
transformers 4.46.2

mlin commented

@ddemaeyer Can you please try setting geneformer to the specific git revision eb038a6?

pip install git+https://huggingface.co/ctheodoris/Geneformer@eb038a6

Sorry for the bump in the road. That revision is what we coded against when we created the example; at that time (possibly still?), the Geneformer repository didn't have tagged/released versions, making it a little challenging to keep track of subsequent changes. We'll be doing some work shortly to update the cellxgene_census Geneformer integration to a newer version, but it will take some time to get out the door. Thanks!
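
Because the package version alone (0.1.0 in the report above) does not identify a git revision, it can help to confirm which commit pip actually installed. The snippet below is a minimal sketch, assuming the package was installed from a `git+https://` URL so that pip recorded a `direct_url.json` file in the distribution metadata (PEP 610). `commit_from_direct_url` and `installed_commit` are hypothetical helper names.

```python
import json
from importlib import metadata
from typing import Optional


def commit_from_direct_url(direct_url_text: Optional[str]) -> Optional[str]:
    """Extract the VCS commit id from the text of a direct_url.json file."""
    if not direct_url_text:
        return None  # not installed from a direct URL
    info = json.loads(direct_url_text)
    return info.get("vcs_info", {}).get("commit_id")


def installed_commit(package: str = "geneformer") -> Optional[str]:
    """Return the commit pip recorded for `package`, or None if unavailable."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    # read_text returns None when the metadata file does not exist
    return commit_from_direct_url(dist.read_text("direct_url.json"))


commit = installed_commit()
# pip records the full hash, so compare against the abbreviated revision by prefix
print(commit, bool(commit and commit.startswith("eb038a6")))
```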