zktuong/dandelion

UnboundLocalError: local variable 'contig' referenced before assignment

christie-ga opened this issue · 14 comments

Description of the bug

Hi,

I am using the singularity container for BCR data preprocessing, but I keep getting the following error message for some of my files (just single files). Can you help me to figure out what I have to change/do to get it running?

     START> CreateGermlines
      FILE> TUXXX_light_igblast_db-pass.tsv
GERM_TYPES> dmask
 SEQ_FIELD> sequence_alignment
   V_FIELD> v_call
   D_FIELD> d_call
   J_FIELD> j_call
    CLONED> False

PROGRESS> 20:33:11 |##                  |  10% ( 10) 0.0 minTraceback (most recent call last):
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 354, in <module>
    createGermlines(**args_dict)
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 148, in createGermlines
    for key, records in receptor_iter:
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 133, in <genexpr>
    receptor_iter = ((x.sequence_id, [x]) for x in db_iter)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/changeo/IO.py", line 75, in __next__
    record = next(self.reader)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/airr/io.py", line 99, in __next__
    raise ValueError('row has extra data')
ValueError: row has extra data
For convenience, entries for light chain `v_call` are copied to `v_call_genotyped`.
Returning summary plot
/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/plotnine/ggplot.py:817: PlotnineWarning: Filename: TU639/TU639_reassign_alleles.pdf
Writing out to individual folders : 50%|█████ | 1/2 [00:00<00:00, 7.36it/s] /opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py:559: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Writing out to individual folders : 50%|█████ | 1/2 [00:00<00:00, 5.71it/s]
Traceback (most recent call last):
File "/share/dandelion_preprocess.py", line 303, in
main()
File "/share/dandelion_preprocess.py", line 241, in main
ddl.pp.reassign_alleles(
File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/preprocessing/_preprocessing.py", line 1659, in reassign_alleles
write_airr(out_file, outfilepath.replace(".tsv", "_genotyped.tsv"))
File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py", line 668, in write_airr
data = sanitize_data(data)
File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py", line 426, in sanitize_data
validate_airr(data)
File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py", line 533, in validate_airr
RearrangementSchema.validate_header(contig.keys())
UnboundLocalError: local variable 'contig' referenced before assignment

Thank you very much for your help!!

Minimal reproducible example

No response

The error message produced by the code above

No response

OS information

No response

Version information

No response

Additional context

No response

hi, can you tell me what version of the container are you using?

Secondly, are some of your files e.g. TUXXX_light_igblast_db-pass.tsv empty?

oh wait i just noticed

  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 354, in <module>
    createGermlines(**args_dict)
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 148, in createGermlines
    for key, records in receptor_iter:
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 133, in <genexpr>
    receptor_iter = ((x.sequence_id, [x]) for x in db_iter)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/changeo/IO.py", line 75, in __next__
    record = next(self.reader)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/airr/io.py", line 99, in __next__
    raise ValueError('row has extra data')
ValueError: row has extra data

I think this is what's throwing up the problem - the problematic files were failing the airr IO function in changeo. I'm not exactly sure what the 'extra data' means, but perhaps you should first check that your files can be read in with the airr functions documented here:
https://docs.airr-community.org/en/v1.2.1/packages/airr-python/overview.html
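For context, airr's reader raises `ValueError('row has extra data')` when a data row has more fields than the header. If the airr package isn't handy, a quick stdlib check can replicate that condition and flag exactly which rows trip it. This is just a sketch: `find_bad_rows` and the sample string are hypothetical helpers, not part of dandelion or airr.

```python
import csv
import io

def find_bad_rows(tsv_text):
    """Return (line_number, n_fields) for rows whose field count differs
    from the header -- the same condition that makes airr's reader raise
    ValueError('row has extra data')."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    bad = []
    for lineno, row in enumerate(reader, start=2):  # header is line 1
        if len(row) != len(header):
            bad.append((lineno, len(row)))
    return bad

# example: the second data row has one field too many
sample = "sequence_id\tv_call\nseq1\tIGHV1-2\nseq2\tIGHV3-7\textra\n"
print(find_bad_rows(sample))  # -> [(3, 3)]
```

Running this over a problematic `*_db-pass.tsv` (reading the file contents into a string first) would pinpoint the offending lines.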

Hi,

thank you very much for your quick reply. I figured out that I had two files sharing the same individual ID (I am using a metadata file with --meta), which caused the problem!

I changed the metadata file so that it only contains unique individual IDs and now it runs without errors...

Thanks!

@zktuong what is the appropriate way to specify the --meta file such that dandelion treats multiple samples from the same individual as coming from the same individual? The workaround @christie-ga has suggested works fine but as far as I can tell will lead to individual samples (and therefore germline assignment and clone assignment) being processed per-sample, rather than per-individual.

@benjacobs123456 the script actually looks for the individual column name - if it's there (literally), then it will handle the individuals accordingly. You can see here for an example https://github.com/zktuong/dandelion-demo-files/blob/master/dandelion_manuscript/data/dandelion-remap/BCR_metadata.csv

Thanks @zktuong

I've tried to run the following:
singularity run -B $PWD --env R_LIBS_USER=~/dummy/:$R_LIBS_USER \
    ~/dandelion_jan23/sc-dandelion_latest.sif dandelion-preprocess \
    --file_prefix "filtered" \
    --filter_to_high_confidence \
    --meta meta_file.csv

Where meta_file.csv looks like this:
(base) [hpcjaco1@login-q-1 BCR]$ head meta_file.csv
sample,individual
CSF_8A2H3PL1,8A2H3PL1
CSF_9X14GT24,9X14GT24
CSF_E9LEH7P8,E9LEH7P8
CSF_FW1LNFL5,FW1LNFL5
CSF_FW828DF0,FW828DF0
CSF_HU0X65T7,HU0X65T7
CSF_JEGK54J2,JEGK54J2
CSF_PWXFC1K5,PWXFC1K5
CSF_RL4X6288,RL4X6288

It runs fine until the tigger step, when I get the following error:

      Attempting to run tigger-genotype without novel allele discovery.
Running command: tigger-genotype.R -d E9LEH7P8/E9LEH7P8_heavy_igblast_db-pass.tsv -r /share/database/germlines/imgt/human/vdj/imgt_human_IGHV.fasta -n E9LEH7P8_heavy_igblast_db-pass -N NO -o E9LEH7P8 -f airr

Warning message:
One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
null device
          1
tigger-genotype execution took: 0:00:08 secs (Wall clock time)

Running command: CreateGermlines.py -d E9LEH7P8/E9LEH7P8_heavy_igblast_db-pass_genotyped.tsv -g dmask -r E9LEH7P8/E9LEH7P8_heavy_igblast_db-pass_genotype.fasta /share/database/germlines/imgt/human/vdj//imgt_human_IGHD.fasta /share/database/germlines/imgt/human/vdj//imgt_human_IGHJ.fasta --vf v_call_genotyped

     START> CreateGermlines
      FILE> E9LEH7P8_heavy_igblast_db-pass_genotyped.tsv
GERM_TYPES> dmask
 SEQ_FIELD> sequence_alignment
   V_FIELD> v_call_genotyped
   D_FIELD> d_call
   J_FIELD> j_call
    CLONED> False

PROGRESS> 14:06:29 |                    |   0% (  0) 0.0 minTraceback (most recent call last):
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 354, in <module>
    createGermlines(**args_dict)
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 148, in createGermlines
    for key, records in receptor_iter:
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 133, in <genexpr>
    receptor_iter = ((x.sequence_id, [x]) for x in db_iter)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/changeo/IO.py", line 75, in __next__
    record = next(self.reader)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/airr/io.py", line 99, in __next__
    raise ValueError('row has extra data')
ValueError: row has extra data
Running command: CreateGermlines.py -d E9LEH7P8/E9LEH7P8_light_igblast_db-pass.tsv -g dmask -r /share/database/germlines/imgt/human/vdj//imgt_human_IGKV.fasta /share/database/germlines/imgt/human/vdj//imgt_human_IGKJ.fasta /share/database/germlines/imgt/human/vdj//imgt_human_IGLV.fasta /share/database/germlines/imgt/human/vdj//imgt_human_IGLJ.fasta --vf v_call

     START> CreateGermlines
      FILE> E9LEH7P8_light_igblast_db-pass.tsv
GERM_TYPES> dmask
 SEQ_FIELD> sequence_alignment
   V_FIELD> v_call
   D_FIELD> d_call
   J_FIELD> j_call
    CLONED> False

PROGRESS> 14:06:40 |                    |   0% (  0) 0.0 minTraceback (most recent call last):
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 354, in <module>
    createGermlines(**args_dict)
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 148, in createGermlines
    for key, records in receptor_iter:
  File "/opt/conda/envs/sc-dandelion-container/bin/CreateGermlines.py", line 133, in <genexpr>
    receptor_iter = ((x.sequence_id, [x]) for x in db_iter)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/changeo/IO.py", line 75, in __next__
    record = next(self.reader)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/airr/io.py", line 99, in __next__
    raise ValueError('row has extra data')
ValueError: row has extra data
      For convenience, entries for light chain `v_call` are copied to `v_call_genotyped`.
Returning summary plot
/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/plotnine/ggplot.py:817: PlotnineWarning: Filename: E9LEH7P8/E9LEH7P8_reassign_alleles.pdf
      Reassigning alleles
            Reconstructing heavy chain dmask germline sequences with v_call_genotyped.
      Reassigning alleles
            Reconstructing heavy chain dmask germline sequences with v_call_genotyped.
            Reconstructing light chain dmask germline sequences with v_call.
Writing out to individual folders :   0%|          | 0/2 [00:00<?, ?it/s]/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py:557: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Writing out to individual folders :  50%|█████     | 1/2 [00:00<00:00,  9.65it/s]
Traceback (most recent call last):
  File "/share/dandelion_preprocess.py", line 314, in <module>
    main()
  File "/share/dandelion_preprocess.py", line 249, in main
    ddl.pp.reassign_alleles(
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/preprocessing/_preprocessing.py", line 1729, in reassign_alleles
    write_airr(out_file, outfilepath.replace(".tsv", "_genotyped.tsv"))
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py", line 668, in write_airr
    data = sanitize_data(data)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py", line 424, in sanitize_data
    validate_airr(data)
  File "/opt/conda/envs/sc-dandelion-container/lib/python3.9/site-packages/dandelion/utilities/_utilities.py", line 531, in validate_airr
    RearrangementSchema.validate_header(contig.keys())
UnboundLocalError: local variable 'contig' referenced before assignment

I'm using the most recent singularity image (the one you kindly sent - thank you :) ). This meta file worked previously with the same directory structure and same raw input files.

Again I think the issue is the 'extra data' but I'm not sure what this is. Here is an example of one of the FASTA files:

>AAAGCAAAGACGCTTT-1_contig_1
GGAGTGCTTTCTGAGAGTCATGGACCTCCTGCACAAGAACATGAAACACCTGTGGTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCGTAGTGGAAGCACCAGCTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGCAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGGCGAACGGGAGGAACAGCACCTGGTCCGCCTAACCTTATTTGACTGCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTG
>AAAGCAAAGACGCTTT-1_contig_2
TGAGCGCAGAAGGCAGGACTCGGGACAATCTTCATCATGACCTGCTCCCCTCTCCTCCTCACCCTTCTCATTCACTGCACAGGGTCCTGGGCCCAGTCTGTGTTGACGCAGCCGCCCTCAGTGTCTGCGGCCCCAGGACAGAAGGTCACCATCTCCTGCTCTGGAAGCAGCTCCAACATTGGGAATAATTATGTATCCTGGTACCAGCAGCTCCCAGGAACAGCCCCCAAACTCCTCATTTATGACAATAATAAGCGACCCTCAGGGATTCCTGACCGATTCTCTGGCTCCAAGTCTGGCACGTCAGCCACCCTGGGCATCACCGGACTCCAGACTGGGGACGAGGCCGATTATTACTGCGGAACATGGGATAGCAGCCTGAGTGTTGTGATATTCGGCGGAGGGACCAAGCTGACCGTCCTAGGTCAGCCCAAGGCTGCCCCCTCGGTCACTCTGTTCCCGCCCTCCTCTGAGGAGCTTCAAGCCAACAAGGCCACACTGGTGTGTCTCATAAGTGACTTCTACCCGGGAGCCGTGACAGTGGCCTGGAAGGCAGATAGCAGCCCCGTCAAGGCGGGAGTGGAGACCACCACACCCTCCAAACAAAGCAACAACAAGTACGCGGCCAGCAGCTA

I've tried using the airr functions to read the filtered_contig_annotations.csv in and these work perfectly:

>>> df = airr.load_rearrangement("xxx/dandelion_inputs/BCR/PBMC_E9LEH7P8/filtered_contig_annotations.csv")
>>> df
    barcode,is_cell,contig_id,high_confidence,length,chain,v_gene,d_gene,j_gene,c_gene,full_length,productive,cdr3,cdr3_nt,reads,umis,raw_clonotype_id,raw_consensus_id
0    AAACCTGCAACTGCTA-1,true,AAACCTGCAACTGCTA-1_con...                                                                
1    AAATGCCAGAGACTAT-1,true,AAATGCCAGAGACTAT-1_con...                                                                
2    AAATGCCAGAGACTAT-1,true,AAATGCCAGAGACTAT-1_con...                                                                
3    AAATGCCAGTACCGGA-1,true,AAATGCCAGTACCGGA-1_con...                                                                
4    AAATGCCAGTACCGGA-1,true,AAATGCCAGTACCGGA-1_con...                                                                
..                                                 ...                                                                
494  TTTCCTCTCTGATTCT-1,true,TTTCCTCTCTGATTCT-1_con...                                                                
495  TTTGTCAAGACCCACC-1,true,TTTGTCAAGACCCACC-1_con...                                                                
496  TTTGTCAAGACCCACC-1,true,TTTGTCAAGACCCACC-1_con...                                                                
497  TTTGTCATCCAGAAGG-1,true,TTTGTCATCCAGAAGG-1_con...                                                                
498  TTTGTCATCCAGAAGG-1,true,TTTGTCATCCAGAAGG-1_con...                                                                

So I'm a bit lost! Do you have any suggestions as to what might be going on? As I say I wonder if something has changed between versions because this was working with the same data with the previous dandelion version.

Cheers!

@christie-ga for reference

Thanks Ben, can you send me a couple of the tigger output folders so i can do some testing? It should be the folder names corresponding to the individuals.

at this stage, it seems like the validator in CreateGermlines.py is failing the file so i have to see what's going on there, and possibly raise it as an issue for the changeo developers if i can't fix it. Actually, if you just run CreateGermlines.py directly on the genotyped.tsv file in the tigger output folder, does it work?

Cheers Kelvin, will send via email now.

I've run change-o CreateGermlines.py on the folder where it was getting stuck:

python ~/.conda/pkgs/changeo-1.1.0-pyhdfd78af_0/python-scripts/CreateGermlines.py -d E9LEH7P8_heavy_igblast_db-pass_genotyped.tsv -g dmask \
-r E9LEH7P8_heavy_igblast_db-pass_genotype.fasta \
~/dandelion/container/database/germlines/imgt/human/vdj//imgt_human_IGHD.fasta \
~/dandelion/container/database/germlines/imgt/human/vdj//imgt_human_IGHJ.fasta \
--vf v_call_genotyped

I get the same error:

(base) [hpcjaco1@login-q-4 E9LEH7P8]$ python ~/.conda/pkgs/changeo-1.1.0-pyhdfd78af_0/python-scripts/CreateGermlines.py -d E9LEH7P8_heavy_igblast_db-pass_genotyped.tsv -g dmask \
> -r E9LEH7P8_heavy_igblast_db-pass_genotype.fasta \
> ~/dandelion/container/database/germlines/imgt/human/vdj//imgt_human_IGHD.fasta \
> ~/dandelion/container/database/germlines/imgt/human/vdj//imgt_human_IGHJ.fasta \
> --vf v_call_genotyped
     START> CreateGermlines
      FILE> E9LEH7P8_heavy_igblast_db-pass_genotyped.tsv
GERM_TYPES> dmask
 SEQ_FIELD> sequence_alignment
   V_FIELD> v_call_genotyped
   D_FIELD> d_call
   J_FIELD> j_call
    CLONED> False

PROGRESS> 09:04:59 |                    |   0% (  0) 0.0 minTraceback (most recent call last):
  File "/home/hpcjaco1/.conda/pkgs/changeo-1.1.0-pyhdfd78af_0/python-scripts/CreateGermlines.py", line 354, in <module>
    createGermlines(**args_dict)
  File "/home/hpcjaco1/.conda/pkgs/changeo-1.1.0-pyhdfd78af_0/python-scripts/CreateGermlines.py", line 148, in createGermlines
    for key, records in receptor_iter:
  File "/home/hpcjaco1/.conda/pkgs/changeo-1.1.0-pyhdfd78af_0/python-scripts/CreateGermlines.py", line 133, in <genexpr>
    receptor_iter = ((x.sequence_id, [x]) for x in db_iter)
  File "/home/hpcjaco1/.local/lib/python3.7/site-packages/changeo/IO.py", line 75, in __next__
    record = next(self.reader)
  File "/home/hpcjaco1/.local/lib/python3.7/site-packages/airr/io.py", line 99, in __next__
    raise ValueError('row has extra data')
ValueError: row has extra data

I've had a look at the files and can't spot an obvious issue with funny characters or spaces etc. that might be the cause. I've also run the singularity dandelion pre-processing individually on these samples (i.e. without a meta file) and it works fine, so I think it must be something to do with the parsing of >1 sample from an individual.

Thanks again for your help! It's such a useful package!

Out of curiosity, what if you add a prefix column instead of just sample_id?

Hey Kelvin, sorry for the delay - this just finished. Exactly the same error despite adding a 'prefix' column to the --meta file...

Thanks @benjacobs123456

It looks like the concatenated file is malformed. You can see that there are extra columns with no headers:

[screenshot: the concatenated TSV opened in a spreadsheet, with unnamed extra columns from column BY onwards]

This explains what the 'row has extra data' error means.

Can you send me the original cellranger output files (associated with the file you sent me?) prior to running dandelion so i can chase this down further?

Thanks Kelvin, have emailed you the raw files (which look fine)

Dear both @benjacobs123456 @christie-ga

I think I know what the problem is now, just having a think about how to fix it. It's related to the 'empty' filtered_contig_d_blast.tsv file for the CSF sample, probably correlated with the low number of contigs. Specifically, what's tripping up the preprocessing script is this step:

if blast_result.shape[0] < 1:
    blast_result = None
if blast_result is not None:

The file only has the header but no actual rows, so the shape is 0, which toggles blast_result to None and essentially skips the rest of the transfer (no D columns transferred). However, the PBMC sample did have D annotations transferred. This ends up with a column mismatch during the concatenation, which then snowballs into the current issue 😅
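The general shape of a fix for this failure mode (a sketch of the principle, not necessarily what dandelion's actual patch does) is to union the columns across samples before writing, padding missing fields so the concatenated table stays rectangular. The rows below are hypothetical stand-ins for the two per-sample tables:

```python
import csv
import io

# Hypothetical stand-ins: the PBMC sample had D annotations
# transferred from the blast results; the CSF sample (empty blast
# output) did not.
rows_pbmc = [{"sequence_id": "a", "v_call": "IGHV1-2", "d_call": "IGHD3-10*01"}]
rows_csf = [{"sequence_id": "b", "v_call": "IGHV3-7"}]

# Union the headers across samples, preserving order of first
# appearance, then write with restval="" so every row gets the same
# number of fields -- no ragged rows for airr's reader to choke on.
fieldnames = list(dict.fromkeys(k for r in rows_pbmc + rows_csf for k in r))
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t", restval="")
writer.writeheader()
writer.writerows(rows_pbmc + rows_csf)
print(buf.getvalue())
```

With this, the CSF row simply carries an empty `d_call` field instead of silently having one column fewer than the header.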

Sorry for that!

Let me have a think about the way to fix this and I will make another image available to you shortly.

Amazing thanks so much Kelvin! Great diagnostics :)