dcjones/proseg

Proseg fails to read any transcripts from some .csv.gz files


Hi,

I was coming across a lot of missing FOVs and saw that you recently updated proseg to 1.0.6, which works better with CosMx data and may help with that issue (#26). However, I now run into an error. I'll put the error and a sample of the transcript file below. Note: I managed to run this on 1.0.5 by editing the column names, so the data should be OK. Removing the --use-cell-initialization flag results in the same error.

(base) gordonbeattie@192 L1_SU500 % proseg -V
proseg 1.0.6
(base) gordonbeattie@192 L1_SU500 % proseg --cosmx L1_SU500_tx_file.csv.gz --use-cell-initialization
Using 8 threads
thread 'main' panicked at /Users/gordonbeattie/.cargo/registry/src/index.crates.io-6f17d22bba15001f/proseg-1.0.6/src/main.rs:511:18:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
(base) gordonbeattie@192 L1_SU500 % gzip -cd L1_SU500_tx_file.csv.gz| head
fov,cell_ID,cell,x_local_px,y_local_px,x_global_px,y_global_px,z,target,CellComp
1,0,c_1_1_0,4243,67,66309.5474243164,55391.5182749431,7,Scd2,None
1,0,c_1_1_0,4243,1035,66309.4480832418,54423.7534205119,1,Tmsb4x,None
1,0,c_1_1_0,4242,1627,66308.8361422221,53831.2435150147,3,Atp1a2,None
1,0,c_1_1_0,4242,1898,66308.2679112752,53560.6026649475,1,Pcp4,None
1,0,c_1_1_0,4243,2704,66309.7858428955,52753.8736661275,1,Pfkm,None
1,0,c_1_1_0,4243,3836,66309.1977437337,51622.1841176351,1,Cst7,None
1,0,c_1_1_0,4243,3880,66309.7858428955,51577.8342882792,5,Mdh1,None
1,0,c_1_1_0,4242,3863,66308.2281748454,51595.2348709107,3,Camk2a,None
1,0,c_1_1_0,4242,3846,66308.2679112752,51611.9321187337,5,Rps9,None

Thanks in advance for any assistance!

All the best,
Gordon

Hi Gordon,

It seems that no transcripts were read for some reason. I'm not sure what's going on here, but can you confirm that the cell_ID column has some non-zero values in this data?
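One way to run that check without loading the whole file into memory is sketched below in Python. It assumes the column is named cell_ID, as in the header shown above; `count_cell_ids` is a hypothetical helper written for this thread, not part of proseg.

```python
import csv
import gzip
from collections import Counter


def count_cell_ids(path, column="cell_ID"):
    """Count rows whose `column` value is "0" versus anything else."""
    counts = Counter()
    # Stream the gzipped CSV row by row so large transcript files fit in memory.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Plain string comparison: assumes unassigned transcripts are written as "0".
            key = "zero" if row[column].strip() == "0" else "nonzero"
            counts[key] += 1
    return counts
```

Calling `count_cell_ids("L1_SU500_tx_file.csv.gz")` returns a `Counter` with `"zero"` and `"nonzero"` entries; a missing or zero `"nonzero"` count would mean no transcript is assigned to a cell.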

Thanks for the response, I can confirm the cell_ID has some non-zero values, although most of them are 0. I'll put a few metrics below to give a little more insight.

> head(table(tx.list$Nanostring$cell_ID))
       0        1        2        3        4        5 
18825028    60665    53141    58098    63807    59214 

> length(unique(tx.list$Nanostring$cell_ID))
[1] 1484

> length(unique(tx.list$Nanostring$fov))
[1] 169

> length(unique(tx.list$Nanostring$cell))
[1] 160243

Having the same issue trying to run on CosMx data

proseg --cosmx Diana_HEM_CR_FF_EM_NR4A145koko315pA7_STJ_N_R1_tx_file.csv.gz
Using 192 threads
thread 'main' panicked at /home/fsegato/.cargo/registry/src/index.crates.io-6f17d22bba15001f/proseg-1.1.0/src/main.rs:521:18:
index out of bounds: the len is 0 but the index is 0
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Having the same issue with CosMx Data

proseg --cosmx S22113961S22113960_tx_file.csv.gz --use-cell-initialization
Using 24 threads
thread 'main' panicked at /home/asmilags/.cargo/registry/src/index.crates.io-6f17d22bba15001f/proseg-1.1.3/src/main.rs:522:18:
index out of bounds: the len is 0 but the index is 0
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_bounds_check
3: proseg::main
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.

And the same with Xenium data processed with xenium_ranger version 3:

proseg --xenium transcripts.csv.gz
Using 2 threads
thread 'main' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/proseg-1.0.0/src/main.rs:474:18:
index out of bounds: the len is 0 but the index is 0
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Each of these reported errors is due to no transcripts being read by proseg, but I've not been able to suss out where the loss might be occurring, and I haven't been able to reproduce it.

If someone would be so kind as to email (or otherwise send) me a dataset that reproduces it, I'll fix this right away. I suspect that if this error happens with the full transcripts file, the first 10k lines or so should generate the same error and be small enough to email.
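For anyone preparing such a subset, here is a minimal Python sketch that copies the header plus the first 10,000 data lines into a new gzipped file. `head_gz` and the output file name are hypothetical, chosen only for illustration.

```python
import gzip
from itertools import islice


def head_gz(src, dst, n_rows=10_000):
    """Copy the header line plus the first `n_rows` data lines of a gzipped CSV."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        # islice stops early, so only the start of the file is decompressed.
        lines = list(islice(fin, n_rows + 1))
        fout.writelines(lines)
    # Number of lines written, header included.
    return len(lines)
```

For example, `head_gz("transcripts.csv.gz", "transcripts_head.csv.gz")` writes the subset. It is worth rerunning proseg on the smaller file first to confirm that it still triggers the panic.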

I just released version 1.1.5, which I believe fixes this issue. Thanks for your patience, and thanks to @dbuszta for sharing a dataset to reproduce the issue.