mousepixels/sanbomics_scripts

Problem to load files in Scanpy

victorsanchezarevalo opened this issue · 13 comments

Hi!

I am having some issues with these files that I got from GEO:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi

GSM3577882_normal_panc_barcodes.tsv.gz
GSM3577882_normal_panc_genes.tsv.gz
GSM3577882_normal_matrix.mtx.gz

This is my code: adata = sc.read_10x_mtx('./', prefix='GSM3577882_normal_', var_names='gene_symbols', cache=True)

I cannot load those samples in Scanpy. I have tried different approaches, with no success. So I decided to download the SRR files from GEO and practice with Cell Ranger, following your YouTube video. Here I ran into another issue. These are the files for normal pancreas, which correspond to two samples:

SRR8485290.fastq
SRR8485291.fastq
SRR8485292.fastq
SRR8485293.fastq

I am not able to make Cell Ranger work. Could it be because of how the samples are named?

Should I run each sample individually? How should I proceed with the tumor samples? Should I run each sample independently in Cell Ranger and then integrate the results in Scanpy?

I am really interested in this dataset and I don't know what to do.

Thank you very much for your help and availability.

Victor

What is the error you get for the first command? For the second, you do need to rename the files or Cell Ranger won't work. I think I mention how to do it in the video. They should have R1 and R2 files as well. Are those interleaved?

The error that I got was KeyError: 2.

Regarding the second error, I have renamed the samples as you explained in the video, but I must be doing something wrong because it is not working. I don’t know if the samples are interleaved, but I am sure there are only two samples in the control mice.

Thank you very much for your help!

Víctor

Try putting them in their own directory instead of your working directory and rename them to the default 10x file names. Remove the prefix and var_names arguments. See if that works.

I put the samples in their own directory:

GSM3577883_early_KIC_/

Check the files in the directory

ls GSM3577883_early_KIC_/

GSM3577883_early_KIC_barcodes.tsv.gz GSM3577883_early_KIC_matrix.mtx.gz
GSM3577883_early_KIC_genes.tsv.gz

Run Scanpy:

adata=sc.read_10x_mtx('./GSM3577883_early_KIC_', gex_only= False, cache=True )

I tried:

adata=sc.read_10x_mtx('./GSM3577883_early_KIC_', var_names='gene_symbols', cache=True )

Nothing worked, this is the error:

FileNotFoundError: Did not find file GSM3577883_early_KIC_/matrix.mtx.gz.

But the file is in the directory.

I don't understand it.

Best

Regarding Cell Ranger, I have changed the names and it runs, but it gives me this error:

[error] Pipestance failed. Error log at:
normal_sample/SC_RNA_COUNTER_CS/SC_MULTI_CORE/MULTI_CHEMISTRY_DETECTOR/_GEM_WELL_CHEMISTRY_DETECTOR/DETECT_COUNT_CHEMISTRY/fork0/chnk0-u23993fce1d/_errors

Log message:
FASTQ header mismatch detected at line 4 of input files "/home/victor/normal/normal_S2_L001_R1_001.fastq" and "/home/victor/normal/normal_S2_L001_R2_001.fastq": file: "/home/victor/normal/normal_S2_L001_R1_001.fastq", line: 4

Waiting 6 seconds for UI to do final refresh.
Pipestance failed. Use --noexit option to keep UI running after failure.

2022-10-07 08:58:43 Shutting down.
Saving pipestance info to "normal_sample/normal_sample.mri.tgz"
For assistance, upload this file to 10x Genomics by running:

I ran head on the file and this is what I get:

@SRR8485292.1 NS500579:414:HMYLFBGX5:3:11401:8647:1026 length=99
CATTTCCTGGATGAATAATGTCATTGCCTCCAACTGAGCGACTTCCGGAGTCCAGTGGCCTCCCCAAGATGGCTCTTAGCTTTGCAATGAGACCTGAAG
+SRR8485292.1 NS500579:414:HMYLFBGX5:3:11401:8647:1026 length=99
/AAAAEEEEEEEEE/EAEEEEEEEEEEEEEEEEEEEEEE//AAEAEAEEEEEEEEEEAEEEAEEE/EE<EE<EEEE/A<EEEEE</A<EAE<<EAE/<A
@SRR8485292.2 NS500579:414:HMYLFBGX5:3:11401:12222:1026 length=99
ACCAAGGTCTGCAACTACGTGGACTGGATTCAGAACACAATTGCTGACAACTAGAGAACCCTAGTCTCTCTTCAATCAGTATTATCAATAAAGTTCATT
+SRR8485292.2 NS500579:414:HMYLFBGX5:3:11401:12222:1026 length=99
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEAE6EE/EEEEEEEEEAEEEE<EEEE</EEEEEEEEEEEEEEEAEEEEEEAEE/A<6/AEAEEE<EEEA
@SRR8485292.3 NS500579:414:HMYLFBGX5:3:11401:26735:1026 length=99
AAGGAAATGAGGAGAAAAGTATTTGTACTGTATAATGGAGGCTGACCAGAGCAGTTTAGGAGATTGTAAAGGGAGGTTTTGTGAAGTTCTAAAAGGTTC

Sorry for bothering you so much!

You need to rename the files in that folder (for the scanpy issue). Find the default file names. The error shows that it is looking for matrix.mtx.gz, but you have no file named matrix.mtx.gz in the folder.
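A quick way to see the mismatch is to list which of the expected names are absent from the folder. The sketch below is a minimal, hypothetical helper (`missing_10x_files` is not part of Scanpy). The three file names are taken from the error message and traceback in this thread and are assumed to be the ones the gzipped reader looks for.

```python
from pathlib import Path

# Names assumed from the error message and traceback in this thread.
EXPECTED = ("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz")


def missing_10x_files(directory, prefix=""):
    """Return the expected file names (with prefix applied) that are absent."""
    directory = Path(directory)
    return [
        prefix + name
        for name in EXPECTED
        if not (directory / (prefix + name)).is_file()
    ]
```

Calling `missing_10x_files('GSM3577883_early_KIC_')` on a folder whose files all still carry the GSM prefix should report all three names. Passing the prefix explicitly would still report the features file if the folder only holds a `genes.tsv.gz`.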

You can also just use the H5 file if you have it.

Can you head both the R1 and R2?

Is that only for the R1? Can you do it for the R2 too?
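One way to check this without reading the files by eye is to compare the read names record by record. The sketch below is a hypothetical helper, not part of Cell Ranger. It assumes uncompressed FASTQ files with four lines per record and treats the first whitespace-separated token of each header as the read name, which may not be exactly the rule Cell Ranger applies.

```python
from itertools import islice


def read_names(path, limit):
    """Yield the first header token of up to `limit` FASTQ records."""
    with open(path) as handle:
        for i, line in enumerate(islice(handle, 4 * limit)):
            if i % 4 == 0:  # header line of each 4-line record
                tokens = line.split()
                yield tokens[0] if tokens else ""


def first_header_mismatch(r1_path, r2_path, limit=1000):
    """Return (record_number, r1_name, r2_name) for the first mismatch, else None."""
    pairs = zip(read_names(r1_path, limit), read_names(r2_path, limit))
    for number, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            return number, name1, name2
    return None
```

A result at record 1 would suggest the two files are not mates of the same run at all, rather than a single corrupted read.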

I have renamed the files in the folder GSM3577883_early_KIC_:

barcodes.tsv.gz features.tsv.gz matrix.mtx.gz

Then I ran the code:

adata=sc.read_10x_mtx('./GSM3577883_early_KIC_', var_names='gene_symbols', cache=True )

I got this error:

KeyError Traceback (most recent call last)
Cell In [5], line 1
----> 1 adata=sc.read_10x_mtx('./GSM3577883_early_KIC_', var_names='gene_symbols', cache=True )

File ~/miniconda3/envs/scrnaseq/lib/python3.10/site-packages/scanpy/readwrite.py:490, in read_10x_mtx(path, var_names, make_unique, cache, cache_compression, gex_only, prefix)
488 genefile_exists = (path / f'{prefix}genes.tsv').is_file()
489 read = _read_legacy_10x_mtx if genefile_exists else _read_v3_10x_mtx
--> 490 adata = read(
491 str(path),
492 var_names=var_names,
493 make_unique=make_unique,
494 cache=cache,
495 cache_compression=cache_compression,
496 prefix=prefix,
497 )
498 if genefile_exists or not gex_only:
499 return adata

File ~/miniconda3/envs/scrnaseq/lib/python3.10/site-packages/scanpy/readwrite.py:571, in _read_v3_10x_mtx(path, var_names, make_unique, cache, cache_compression, prefix)
569 else:
570 raise ValueError("var_names needs to be 'gene_symbols' or 'gene_ids'")
--> 571 adata.var['feature_types'] = genes[2].values
572 adata.obs_names = pd.read_csv(path / f'{prefix}barcodes.tsv.gz', header=None)[
573 0
574 ].values
575 return adata

File ~/miniconda3/envs/scrnaseq/lib/python3.10/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
3803 if self.columns.nlevels > 1:
3804 return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
3806 if is_integer(indexer):
3807 indexer = [indexer]

File ~/miniconda3/envs/scrnaseq/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3800 return self._engine.get_loc(casted_key)
3801 except KeyError as err:
-> 3802 raise KeyError(key) from err
3803 except TypeError:
3804 # If we have a listlike key, _check_indexing_error will raise
3805 # InvalidIndexError. Otherwise we fall through and re-raise
3806 # the TypeError.
3807 self._check_indexing_error(key)

KeyError: 2

file:///home/victor/Im%C3%A1genes/Captura%20de%20pantalla%20de%202022-10-13%2008-39-54.png

This is what the features.tsv.gz file looks like.
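The traceback fails at `genes[2]`, which means the reader is asking for a third column in features.tsv.gz. A plausible cause, not confirmed in this thread, is that the file is the older two-column genes.tsv.gz under a new name, while the newer format carries a third feature-type column. A minimal sketch for padding such a file follows. `add_feature_type_column` is a hypothetical helper, and "Gene Expression" is assumed to be the right label for every row.

```python
import gzip


def add_feature_type_column(src, dst, feature_type="Gene Expression"):
    """Copy a gzipped, tab-separated feature table from src to dst.

    Rows with exactly two columns get feature_type appended as a third.
    Returns the number of rows that were padded. src and dst must differ.
    """
    padded = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt", newline="\n") as fout:
        for line in fin:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) == 2:
                fields.append(feature_type)
                padded += 1
            fout.write("\t".join(fields) + "\n")
    return padded
```

Writing to a new file and only then swapping it in keeps the GEO download intact if the assumption turns out to be wrong.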

head SRR8485290.fastq

@SRR8485290.1 NS500579:414:HMYLFBGX5:1:11101:24172:1037 length=99
NGAAANTNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNCNNGNGNNCCANGNAGACNAGAGCCATGCGCCGCCGGCTCACCCAGCACGAGGAGAAGCNNA
+SRR8485290.1 NS500579:414:HMYLFBGX5:1:11101:24172:1037 length=99
#AAAA#E##################/###########/##/#/##/EA#/#/A/<#/</6<<<AE//EAAAAA/AEEAE</6/A/EEE///AEAAA##/
@SRR8485290.2 NS500579:414:HMYLFBGX5:1:11101:18941:1037 length=99
NCGCCNANANNNNNNNNNNNNNNNNGNNNNNNNNNNNCNNCNCNNCGCNCNCGGCNCGCGGGGGGGGGGTGGGGGGGGTTGGGGGGGGGCGGGGGCNNC
+SRR8485290.2 NS500579:414:HMYLFBGX5:1:11101:18941:1037 length=99
#////#/#/################/###########/##E#/##/</#6#A///#<///////////E/A//////A///A/////E////EE//##/
@SRR8485290.3 NS500579:414:HMYLFBGX5:1:11101:16344:1037 length=99
NTGATNGNGNNNNNNNNNNNNNNNNANNNNNNCNNNNANNTNANCCCCNCNGCACNGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCACTGCCNNG

head SRR8485291.fastq

@SRR8485291.1 NS500579:414:HMYLFBGX5:2:11101:26051:1037 length=99
GCAATGGNNNNCTGCANNNNATNNTNNNNTGGGGCNNNGGCTGTNNCNAGCCAGNTNCTCCTGGTGTATACACCAAGGTCTGCAACTACGTGGACTGGA
+SRR8485291.1 NS500579:414:HMYLFBGX5:2:11101:26051:1037 length=99
AAAAAEE####EEEEE####EE##E####EEEEEE###/EEEEE##/#/EEEAE#A#EEEEEEEEEEAEEEEEEEAEEEEEA<AAEE//<A<6EAEAE/
@SRR8485291.2 NS500579:414:HMYLFBGX5:2:11101:15791:1037 length=99
ACCTCCANNNNCAATGNNNNCTNNGNNNNACCACTNNNTTGCCGNNNNTCTANTNGNCAGTGGCAGGTGCATGGCATCGTGAGCTTCGNCTCCTCTCTG
+SRR8485291.2 NS500579:414:HMYLFBGX5:2:11101:15791:1037 length=99
AAAAAEE####E/EEE####EE##E####/EEEEE###EEEAE/####EEEE#E#E#EEEEEEEEEAEEEEEEEEEEEEEE<AEAAAE#EE<<6/6<AA
@SRR8485291.3 NS500579:414:HMYLFBGX5:2:11101:15818:1037 length=99
CGCCAGTNNNNTCTCTNNNNGCNNGNNNNACTCTGNNNGCCCCCNNNNCTGCNANANGGCCGGAGTCGGGACCCTGGCCCGCATTGTGNCCTGGGGCAG
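The headers in these outputs carry Illumina-style read names, where the fourth colon-separated field is the lane. Read that way, the SRR8485290, SRR8485291 and SRR8485292 headers shown in this thread point to lanes 1, 2 and 3. That could mean the SRR files are separate lanes rather than R1/R2 mates, though this is an inference from the posted headers and not something confirmed here. A small sketch for pulling the lane out of such a header is below; `illumina_lane` is a hypothetical helper and assumes the SRA-style layout seen above (accession, read name, length).

```python
def illumina_lane(header):
    """Return the lane number from an SRA-style FASTQ header, or None."""
    tokens = header.split()
    if len(tokens) < 2:
        return None
    # Expected layout: instrument:run:flowcell:lane:tile:x:y
    fields = tokens[1].split(":")
    if len(fields) < 7 or not fields[3].isdigit():
        return None
    return int(fields[3])


print(illumina_lane("@SRR8485290.1 NS500579:414:HMYLFBGX5:1:11101:24172:1037 length=99"))  # 1
print(illumina_lane("@SRR8485292.1 NS500579:414:HMYLFBGX5:3:11401:8647:1026 length=99"))   # 3
```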