error in scsnvmisc cells
monoplasty opened this issue · 7 comments
Hi,
Thank you for your amazing tool!
I encountered the error when I did scsnvmisc cells.
The code and error is bellow.
(scsnv) [root@scRNA scsnv]# scsnvmisc cells -o sample_data/ sample_data/map/mapsummary.h5
Error {key} not in the barcode data columns
(scsnv) [root@scRNA scsnv]# scsnvmisc cells --skip-mt -o sample_data/ sample_data/map/mapsummary.h5
Skipping MT DNA filtering
Total Passed Cells 164
(19566, 164) (19566, 164)
/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/flammkuchen-1.0.2-py3.8.egg/flammkuchen/hdf5io.py:302: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['gene_id', 'gene_name'], dtype='object')]
store.put(group._v_pathname + "/" + name, level)
(scsnv) [root@scRNA scsnv]# h5ls sample_data/map/mapsummary.h5
barcode_ids Dataset {182032}
barcode_rates Group
barcodes Dataset {182032}
exonic Group
gene_ids Dataset {21188}
gene_names Dataset {21188}
intronic Group
(scsnv) [root@scRNA scsnv]# h5ls sample_data/map/mapsummary.h5/barcode_rates
align_ambiguous Dataset {182032}
align_antisense Dataset {182032}
align_barcode_corrected Dataset {182032}
align_cdna Dataset {182032}
align_intergenic Dataset {182032}
align_intronic Dataset {182032}
align_multimapped Dataset {182032}
align_tag_qa_fail Dataset {182032}
align_total_reads Dataset {182032}
align_umi_qa_fail Dataset {182032}
align_unmapped Dataset {182032}
discarded_reads Dataset {182032}
field_order Dataset {19}
genes Dataset {182032}
intron_molecules Dataset {182032}
intron_pcr_dups Dataset {182032}
intron_reads Dataset {182032}
molecules Dataset {182032}
pcr_dups Dataset {182032}
reads Dataset {182032}
Can I know where I made a mistake through the information I provided? Looking forward to your reply.
Hi,
It looks like your second run with the --skip-mt
option worked. It just issued a warning about strings in the hdf5 file.
Thank you Gavin.
I'm trying to run the annotation currently. For the repeatmasker and REDIPortal files, I downloaded the repeatmasker file from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/
and get hg38.fa.out.gz
file.
The complete execution code is as follows:
(scsnv) [root@scRNA ref_gene]# zless hg38.fa.out.gz | head
SW perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
463 1.3 0.6 1.7 chr1 10001 10468 (248945954) + (TAACCC)n Simple_repeat 1 471 (0) 1
3612 11.4 21.5 1.3 chr1 10469 11447 (248944975) C TAR1 Satellite/telo (399) 1712 483 2
484 25.1 13.2 0.0 chr1 11505 11675 (248944747) C L1MC5a LINE/L1 (2382) 395 199 3
239 29.4 1.9 1.0 chr1 11678 11780 (248944642) C MER5B DNA/hAT-Charlie (74) 104 1 4
318 23.0 3.7 0.0 chr1 15265 15355 (248941067) C MIR3 SINE/MIR (119) 143 49 5
18 23.2 0.0 1.9 chr1 15798 15849 (248940573) + (TGCTCC)n Simple_repeat 1 52 (0) 6
18 13.7 0.0 0.0 chr1 16713 16744 (248939678) + (TGG)n Simple_repeat 1 32 (0) 7
(scsnv) [root@scRNA ref_gene]# zless REDIportal_hg38.txt.gz | head
Region Position Ref Ed Strand db type dbsnp repeat Func.wgEncodeGencodeBasicV34 Gene.wgEncodeGencodeBasicV34 GeneDetail.wgEncodeGencodeBasicV34 ExonicFunc.wgEncodeGencodeBasicV34 AAChange.wgEncodeGencodeBasicV34 Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene Func.knownGene Gene.knownGene GeneDetail.knownGene ExonicFunc.knownGene AAChange.knownGene phastConsElements100way
chr1 87158 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87168 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87171 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87189 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87218 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87225 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87231 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87242 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87248 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
(scsnv) [root@scRNA ref_gene]# wget ftp://ftp.ensembl.org/pub/release-99/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
(scsnv) [root@scRNA ref_gene]# wget ftp://ftp.ensembl.org/pub/release-99/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz.csi
(scsnv) [root@scRNA ref_gene]# wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz
(scsnv) [root@scRNA ref_gene]# wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz.tbi
(scsnv) [root@scRNA ref_gene]# bcftools view -m2 -M2 -v snps -i "MAF>=0.01" -o 1000GENOMES-filtered.vcf.gz -O z 1000GENOMES-phase_3.vcf.gz
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
[W::vcf_parse_info] INFO 'AFR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'AMR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EAS' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EUR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'SAS' is not defined in the header, assuming Type=String
(scsnv) [root@scRNA ref_gene]# bcftools +fill-tags ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz -- -t AF | bcftools view -m2 -M2 -v snps -i "MAF>=0.01" -o ALL.chrMT-filtered.vcf.gz
(scsnv) [root@scRNA ref_gene]# zgrep --no-filename -E "^(#)" -v ALL.chrMT-filtered.vcf.gz | gzip > ALL.chrMT-filtered-no-header.vcf.gz
gzip: -v.gz: No such file or directory
(scsnv) [root@scRNA ref_gene]# zgrep --no-filename -E "^(#)\1+" -v 1000GENOMES-filtered.vcf.gz ALL.chrMT-filtered-no-header.vcf.gz | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
gzip: -v.gz: No such file or directory
(scsnv) [root@scRNA ref_gene]# ls -lh
total 2.1G
-rw-r--r-- 1 root root 242M Jun 7 16:10 1000GENOMES-filtered.vcf.gz
-rw-r--r-- 1 root root 1.6G Jun 7 15:54 1000GENOMES-phase_3.vcf.gz
-rw-r--r-- 1 root root 1.8M Jun 7 15:26 1000GENOMES-phase_3.vcf.gz.csi
-rw-r--r-- 1 root root 1.6K Jun 7 16:49 1000GENOMES.txt.gz
-rw-r--r-- 1 root root 6.2K Jun 7 16:23 ALL.chrMT-filtered-no-header.vcf.gz
-rw-r--r-- 1 root root 55K Jun 7 16:12 ALL.chrMT-filtered.vcf.gz
-rw-r--r-- 1 root root 202K Jun 7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz
-rw-r--r-- 1 root root 119 Jun 7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz.tbi
-rw-r--r-- 1 root root 11K Jun 7 15:20 GRCh38_UCSC2ensembl.txt
-rw-r--r-- 1 root root 173M Jun 7 15:01 hg38.fa.out.gz
drwxr-xr-x 2 root root 4.0K Jun 5 18:46 index
-rw-r--r-- 1 root root 87M Jun 7 10:37 REDIportal_hg38.txt.gz
(scsnv) [root@scRNA ref_gene]# scsnvmisc annotate -r ref_gene/hg38.fa.out.gz -d ref_gene/1000GENOMES.txt.gz -e ref_gene/REDIportal_hg38.txt.gz sample_data/pileup/pileup
Processing sample_data/pileup/pileup
Kept 11361 out of 11641
[ True True True ... True True True] (11641,) (11641, 164) (11641, 164)
Parsing 1000 Genomes file
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/scsnv/bin/scsnvmisc", line 4, in <module>
__import__('pkg_resources').run_script('scsnvpy==1.0', 'scsnvmisc')
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1559, in run_script
exec(code, namespace, namespace)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/EGG-INFO/scripts/scsnvmisc", line 37, in <module>
cmds.COMMANDS[cmd](args)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annotate.py", line 61, in annotate_cmd
aux.annotate_g1000(ann._snvs, args.g1000)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annaux.py", line 93, in annotate_g1000
df = pd.read_csv(fg1000, sep="\t", dtype={'chrom':object}, names=['chrom', 'pos', 'ref', 'alt'], header=0)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 812, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 889, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 4 and found 1
When I run the following command, it works fine.
(scsnv) [root@scRNA scsnv]# scsnvmisc annotate -e ref_gene/REDIportal_hg38.txt.gz sample_data/pileup/pileup
Processing sample_data/pileup/pileup
Kept 11361 out of 11641
[ True True True ... True True True] (11641,) (11641, 164) (11641, 164)
Annotating edits
REDIportal 202
(scsnv) [root@scRNA scsnv]# ls -lh sample_data/pileup
total 59M
-rw-r--r-- 1 root root 4.4M Jun 7 17:03 pileup_annotated.h5
-rw-r--r-- 1 root root 54M Jun 7 10:08 pileup_barcode_matrices.h5
-rw-r--r-- 1 root root 3.0K Jun 7 10:08 pileup_barcodes.txt.gz
-rw-r--r-- 1 root root 251 Jun 7 17:10 pileup_filtering.txt
-rw-r--r-- 1 root root 244K Jun 7 17:03 pileup_passed_snvs.txt.gz
-rw-r--r-- 1 root root 247K Jun 7 10:08 pileup.txt.gz
Can the result obtained in this way be used for the scsnv snvcounts
command?
Hope to get your reply.
I'm just starting to learn about transcriptomes. So there will be a lot of questions, thank you very much for your patient reply, thank you again!
(scsnv) [root@scRNA ref_gene]# zgrep --no-filename -E "^(#)" -v ALL.chrMT-filtered.vcf.gz | gzip > ALL.chrMT-filtered-no-header.vcf.gz
gzip: -v.gz: No such file or directory
It seems like there's something wrong with your environment as each zgrep
commands is giving a strange error. As an alternative you can try:
bcftools concat ALL.chrMT-filtered.vcf.gz 1000GENOMES-filtered.vcf.gz | bcftools view --no-header | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
Let me know if it works and I will replace the documentation with this.
On second thought you can just skip downloaded the MT 1000 genomes file and just use:
bcftools view --no-header -m2 -M2 -v snps -i "MAF>=0.01" 1000GENOMES-phase_3.vcf.gz | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
First I run this command:
(scsnv) [root@scRNA ref_gene]# bcftools concat ALL.chrMT-filtered.vcf.gz 1000GENOMES-filtered.vcf.gz | bcftools view --no-header | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
Checking the headers and starting positions of 2 files
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
Different number of samples in 1000GENOMES-filtered.vcf.gz. Perhaps "bcftools merge" is what you are looking for?
Failed to read from standard input: unknown file type
Then I run the following command:
(scsnv) [root@scRNA ref_gene]# bcftools view --no-header -m2 -M2 -v snps -i "MAF>=0.01" 1000GENOMES-phase_3.vcf.gz | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
[W::vcf_parse_info] INFO 'AFR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'AMR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EAS' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EUR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'SAS' is not defined in the header, assuming Type=String
(scsnv) [root@scRNA ref_gene]# ls -lh
total 2.1G
-rw-r--r-- 1 root root 242M Jun 7 16:10 1000GENOMES-filtered.vcf.gz
-rw-r--r-- 1 root root 1.6G Jun 7 15:54 1000GENOMES-phase_3.vcf.gz
-rw-r--r-- 1 root root 1.8M Jun 7 15:26 1000GENOMES-phase_3.vcf.gz.csi
-rw-r--r-- 1 root root 39M Jun 8 09:11 1000GENOMES.txt.gz
-rw-r--r-- 1 root root 6.2K Jun 7 16:23 ALL.chrMT-filtered-no-header.vcf.gz
-rw-r--r-- 1 root root 55K Jun 7 16:12 ALL.chrMT-filtered.vcf.gz
-rw-r--r-- 1 root root 202K Jun 7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz
-rw-r--r-- 1 root root 119 Jun 7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz.tbi
-rw-r--r-- 1 root root 11K Jun 7 15:20 GRCh38_UCSC2ensembl.txt
-rw-r--r-- 1 root root 173M Jun 7 15:01 hg38.fa.out.gz
drwxr-xr-x 2 root root 4.0K Jun 5 18:46 index
-rw-r--r-- 1 root root 87M Jun 7 10:37 REDIportal_hg38.txt.gz
(scsnv) [root@scRNA ref_gene]# zcat 1000GENOMES.txt.gz | head
1 11012 C G
1 13110 G A
1 13116 T G
1 13118 A G
1 13273 G C
1 14464 A T
1 14930 A G
1 14933 G A
1 15774 G A
1 15777 A G
(scsnv) [root@scRNA scsnv]# scsnvmisc annotate -r ref_gene/hg38.fa.out.gz -d ref_gene/1000GENOMES.txt.gz -e ref_gene/REDIportal_hg38.txt.gz sample_data/pileup/pileup
Processing sample_data/pileup/pileup
Kept 11361 out of 11641
[ True True True ... True True True] (11641,) (11641, 164) (11641, 164)
Parsing 1000 Genomes file
Annotating the SNVs
Found in 1000 Genomes: 0
Annotating repeats
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/scsnv/bin/scsnvmisc", line 4, in <module>
__import__('pkg_resources').run_script('scsnvpy==1.0', 'scsnvmisc')
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1559, in run_script
exec(code, namespace, namespace)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/EGG-INFO/scripts/scsnvmisc", line 37, in <module>
cmds.COMMANDS[cmd](args)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annotate.py", line 62, in annotate_cmd
aux.annotate_repeats(ann._snvs, args.repeats)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annaux.py", line 151, in annotate_repeats
df = pd.read_csv(frep, sep="\t", skiprows=1, header=None, usecols=[5,6,7,9,10,11,12],
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 812, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 889, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 7 and found 1
Sorry to keep asking you such stupid questions, and thank you so much for your patience! sincere thanks!
Hi,
The repeatmasker table you downloaded is the wrong table. You want to download the whole table from here.
The other thing to check is to make sure the chromosomes in your 1000 genomes text file match the chromosome style you used for cell ranger. Ie. chr1 or 1 etc.
Thanks a lot for your thoughtful reply! all the best