GWW/scsnv

How do I...

gitUser128954 opened this issue · 1 comments

I think this is a great implementation of a fundamental concept with clear utility. But, how do I read the data from a "serialized python flammkuchen" file into R? Is there a clean handoff from a flammkuchen file to downstream analysis pipeline(s)? Any recommendations on how to identify cell clusters from the mutation and expression data (via R or ‘not R’)?

GWW commented

Hi,

I do have a subcommand for scsnvmisc that converts the annotated pileup h5 file to reference and alternative counts in Matrix Market format, plus a VCF file. I don't use R very much, but I am sure there are libraries that can parse the VCF and Matrix Market files.

The pileup annotate tool also writes an annotated text file with information about each SNV (pileup_passed_snvs.txt.gz). The conversion subcommand requires a tab-separated chromosome lengths text file for the VCF header. You can generate this file from a samtools faidx index:

samtools faidx genome.fa
cut -f 1,2,3 genome.fa.fai > chrom_lengths.txt

For example, this will write all sites that do not overlap annotated RNA edits:

scsnvmisc snv2vcfmtx -r chrom_lengths.txt -f genome.fa -o output_folder -e -c pileup_annotated.h5

This will produce:

output_folder/barcodes.txt #list of barcodes
output_folder/snvs.vcf #basic SNV vcf file
output_folder/refs.mtx #Reference count Matrix Market file
output_folder/alts.mtx #Alternative count Matrix Market file
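
As a rough sketch of how these files could be picked up downstream, the Python below reads barcodes.txt and the two count matrices using only the standard library, then computes an alternative allele fraction for every covered entry. The function names (read_mtx, load_counts, alt_fractions) are made up for illustration and are not part of scsnv. It assumes the .mtx files are in the common "coordinate, general" Matrix Market layout and that barcodes.txt holds one barcode per line. It does not assume whether barcodes run along rows or columns, so compare len(barcodes) against the returned shape to find out.

```python
import os


def read_mtx(path):
    """Read a coordinate-format Matrix Market file.

    Returns (shape, entries), where entries maps zero-based
    (row, col) index pairs to the stored value.
    """
    shape = None
    entries = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, the %%MatrixMarket banner, and % comments.
            if not line or line.startswith("%"):
                continue
            fields = line.split()
            if shape is None:
                # First data line is the size line: rows, cols, entry count.
                shape = (int(fields[0]), int(fields[1]))
                continue
            # Matrix Market indices are one-based; convert to zero-based.
            key = (int(fields[0]) - 1, int(fields[1]) - 1)
            entries[key] = float(fields[2])
    return shape, entries


def load_counts(folder):
    """Load barcodes plus reference and alternative counts from a folder."""
    with open(os.path.join(folder, "barcodes.txt")) as handle:
        barcodes = [line.strip() for line in handle if line.strip()]
    ref_shape, refs = read_mtx(os.path.join(folder, "refs.mtx"))
    alt_shape, alts = read_mtx(os.path.join(folder, "alts.mtx"))
    if ref_shape != alt_shape:
        raise ValueError("refs.mtx and alts.mtx have different shapes")
    return barcodes, ref_shape, refs, alts


def alt_fractions(refs, alts):
    """Return alt / (ref + alt) for every entry with at least one read."""
    fractions = {}
    for key in set(refs) | set(alts):
        ref_count = refs.get(key, 0.0)
        alt_count = alts.get(key, 0.0)
        total = ref_count + alt_count
        if total > 0:
            fractions[key] = alt_count / total
    return fractions
```

Calling load_counts("output_folder") and passing the two returned dictionaries to alt_fractions gives a per-entry allele fraction. For real datasets a sparse matrix library is the more usual choice: scipy.io.mmread in Python or Matrix::readMM in R both read this file format.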

Unfortunately, I have not done much work clustering mutations.