basilkhuder/Seurat-to-RNA-Velocity

seurat object to RNA velocity question

skim245 opened this issue · 15 comments

Thank you for sharing the tutorials with code!
I'm new to Python and stuck at the first step: generating the unspliced and spliced RNA counts for RNA velocity analysis.
I have selected a cell population in a Seurat object and would like to know whether I can extract the unspliced and spliced counts from it, or whether I have to go back to the original sequencing files, i.e. the BAM/FASTQ files?

Unfortunately, you'll have to go back to your BAM/FASTQ files. A Seurat object doesn't contain the information needed to run RNA velocity. I used to prefer velocyto's run command to generate a loom file (which contains the unspliced and spliced counts you need):

velocyto run -b filtered_barcodes.tsv -o output_path -m repeat_msk_srt.gtf possorted_genome_bam.bam reference_annotation.gtf

The full reference can be found here. The -m option (the repeat-mask GTF) is optional.
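To make the optional mask concrete, here is a minimal sketch that prints the velocyto run command with or without -m. The helper name build_velocyto_run is hypothetical, and the barcode, output, BAM and GTF names are simply the placeholders from the command above. It only prints the command, it does not run velocyto.

```shell
# build_velocyto_run is a hypothetical helper: it prints a velocyto run command.
# Arguments: BAM file, annotation GTF, and an optional repeat-mask GTF.
build_velocyto_run() {
  bam="$1"; gtf="$2"; mask="${3:-}"
  cmd="velocyto run -b filtered_barcodes.tsv -o output_path"
  # Add -m only when a mask file was given.
  if [ -n "$mask" ]; then
    cmd="$cmd -m $mask"
  fi
  echo "$cmd $bam $gtf"
}

# Without the mask:
build_velocyto_run possorted_genome_bam.bam reference_annotation.gtf
# With the mask:
build_velocyto_run possorted_genome_bam.bam reference_annotation.gtf repeat_msk_srt.gtf
```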

However, lately I have been using kallisto | bustools (kb-python, "KB" below) to generate loom files, as it is much faster than velocyto run (KB works from your FASTQ files). I just edited my tutorial to add a section on KB.

Hi basilkhuder,
I've managed to download the SRA files from the server and convert them to FASTQ files.
Based on your recent updates to the tutorial, I can use either the two-step KB process or velocyto's run command to generate a loom file, right?

I would like to try KB, as you recommend, and I want to make sure I understand the commands.
I've downloaded both the FASTA and GTF files for my species (from the Ensembl link provided in the tutorial):
Mus_musculus.GRCm38.100.gtf.gz
Mus_musculus.GRCm38.cdna.all.fa.gz

and, let's say, I have 5 FASTQ files:
1.fastq
2.fastq
3.fastq
4.fastq
5.fastq
If I run the commands below to use KB, I'm not sure where to put the reference file names and the FASTQ files.

1st step
kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno
fasta.fa
gtf.gtf

2nd step
kb count -i transcriptome.idx -g t2g.txt -x 10xv2 --lamanno --loom -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt read_1.fastq.gz read_2.fastq.gz

Could you elaborate further on these commands?

Thank you

Everything looks good! For the kb ref step, the fasta and GTF files just go at the very end:

kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno fasta.fa gtf.gtf

If you run into any errors trying the code, let me know.

I tried the command below:
kb ref -i transcriptome.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --lamanno Mus_musculus.GRCm38.cdna.all.fa Mus_musculus.GRCm38.100.gtf

With this command I could not use --workflow lamanno, because I got the error: unrecognized arguments: --workflow

If I use --lamanno instead, it runs and creates six files:
cdna.fa
cdna_t2c.txt
intron.fa
intron_t2c.txt
t2g.txt
transcriptome.idx

But I cannot run the second command, kb count; it gives this error message:

kb: error: unrecognized arguments: -f1 -f2 intron.fa

I think this is because all the files except transcriptome.idx are empty and are therefore not recognized.
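One way to confirm which of the kb ref outputs are empty is to test each file's size. A minimal sketch, assuming the six file names listed above; check_ref_outputs is a hypothetical helper, not part of kb-python:

```shell
# check_ref_outputs is a hypothetical helper, not part of kb-python.
# It reports, for each expected kb ref output, whether the file is
# missing, empty (zero bytes), or present with content.
check_ref_outputs() {
  dir="${1:-.}"
  status=0
  for f in cdna.fa cdna_t2c.txt intron.fa intron_t2c.txt t2g.txt transcriptome.idx; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $f"; status=1
    elif [ ! -s "$dir/$f" ]; then
      echo "empty:   $f"; status=1
    else
      echo "ok:      $f"
    fi
  done
  # Non-zero return means at least one file is missing or empty.
  return "$status"
}

check_ref_outputs . || echo "kb ref output looks incomplete"
```

If any file is reported as empty, the problem is in the kb ref step rather than in kb count.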

The --workflow lamanno option only works with the latest kb-python version. Either way, I think you are getting empty files because you are using the cDNA FASTA rather than the genome FASTA (the DNA primary assembly). Download these files instead:

wget ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

wget ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz

And run the following code:

kb ref -i transcriptome.idx -g t2g.txt -f1 cnda.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 introns_t2c.txt --lamanno Mus_musculus.GRCm38.dna.primary_assembly.fa.gz Mus_musculus.GRCm38.98.gtf.gz

I just tested this myself and it successfully generated the files needed for kb count.
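Since --workflow lamanno is only accepted by newer kb-python versions and older ones take --lamanno, one way to pick the right spelling is to look at the help text of the installed kb. This is a minimal sketch: velocity_flag is a hypothetical helper, and it assumes that the help output of kb ref mentions --workflow whenever that option is supported.

```shell
# velocity_flag is a hypothetical helper: it reads `kb ref --help` text on
# stdin and prints the velocity option that this kb-python install accepts.
velocity_flag() {
  if grep -q -- '--workflow'; then
    echo "--workflow lamanno"
  else
    echo "--lamanno"
  fi
}

# Only query kb when it is actually installed.
if command -v kb >/dev/null 2>&1; then
  flag="$(kb ref --help 2>&1 | velocity_flag)"
  echo "velocity option for this kb install: $flag"
else
  echo "kb not found on PATH"
fi
```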

Thank you, basilkhuder.

The command you provided seems to be running, but my terminal is stuck at the last step:
Indexing to transcriptome.idx

I'm running this on a Mac with 32GB of RAM.
It has been running for about 18 hours and I'm not sure whether that is normal.

kb ref -i transcriptome.idx -g t2g.txt -f1 cnda.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 introns_t2c.txt --lamanno Mus_musculus.GRCm38.dna.primary_assembly.fa.gz Mus_musculus.GRCm38.98.gtf.gz
[2020-05-16 03:01:29,525] INFO Decompressing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz to tmp
[2020-05-16 03:01:44,636] INFO Sorting tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa
[2020-05-16 03:07:41,644] INFO Decompressing Mus_musculus.GRCm38.98.gtf.gz to tmp
[2020-05-16 03:07:43,832] INFO Sorting tmp/Mus_musculus.GRCm38.98.gtf
[2020-05-16 03:08:28,293] INFO Splitting genome into cDNA at cnda.fa
[2020-05-16 03:09:12,983] INFO Creating cDNA transcripts-to-capture at cdna_t2c.txt
[2020-05-16 03:09:13,940] INFO Splitting genome into introns at intron.fa
[2020-05-16 03:12:25,162] INFO Creating intron transcripts-to-capture at cdna_t2c.txt
[2020-05-16 03:12:30,973] INFO Concatenating cDNA and intron FASTAs
[2020-05-16 03:12:38,094] INFO Creating transcript-to-gene mapping at t2g.txt
[2020-05-16 03:12:45,986] INFO Indexing to transcriptome.idx

I can't say why the process still seems to be running, but I can say that 32GB won't be enough RAM to run KB, let alone the rest of your RNA velocity analysis. In newer versions of KB there's a -n parameter that splits the index into parts, which requires less memory. But I am concerned that even if you get past creating your velocity index files, you'll run into issues when you import your single-cell data. Is there any way you can use a high-performance computer with a lot of RAM to run this analysis?
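A minimal sketch of what a split-index kb ref call could look like, assuming a kb-python version that accepts both --workflow lamanno and -n. The helper name build_kb_ref is hypothetical and the part count of 4 is arbitrary. It only prints the command. Whether kb count can then consume the split index is a separate question that this sketch does not address.

```shell
# build_kb_ref is a hypothetical helper: it prints a kb ref velocity command.
# Arguments: number of index parts, genome FASTA, annotation GTF.
# Assumes a kb-python version that accepts --workflow lamanno and -n.
build_kb_ref() {
  parts="$1"; fasta="$2"; gtf="$3"
  cmd="kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno"
  # Add -n only when more than one part is requested.
  if [ "$parts" -gt 1 ]; then
    cmd="$cmd -n $parts"
  fi
  echo "$cmd $fasta $gtf"
}

build_kb_ref 4 Mus_musculus.GRCm38.dna.primary_assembly.fa.gz Mus_musculus.GRCm38.98.gtf.gz
```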

Sadly, I do not have any alternatives. Do you think I might be able to get around the low-RAM issue by using server-based Python?

basilkhuder, I found a way to use a high-performance computer and successfully ran kb ref.
Now I'm having trouble with kb count, where I get this error:

"kb: error: unrecognized arguments: -f1 -f2 intron.fa"

Great! Can you send over your command? Also, which version of KB-Python did you install?

I've installed kb-python-0.24.4 and ran this command:

kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno -n 4 Mus_musculus.GRCm38.dna.primary_assembly.fa.gz Mus_musculus.GRCm38.98.gtf.gz

--which successfully generated 6 files
##cdna.fa
##cdna_t2c.txt
##intron.fa
##intron_t2c.txt
##t2g.txt
##transcriptome.idx

-- and then ran the command below

kb count -i transcriptome.idx -g t2g.txt -x 10xv3 --lamanno --loom -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt SRR10870267.fastq.gz fastq.gz SRR10870268.fastq.gz

-- and got this error
kb: error: unrecognized arguments: -f1 -f2 intron.fa SRR10870267.fastq.gz fastq.gz SRR10870268.fastq.gz

@skim245 - You don't need the -f1 and -f2 flags (or the FASTA files that follow them) for kb count.

@basilkhuder By any chance have you had successful results using the index splitting in the kb? I'm using the devel branch, and split my velocity index into 8 parts. But kb count doesn't seem to work. For example, if you use -i index.idx -n 8, you get back index.idx_cdna and then 7 index.idx_intron.x files (x from 0 to 6). Then, in kb count, I tried setting -i index.idx - but this isn't recognized.

@skim245 Apologies for the late reply! I don't get notifications for replies to issues. @skannan4 is completely right, omit those tags. I'm going to do a large edit on my tutorial in the upcoming days, so I'll get this fixed.
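For reference, a minimal sketch of the kb count call with those flags removed. The read file names are placeholders, and the remaining options are copied unchanged from the failing command, so they are assumed to be accepted by the installed kb-python version. The sketch prints the command for review rather than running it.

```shell
# Placeholder FASTQ names; substitute your own read files.
R1=read_1.fastq.gz
R2=read_2.fastq.gz

# Same options as the failing command, minus "-f1 cdna.fa -f2 intron.fa".
KB_COUNT="kb count -i transcriptome.idx -g t2g.txt -x 10xv3 --lamanno --loom -c1 cdna_t2c.txt -c2 intron_t2c.txt $R1 $R2"

# Print the command for review; to run it, use: eval "$KB_COUNT"
echo "$KB_COUNT"
```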

I haven't. However, I did a quick test run last night using those options and ran into exactly the same problem as you. I've had plenty of other problems with kb-python, which is a shame since velocyto run takes much, much longer and requires more resources.

Darn. Thanks for checking. I've raised the issue on the kb-python GitHub, so we'll see if they have a solution. In the meantime, I suppose I can just map on AWS. It's a little disappointing, because the reason I switched to kallisto was so I could map locally, but it is what it is. Thanks for this tutorial though, it's very useful to have.