natsuhiko/rasqual

ASVCF Memory Usage

SaideepGona opened this issue · 6 comments

Are there any guidelines on the memory consumption of createASVCF.sh? I've run out of memory even with 150 GB allocated, and I can't tell whether this is normal or whether I'm doing something wrong. Some usage guidelines on this would help.

In addition, ASEReadCounter doesn't output BAM files but rather count tables. Is this format accepted automatically? If not, how do we link the count tables to the VCF?

Hi,

How many samples do you have?

The ASEReadCounter output has to be combined with the VCF file manually.
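A rough sketch of what that manual step could look like, in Python. It assumes the standard ASEReadCounter table columns (`contig`, `position`, `refCount`, `altCount`) and assumes the target is an `AS` subfield of the form `ref,alt` appended to each sample's genotype column; please check that against the RASQUAL README before relying on it. The function names are hypothetical, and filling missing sites with `0,0` is also an assumption.

```python
import csv
import io


def read_ase_counts(table_text):
    """Parse one sample's ASEReadCounter table into {(contig, position): (ref, alt)}."""
    counts = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        key = (row["contig"], int(row["position"]))
        counts[key] = (int(row["refCount"]), int(row["altCount"]))
    return counts


def add_as_field(vcf_line, sample_counts):
    """Append an AS subfield to one VCF data line.

    sample_counts holds one dict per sample, in the same order as the
    sample columns of the VCF. Sites absent from a table get 0,0 (assumption).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    n_samples = len(fields) - 9
    if n_samples != len(sample_counts):
        raise ValueError("number of count tables does not match sample columns")
    key = (fields[0], int(fields[1]))
    fields[8] += ":AS"  # extend the FORMAT column
    for i, counts in enumerate(sample_counts):
        ref, alt = counts.get(key, (0, 0))
        fields[9 + i] += f":{ref},{alt}"
    return "\t".join(fields)


# Tiny in-memory example with two samples; the second has no counts at this site.
table = (
    "contig\tposition\tvariantID\trefAllele\taltAllele\trefCount\taltCount\ttotalCount\n"
    "chr1\t100\trs1\tA\tG\t3\t5\t8\n"
)
vcf_line = "chr1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1"
merged = add_as_field(vcf_line, [read_ase_counts(table), {}])
print(merged)
```

Header lines (starting with `#`) would be passed through unchanged, ideally with a matching `##FORMAT` line added for `AS`.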

Best regards,
Natsuhiko

I have 35 samples in this run.

I see. It's somewhat simpler on my end to just run a single memory-heavy job to do the work, but filtering and manual assignment would be the more distributed option.

By the way, I made a fork at https://github.com/SaideepGona/rasqual and have been working on a SLURM-compatible luigi pipeline to help automate the entire process (currently for RNA-seq). As the primary author, you might find this interesting, and I would appreciate your feedback, as there are many moving parts.

I found this: https://github.com/walaj/VariantBam

It allows filtering a BAM file based on a VCF to create a smaller BAM file that can be used instead. I don't know how much of an improvement it will make in practice, but it should help.

So I think the original issue here is solved. I just wanted to follow up and ask about the assay_type parameter. Is it fair to use "atac" mode for other peak-based data? If not, what should differ? Thanks!

Sorry for the late reply. I was going to say you have to split the master VCF into chunks (e.g., 10 Mb each) to reduce memory usage.
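As a minimal sketch of the chunking, the Python snippet below only generates the region strings for fixed-size windows along one chromosome. The function name and the example chromosome length are hypothetical; real lengths would come from the VCF header or a reference index.

```python
def chunk_regions(chrom, length, chunk_size=10_000_000):
    """Return 1-based, inclusive region strings covering chrom in fixed-size chunks."""
    regions = []
    start = 1
    while start <= length:
        end = min(start + chunk_size - 1, length)  # last chunk may be shorter
        regions.append(f"{chrom}:{start}-{end}")
        start = end + 1
    return regions


# Example with a made-up chromosome length of 25 Mb.
for region in chunk_regions("chr21", 25_000_000):
    print(region)
```

Each region string can then be used to pull the matching slice out of a bgzipped, indexed master VCF (for example with `tabix -h master.vcf.gz chr21:1-10000000`, where `master.vcf.gz` is a placeholder name), and the counting step is run once per slice.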

You can use the 'atac' option for other peak-based data (such as ChIP-seq, DNase-seq, etc.). The difference between RNA-seq and ATAC-seq is the insert size threshold (RNA-seq paired-end reads easily span 10 kb or more if they are spliced).