CRG-CNAG/CalliNGS-NF

The on-the-fly two-pass option could be used to avoid the genome regeneration step

kojix2 opened this issue · 3 comments

According to the STAR manual...

https://raw.githubusercontent.com/alexdobin/STAR/master/doc/STARmanual.pdf

8.3 2-pass mapping with re-generated genome.

This is the original 2-pass method which involves genome re-generation step in-between 1st and 2nd
passes. Since 2.4.1a, it is recommended to use the on the fly 2-pass options as described above.

In other words, the manual no longer recommends the genome re-generation approach.

8.1 Multi-sample 2-pass mapping.
For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.

  1. Run 1st mapping pass for all samples with "usual" parameters. Using annotations is recommended either at the genome generation step, or mapping step.
  2. Run 2nd mapping pass for all samples, listing SJ.out.tab files from all samples in --sjdbFileChrStartEnd /path/to/sj1.tab /path/to/sj2.tab ....
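
The two steps quoted above could look roughly like the sketch below. The sample names, FASTQ file naming, and the `pass1_`/`pass2_` output prefixes are hypothetical, and the first pass uses only a minimal set of options rather than the pipeline's real ones. The commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical inputs: adjust to the real index directory and sample names.
genomeDir="genome_index"
samples="rep1 rep2"
cpus=4

# Print the command by default; set DRY_RUN=0 to execute it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

multi_sample_two_pass() {
    sj_files=""

    # 1st pass: map every sample and remember where its SJ.out.tab ends up.
    # STAR writes <prefix>SJ.out.tab when --outFileNamePrefix is given.
    for s in $samples; do
        run STAR --genomeDir "$genomeDir" \
            --readFilesIn "${s}_1.fastq.gz" "${s}_2.fastq.gz" \
            --readFilesCommand zcat \
            --runThreadN "$cpus" \
            --outFileNamePrefix "pass1_${s}_"
        sj_files="$sj_files pass1_${s}_SJ.out.tab"
    done

    # 2nd pass: map every sample again, passing the junctions of ALL samples.
    # $sj_files is intentionally unquoted so each file is its own argument.
    for s in $samples; do
        run STAR --genomeDir "$genomeDir" \
            --readFilesIn "${s}_1.fastq.gz" "${s}_2.fastq.gz" \
            --readFilesCommand zcat \
            --runThreadN "$cpus" \
            --sjdbFileChrStartEnd $sj_files \
            --outFileNamePrefix "pass2_${s}_"
    done
}

multi_sample_two_pass
```

With the two placeholder samples this prints four STAR commands: two first-pass runs followed by two second-pass runs that each list both junction files.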

Honestly, I am not sure what 2-pass mapping is, but maybe the following script can be improved by omitting the genome re-generation.

CalliNGS-NF/modules.nf

Lines 113 to 142 in 6492702

# ngs-nf-dev Align reads to genome
STAR --genomeDir $genomeDir \
--readFilesIn $reads \
--runThreadN $task.cpus \
--readFilesCommand zcat \
--outFilterType BySJout \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999
# 2nd pass (improve alignments using table of splice junctions and create a new index)
mkdir genomeDir
STAR --runMode genomeGenerate \
--genomeDir genomeDir \
--genomeFastaFiles $genome \
--sjdbFileChrStartEnd SJ.out.tab \
--sjdbOverhang 75 \
--runThreadN $task.cpus
# Final read alignments
STAR --genomeDir genomeDir \
--readFilesIn $reads \
--runThreadN $task.cpus \
--readFilesCommand zcat \
--outFilterType BySJout \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outSAMtype BAM SortedByCoordinate \
--outSAMattrRGline ID:$replicateId LB:library PL:illumina PU:machine SM:GM12878

I leave this to @lucacozzuto

Hi, the original idea was to generate an index based on the annotation, align the reads, and discover new splice sites. These are then used to generate another (improved) index, and finally that index is used to align the reads. I think it is OK to change the code, since everything can now be done in a single step.
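
As a rough sketch of what that single step might look like, the three commands from modules.nf can be collapsed into one mapping call that adds STAR's `--twopassMode Basic` option. The variable values below are placeholders standing in for the Nextflow variables (`$genomeDir`, `$reads`, `$task.cpus`, `$replicateId`); this is not the pipeline's actual code. The command is printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder values standing in for the Nextflow variables used in modules.nf
# ($genomeDir, $reads, $task.cpus, $replicateId). They are assumptions.
genomeDir="genome_index"
reads="rep1_1.fastq.gz rep1_2.fastq.gz"
cpus=4
replicateId="rep1"

# Print the command by default; set DRY_RUN=0 to execute it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# One mapping call: --twopassMode Basic makes STAR run the 1st pass, insert the
# discovered junctions on the fly, and re-map the reads in the 2nd pass, so no
# separate genomeGenerate step is needed in between.
star_two_pass() {
    # $reads is intentionally unquoted so each mate file is its own argument.
    run STAR --genomeDir "$genomeDir" \
        --readFilesIn $reads \
        --runThreadN "$cpus" \
        --readFilesCommand zcat \
        --outFilterType BySJout \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1 \
        --outFilterMismatchNmax 999 \
        --twopassMode Basic \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattrRGline "ID:$replicateId" LB:library PL:illumina PU:machine SM:GM12878
}

star_two_pass
```

The `--sjdbOverhang 75` value from the old regeneration step is deliberately not carried over in this sketch; whether it should be passed at the mapping step depends on how the original index was built, so that is worth checking against the STAR manual before changing the module.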

Thanks!