thegenemyers/DAMASKER

Pipeline for DAMASKER

Closed this issue · 2 comments

Dear Eugene,
I would like to use the tools you describe on your DAZZLERBLOG for a tetraploid plant genome assembly project.

From reading your blog, it appears to me that the following 13 steps are involved:

Step 1. # dextract -o pacbio/*bam
Step 2. # fasta2DB plant *pacbio-reads.fasta
Step 3. # DBdust plant
Step 4. # DBsplit -x1000 plant
Step 5. # HPC.daligner plant -T 8  | bash -v
Step 6. # rm plant.*.plant.*.las
Step 7. # LAmerge plant.las plant.[0-9].las
Step 8. # DASqv -c50 plant plant.las
Step 9. # HPC.TANmask plant
Step 10. # HPC.REPmask -g1 -c20 -mtan plant
Step 11. # HPC.REPmask -g10 -c15 -mtan -mrep1 plant
Step 12. # HPC.REPmask -g100 -c10 -mtan -mrep1 -mrep10 plant
Step 13. # HPC.daligner -mtan -mrep1 -mrep10 -mrep100 plant

Do you think the above commands are correct, and could the output from the plant DB be used as input for the Racon pipeline (https://github.com/isovic/racon)?

Thank you in advance

Michal

Hi Gene,
Thank you. I changed the pipeline to the following steps, which are generated by this script:

Creating database

source activate thegenemyers
find /work/waterhouse_team/All_RawData/Each_Cell_Raw/ -name "*.arrow" -type f > fasta2DB_input.fofn
sed -i.bak 's|\.arrow$|.fasta|' fasta2DB_input.fofn
fasta2DB DB -ffasta2DB_input.fofn

DBsplit -x500 -s250 DB
DBdust DB
Catrack -v DB dust
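
Because the .fofn above is built by renaming every .arrow path to .fasta, fasta2DB will only work if each of those .fasta files really exists. A minimal sketch of a check that could run before fasta2DB; the function name check_fofn is made up for illustration and is not part of DAZZ_DB:

```shell
# Hypothetical helper (not part of DAZZ_DB): report every path listed in a
# file-of-filenames that is missing or empty, and return non-zero if any is.
check_fofn() {
    fofn=$1
    missing=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue            # skip blank lines
        if [ ! -s "$path" ]; then             # -s: exists and has size > 0
            printf 'missing or empty: %s\n' "$path" >&2
            missing=$((missing + 1))
        fi
    done < "$fofn"
    [ "$missing" -eq 0 ]
}
```

It could then guard the import, for example: check_fofn fasta2DB_input.fofn && fasta2DB DB -ffasta2DB_input.fofn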

HPC.TANmask

source activate thegenemyers
HPC.TANmask DB -mdust -T4 -fTANmask
sh HPC.parallel_pbs.sh TANmask.01.OVL             #MEM:9GB; CPU time:00:06:25
sh HPC.parallel_pbs.sh TANmask.02.CHECK.OPT       #MEM:0.3GB; CPU time:00:00:02
sh HPC.parallel_pbs.sh TANmask.03.MASK            #MEM:0.4GB; CPU time:00:00:01  
sh TANmask.04.RM
qsub catrackTAN_pbs.sh
#      Catrack -v DB tan
#      rm .DB.*.tan.*
PBS Job 2678948.pbs
CPU time  : 00:00:01
Wall time : 00:00:11
Mem usage : 8164kb

HPC.REPmask

source activate thegenemyers
HPC.REPmask -g1 -c20 -mdust -mtan DB -T4 -fREPmask
sh HPC.parallel_pbs.sh REPmask.01.OVL           #MEM:30GB; CPU time:02:00:01
sh HPC.parallel_pbs.sh REPmask.02.CHECK.OPT     #MEM:1.4GB; CPU time:00:00:03
sh HPC.parallel_pbs.sh REPmask.03.MASK          #MEM:0.01GB; CPU time:00:00:06
sh REPmask.04.RM
Catrack -v DB rep1
rm .DB.*.rep1.*

HPC.daligner

source activate thegenemyers
DBstats -b1 -mdust -mtan -mrep1 DB > DBstats.out

/work/waterhouse_team/apps/bin> python calc_cutoff.py --genome_size 1800000000 --coverage 38 --db_stats /work/waterhouse_team/banana/assembly/DBstats.out
6973
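
calc_cutoff.py itself is not shown here. Assuming it picks the read length at which the longest reads together reach genome_size × coverage bases, the idea can be sketched as below. This sketch does not parse DBstats output: it expects one read length per line on stdin, which is an assumed input format, and length_cutoff is a made-up name.

```shell
# Hypothetical helper: read lengths arrive on stdin, one integer per line.
# Prints the largest length L such that all reads of length >= L together
# contain at least genome_size * coverage bases; fails if the data set
# is too small to reach that total.
length_cutoff() {
    genome_size=$1
    coverage=$2
    sort -rn | awk -v target="$((genome_size * coverage))" '
        # Walk reads from longest to shortest, accumulating bases until
        # the target is reached; remember the length where that happens.
        !found {
            total += $1
            if (total >= target) { cutoff = $1; found = 1 }
        }
        END {
            if (found) print cutoff
            else { print "not enough bases for target" > "/dev/stderr"; exit 1 }
        }
    '
}
```

With the numbers used above, the call would look like length_cutoff 1800000000 38 < read_lengths.txt, where read_lengths.txt is a placeholder file name.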

HPC.daligner -mdust -mtan -mrep1 -H6973 -T4 -fdaligner DB   
sh HPC.parallel_pbs.sh daligner.01.OVL
sh HPC.parallel_pbs.sh daligner.02.CHECK.OPT  
sh HPC.parallel_pbs.sh daligner.03.MERGE  
sh HPC.parallel_pbs.sh daligner.04.CHECK.OPT  
sh HPC.parallel_pbs.sh daligner.05.RM.OPT  
sh HPC.parallel_pbs.sh daligner.06.MERGE  
sh HPC.parallel_pbs.sh daligner.07.CHECK.OPT  
sh daligner.08.RM
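
Each section above runs the numbered stage scripts by hand, one after another. A minimal sketch of a loop that runs them in order and stops at the first failure, so that a later MERGE or RM stage never runs on incomplete output. run_stages is a made-up name, and the sketch assumes the runner command only returns once the stage has actually finished, which may not hold for a wrapper that merely submits PBS jobs:

```shell
# Hypothetical helper: run every stage script named <prefix>.NN.* in the
# current directory in numeric order, stopping at the first failure.
# Usage: run_stages <prefix> <runner command...>
run_stages() {
    prefix=$1
    shift
    for script in "$prefix".[0-9][0-9].*; do
        # If the glob matched nothing, the pattern is left unexpanded.
        [ -e "$script" ] || { echo "no stage scripts for $prefix" >&2; return 1; }
        echo "running $script" >&2
        "$@" "$script" || { echo "stage failed: $script" >&2; return 1; }
    done
}
```

For example, run_stages TANmask sh would run TANmask.01.OVL through TANmask.04.RM sequentially on the local machine.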