batch_4DTV_calculation

Calculate 4DTV (the transversion rate at fourfold degenerate sites) faster by processing many axt files in one batch.

This script is based on "calculate_4DTV_correction.pl".

The original script calculates 4DTV for only one pair of sequences at a time, read from a single axt file. In general, there is nothing wrong with this approach. With a large number of sequence pairs, however, running it once per pair slows the analysis down and makes the process more likely to crash.

Merge axt files

If you have many axt files:

example1.axt

seq1-seq2
ATGTCTCATATGTCTTCTGTGAACGCGAAAAATCTTCAAAAACTAGCAGATTCAATTGTC
AAACATGTAAAGCACTTTAACAATAATGAAGTTTTGTGTCTGATCAAACTCTTCAATGTG
CTGATGGGAGAGCAGAGCGAGCACAGGGTTGGAAATGGACTGGATCGTGGTAAATTCAGG
AGCATCCTCCACAACACATTTGGAATGACAGATGACATGATTATGGACAGAGTCTTCCGT
GCATTTGACAAGGACAATGATAGCAACGTCAGTGTAAAAGAATGGATAGAAGGACTTTCA
GTGTTTCTGCGAGGGACCTTGgatgaaaaaattaaatATTGTTTTGAGGTTTATGACTTA
AATGGGGATGGATATATTTCACGAGAAGAGATGTTCCACATGCTGAAAAACAGTCTCATA
AAACAACCAACAGAAGAAGATCCAGATGAAGGGATTAAGGACTTGGTAGAGATTACTCTT
AAAAAGATGgaCCACGATCACGACAGCAGACTTTCATACGCTGATTTTGAGAAAGCAGTA
AAAGAAGAAAATCTCTTGCTTGAGGCTTTTGGAGCTTGTCTTCCTGATGCAAAGagtaTT
CTTGCTTTTGAAAGACAGGCCTTCCAG------GATACCACAGAAAAT
atgctgaaaatgTCGGCGATGAACAGAAAATTAATTCAAAACCTCGCCGAGACTTTATGC
AGACAAGTCAAACATTTTAATAAAACAGAGACGGAGTGTCTGATAAGGCTGTTCAACAGT
CTGCTGGGAGAGCAGGCAGAGAGAAAGACGACTATTGGAGTGGACCGGGCCAAATTCAGA
AATATACTGCACCACACTTTCGGGATGACCGACGACATGATGACGGACAGAGTTTGTCGT
GTCATTGACAAGGACAACGATGGCTACTTAAGCGTTAAAGAGTGGGTTGAggctctgtct
gtctttctaagAGGCACACTGGATGAAAAAATGAAATaCTGTTTTGAGGTGTATGACCTG
AACGGGGATGGATACATCTCACGTGAGGAGATGTTTCAGATGCTGAAAGACAGCCTCATC
AGGCAGCCCACCGAAGAGGATCCTGATGAGGGGATTAAGGATATTGTGGAGATTGCCTTG
AAAAAAATGGATTATGACCATGATGGAAGAGTTTCTTATGCTGATTTTGAGAAGACGGTC
ATGGATGAAAACCTTTTACTAGAAGCTTTTGGAAACTGCCTTCCTGATGCAAAGAGTGTA
CTAGCATTTGAGCAACAGGCATTCCAGAAACACGAACACTGCAAAGAA

example2.axt

seq3-seq4
ATGGATCGCCATTCCAAtttaatttccatttggctgcaACTGGAACTGTGTGCCATGGCA
GTACTTCTGGCAAAAGGGGAGATAAGATGCTACTGTGATGCAGCGCATTGTGTGGCAACA
GGTTACATGTGTAAATCCGAGTTAAATGCCTGCTTCACCAGGCTTCTGGACCCACAGAAC
ACAAACTCCCCTCTCACGCATGGCTGCTTGGACCCGACTGCAAACACAGCAGATGTTTGC
CATGCTGGAAGGACAGAGAGCCGCGCTGGGGCCTCGGAGAAGCTTGAGTGCTGTCACGAC
GATATGTGCAATTACAGAGGACTCCATGATGTTGTTTCATATCCCAGGGGGGACAGCTCA
GATCATGGAACAAGATATCAGCCAGACAGTAGCAGGAATCTTCTGACCAGGGTTCAGGAT
TTAACATCCTCTAAAGAGCTGTGGTTCAGAGCAGCCGTGATCGCTGTGCCCATCGCTGGG
GGGCTCATTCTAGTGCTTCTCATCATGCTCGCCTTGCGGATGCTTCGAAGTGAAAACAAA
AGACTGCAGGACCAGAGGCAGCAGATGCTGTCCCGCTTGCACTACAACTTTCATGGA---
CACCACACGAAGAAGGGCCAGGTAGCCAAACTGGATTTGGAATGCATGGTTCCCGTAACC
GGACACGAGAACTGCTGTATGACTTGCGACAAACTGCGACAGTCTGAACTCCACAAT---
---------------GATAAATTGCTGTCTTTAGTTCACTGGGGAATTTACAGCGGTCAC
GGGAAATTGGAATttgta
ATGGATCGC---------CTGGTTTCTCTGTGGTTTCAGCTGGAACTTTGTGCGATGGCT
GTTCTTCTCACGAAAGGAGAGATCAGGTGCTACTGTGACGCACCGCACTGCGTTGCCACC
GGATACATGTGTAAATCAGAGCTCAACGCTTGCTTTACTAAGGTCCTGGACCCTCTTAAC
ACAAACTCACCTTTAACACACGGCTGCGTGGATTCGCTTTTAAACTCTGCAGACGTGTGC
TCTAGTAAAAATGTGGACATTTCAAGTGGAAGCTCCTCTCCTGTGGAGTGCTGCCATGAT
GATATGTGTAACTACAGGGGTTTGCATGAC---CTCACACACCCCAGAGGGGACTCAACA
GAC---------CGATACCACAGC---TCCAATCAGAACCTGATCACAAGGGTGCAAGAG
TTAGCGTCTGCTAAAGAGGTGTGGTTCCGGGCGGCGGTGATAGCGGTTCCCATCGCGGGT
GGGCTTATCCTGGTTCTGCTGATTATGCTGGCGTTGCGAATGCTCCGTAGCGAAAACAAG
CGTCTCCAGGCACAGCGCCAGCAGATGCTTTCTCGCCTGCATTACAGCTTTCACGGACAC
CACCATGCCAAGAAAGGCCACGTGGCTAAGTTGGACTTGGAGTGTATGGTGCCGGTAACG
GGACATGAGAACTGTTGTCTGGGCTGCGATAAGCTGCGGCAGACGGATTTGTGCACTGGA
GGAGGAAGCGGGGGTGAGCGTCTCCTATCTCTGGTACACTGGGGGATGTACACGGGGCAC
GGAAAGCTGGAGTTCGTA

...

To calculate 4DTV in batch, merge the axt files into a single file (the AXT file) with a shell script. Note: in the merged AXT file, each sequence occupies a single line with no line breaks.

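# Start from an empty output file.
> Merged.AXT
for file in *.axt; do
  # Count non-empty lines: 1 header plus N lines for each of the two sequences.
  total=$(sed '/^$/d' "$file" | wc -l)
  if [ "$total" -ge 3 ]; then
    # Line number on which the first sequence ends.
    Ln=$((total / 2 + 1))
    # Mark the end of the header and of each sequence with "%", remove all
    # line breaks, then turn each "%" into a newline (requires GNU sed).
    sed '/^$/d' "$file" | sed -e '1 a \%' -e "$Ln a \\%" -e '$ a \%' | tr -d '\n' | tr '%' '\n' >> Merged.AXT
  fi
done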

Merged.AXT

seq1-seq2
ATGTCTCATATGTCTTCTGTGAACGCGAAAAATCTTCAAAAACTAGCAGATTCAATTGTCAAACATGTAAAGCACTTTAACAATAATGAAGTTTTGTGTCTGATCAAACTCTTCAATGTGCTGATGGGAGAGCAGAGCGAGCACAGGGTTGGAAATGGACTGGATCGTGGTAAATTCAGGAGCATCCTCCACAACACATTTGGAATGACAGATGACATGATTATGGACAGAGTCTTCCGTGCATTTGACAAGGACAATGATAGCAACGTCAGTGTAAAAGAATGGATAGAAGGACTTTCAGTGTTTCTGCGAGGGACCTTGgatgaaaaaattaaatATTGTTTTGAGGTTTATGACTTAAATGGGGATGGATATATTTCACGAGAAGAGATGTTCCACATGCTGAAAAACAGTCTCATAAAACAACCAACAGAAGAAGATCCAGATGAAGGGATTAAGGACTTGGTAGAGATTACTCTTAAAAAGATGgaCCACGATCACGACAGCAGACTTTCATACGCTGATTTTGAGAAAGCAGTAAAAGAAGAAAATCTCTTGCTTGAGGCTTTTGGAGCTTGTCTTCCTGATGCAAAGagtaTTCTTGCTTTTGAAAGACAGGCCTTCCAG------GATACCACAGAAAAT
atgctgaaaatgTCGGCGATGAACAGAAAATTAATTCAAAACCTCGCCGAGACTTTATGCAGACAAGTCAAACATTTTAATAAAACAGAGACGGAGTGTCTGATAAGGCTGTTCAACAGTCTGCTGGGAGAGCAGGCAGAGAGAAAGACGACTATTGGAGTGGACCGGGCCAAATTCAGAAATATACTGCACCACACTTTCGGGATGACCGACGACATGATGACGGACAGAGTTTGTCGTGTCATTGACAAGGACAACGATGGCTACTTAAGCGTTAAAGAGTGGGTTGAggctctgtctgtctttctaagAGGCACACTGGATGAAAAAATGAAATaCTGTTTTGAGGTGTATGACCTGAACGGGGATGGATACATCTCACGTGAGGAGATGTTTCAGATGCTGAAAGACAGCCTCATCAGGCAGCCCACCGAAGAGGATCCTGATGAGGGGATTAAGGATATTGTGGAGATTGCCTTGAAAAAAATGGATTATGACCATGATGGAAGAGTTTCTTATGCTGATTTTGAGAAGACGGTCATGGATGAAAACCTTTTACTAGAAGCTTTTGGAAACTGCCTTCCTGATGCAAAGAGTGTACTAGCATTTGAGCAACAGGCATTCCAGAAACACGAACACTGCAAAGAA
seq3-seq4
ATGGATCGCCATTCCAAtttaatttccatttggctgcaACTGGAACTGTGTGCCATGGCAGTACTTCTGGCAAAAGGGGAGATAAGATGCTACTGTGATGCAGCGCATTGTGTGGCAACAGGTTACATGTGTAAATCCGAGTTAAATGCCTGCTTCACCAGGCTTCTGGACCCACAGAACACAAACTCCCCTCTCACGCATGGCTGCTTGGACCCGACTGCAAACACAGCAGATGTTTGCCATGCTGGAAGGACAGAGAGCCGCGCTGGGGCCTCGGAGAAGCTTGAGTGCTGTCACGACGATATGTGCAATTACAGAGGACTCCATGATGTTGTTTCATATCCCAGGGGGGACAGCTCAGATCATGGAACAAGATATCAGCCAGACAGTAGCAGGAATCTTCTGACCAGGGTTCAGGATTTAACATCCTCTAAAGAGCTGTGGTTCAGAGCAGCCGTGATCGCTGTGCCCATCGCTGGGGGGCTCATTCTAGTGCTTCTCATCATGCTCGCCTTGCGGATGCTTCGAAGTGAAAACAAAAGACTGCAGGACCAGAGGCAGCAGATGCTGTCCCGCTTGCACTACAACTTTCATGGA---CACCACACGAAGAAGGGCCAGGTAGCCAAACTGGATTTGGAATGCATGGTTCCCGTAACCGGACACGAGAACTGCTGTATGACTTGCGACAAACTGCGACAGTCTGAACTCCACAAT------------------GATAAATTGCTGTCTTTAGTTCACTGGGGAATTTACAGCGGTCACGGGAAATTGGAATttgta
ATGGATCGC---------CTGGTTTCTCTGTGGTTTCAGCTGGAACTTTGTGCGATGGCTGTTCTTCTCACGAAAGGAGAGATCAGGTGCTACTGTGACGCACCGCACTGCGTTGCCACCGGATACATGTGTAAATCAGAGCTCAACGCTTGCTTTACTAAGGTCCTGGACCCTCTTAACACAAACTCACCTTTAACACACGGCTGCGTGGATTCGCTTTTAAACTCTGCAGACGTGTGCTCTAGTAAAAATGTGGACATTTCAAGTGGAAGCTCCTCTCCTGTGGAGTGCTGCCATGATGATATGTGTAACTACAGGGGTTTGCATGAC---CTCACACACCCCAGAGGGGACTCAACAGAC---------CGATACCACAGC---TCCAATCAGAACCTGATCACAAGGGTGCAAGAGTTAGCGTCTGCTAAAGAGGTGTGGTTCCGGGCGGCGGTGATAGCGGTTCCCATCGCGGGTGGGCTTATCCTGGTTCTGCTGATTATGCTGGCGTTGCGAATGCTCCGTAGCGAAAACAAGCGTCTCCAGGCACAGCGCCAGCAGATGCTTTCTCGCCTGCATTACAGCTTTCACGGACACCACCATGCCAAGAAAGGCCACGTGGCTAAGTTGGACTTGGAGTGTATGGTGCCGGTAACGGGACATGAGAACTGTTGTCTGGGCTGCGATAAGCTGCGGCAGACGGATTTGTGCACTGGAGGAGGAAGCGGGGGTGAGCGTCTCCTATCTCTGGTACACTGGGGGATGTACACGGGGCACGGAAAGCTGGAGTTCGTA
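
Before running the calculation, it may help to confirm that the merged file has the layout shown above: three lines per record (header, first sequence, second sequence), with both aligned sequences of equal length. The following is a minimal sketch of such a check. The function name check_merged_axt is hypothetical and is not part of this repository; it assumes the sequences contain only letters and "-" gap characters.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): report malformed
# records in a merged file laid out as header, sequence 1, sequence 2.
check_merged_axt() {
  awk '
    NR % 3 == 1 { name = $0; next }     # header line
    NR % 3 == 2 { first = $0; next }    # first aligned sequence
    {
      # Aligned sequences must have equal length, gaps included.
      if (length(first) != length($0))
        printf "%s: length %d vs %d\n", name, length(first), length($0)
      # Assumption: only letters and "-" gap characters are valid.
      if ((first $0) ~ /[^A-Za-z-]/)
        printf "%s: unexpected character\n", name
    }
    END {
      # A complete file holds a multiple of three lines.
      if (NR % 3 != 0)
        printf "incomplete record at end of file (%d lines)\n", NR
    }
  ' "$1"
}
```

Calling check_merged_axt Merged.AXT prints one line per problem found and prints nothing when every record passes.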

Batch 4DTV calculation

batch_4DTV_calculation.pl Merged.AXT > Merged.4DTV

The 4DTV results are written to Merged.4DTV.
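
For a rough sense of what the value represents, the sketch below computes an uncorrected 4DTV for each record of Merged.AXT: the number of transversions at fourfold degenerate third codon positions, divided by the number of such sites compared. It is not taken from batch_4DTV_calculation.pl, and the function name raw_4dtv is hypothetical. It assumes the standard genetic code, the three-line record layout of Merged.AXT, and sequences aligned in frame from the first column. Values reported by the script may differ, for instance if it applies a correction.

```shell
#!/bin/sh
# Hypothetical sketch (not the repository's method): uncorrected 4DTV for
# each record of a merged file laid out as header, sequence 1, sequence 2.
raw_4dtv() {
  awk '
    BEGIN {
      # Codon prefixes whose third position is fourfold degenerate
      # in the standard genetic code (assumption).
      n = split("CT GT TC CC AC GC CG GG", p, " ")
      for (i = 1; i <= n; i++) fourfold[p[i]] = 1
      purine["A"] = 1; purine["G"] = 1
    }
    NR % 3 == 1 { name = $0; next }
    NR % 3 == 2 { a = toupper($0); next }
    {
      b = toupper($0); sites = 0; tv = 0
      # Walk the alignment codon by codon, assuming frame starts at column 1.
      for (i = 1; i + 2 <= length(a) && i + 2 <= length(b); i += 3) {
        pa = substr(a, i, 2); pb = substr(b, i, 2)
        xa = substr(a, i + 2, 1); xb = substr(b, i + 2, 1)
        # Both codons must share the same fourfold degenerate prefix.
        if (pa != pb || !(pa in fourfold)) continue
        # Skip gaps and ambiguous bases at the third position.
        if (xa !~ /[ACGT]/ || xb !~ /[ACGT]/) continue
        sites++
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if ((xa in purine) != (xb in purine)) tv++
      }
      if (sites > 0)
        printf "%s\t%d\t%d\t%.4f\n", name, sites, tv, tv / sites
      else
        printf "%s\t%d\t%d\tNA\n", name, sites, tv
    }
  ' "$1"
}
```

Calling raw_4dtv Merged.AXT prints one tab-separated line per record: the header, the number of fourfold degenerate sites compared, the number of transversions, and their ratio.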