This repository accompanies the paper *Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate*.
All figures and tables were created with the scripts present in this repository. Supplementary Figures/Tables derived from the main Figures/Tables are not described in detail in this README, as they were generated with the same scripts, usually by filtering the inputs.
Please check that your version of this repository is up to date (https://bitbucket.org/makova-lab/kinetics_wmm)
- Version 1.0.0
Samtools: Version 1.3.1 (using htslib 1.3.1)
Bedtools: Version v2.26.0
Bedops: Version 2.4.20
PacificBiosciences SMRT-Link : smrtanalysis-2.3.0
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
Naive Variant Caller, from modified version at tools-blankenberg/tools/naive_variant_caller/tools/naive_variant_caller.py
Download files from the portal URL:
https://nonb-abcc.ncifcrf.gov/apps/Query-GFF/feature/
Enter following options:
Species: Human
Classes: NonBDAS
Search by Type: Chromosome
Chromosome: 1 to 22 (one chromosome at a time only)
Query Type: all features for the region
Features Types to Retrieve: choose all but Short Tandem Repeats
Microsatellite annotation files were obtained through STR-FM (see Fungtammasan, A. et al. Accurate typing of short tandem repeats from genome-wide sequencing data and its applications. Genome Res. 2015).
Download files from URL:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Download files from URL (chromosomes 1 to 22):
http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/multiz100way/maf/
The .mf file describes the intervals of interest, i.e., intervals overlapping non-B DNA annotations. This file contains either feature-restricted intervals or full 100bp windows. This tab-delimited file contains 4 required columns, followed by a variable number of IPD columns:
- chromosome
- start in 1-based format
- end in 1-based format
- feature length: the number of nucleotides in the feature; note that this field is present in both feature-restricted and 100bp-window files. Occasionally, a feature can be longer than 100bp; in such cases, we trim it down to 100bp and restrict our analysis to the trimmed window.
- IPD values: a variable number of columns containing IPD values from ipdSummary; each column represents one nucleotide of the feature or 100bp window, so at most 100 IPD values are present
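A minimal sketch of how one .mf record can be parsed, assuming the column layout above. The helper `parse_mf_line` is hypothetical and not part of this repository.

```python
# Hypothetical helper (not part of the repository) showing how one .mf
# record can be read. Columns: chromosome, start, end, feature length,
# then up to 100 IPD values; 'nan' marks positions without an IPD estimate.
def parse_mf_line(line):
    fields = line.rstrip("\n").split("\t")
    feature_length = min(int(fields[3]), 100)  # features > 100bp are trimmed
    ipds = [None if v == "nan" else float(v) for v in fields[4:]]
    return {"chrom": fields[0], "start": int(fields[1]), "end": int(fields[2]),
            "feature_length": feature_length, "ipds": ipds}

rec = parse_mf_line("chr1\t100\t199\t12\t0.51\tnan\t0.83")
```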
The .collapsed file describes the final error rates in the given intervals. It contains 7 columns:
- chromosome
- start in 1-based format
- end in 1-based format
- total rate, as a fraction in [0,1]
- mismatch rate, as a fraction in [0,1]
- insertion rate, as a fraction in [0,1]
- deletion rate, as a fraction in [0,1]
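For reference, a hypothetical sanity check on a .collapsed record (not a script from this repository): the four rate columns must all be fractions in [0, 1].

```python
# Hypothetical sanity check (not from the repository) for one .collapsed
# record: the four rate columns must all be fractions in [0, 1].
def check_collapsed_line(line):
    fields = line.rstrip("\n").split("\t")
    rates = [float(v) for v in fields[3:7]]  # total, mismatch, insertion, deletion
    assert all(0.0 <= r <= 1.0 for r in rates), "rate out of [0, 1] range"
    return rates

rates = check_collapsed_line("chr1\t100\t199\t0.05\t0.02\t0.02\t0.01")
```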
-
Download and extract the data
-
Recover *.bax.h5
-
Aligning:
pbalign --nproc 8 $inp ./hg19.fa ${inp}.cmp.h5 --forQuiver --metrics IPD
nproc: number of processors to use; depends on your machine
-
Merge alignment into one file:
find ./ -name "*cmp.h5" -type f -exec cmph5tools.py merge --outFile out_all.cmp.h5 {} +
-
Sanity checks:
use coverage.py (see explanation inside the script)
cmph5tools.py summarize out_all.cmp.h5
-
Deep sort alignment (optional?):
cmph5tools.py sort --deep out_all.cmp.h5
-
Separate alignment in chromosomes:
cmph5tools.py select --where "(Reference == 'chr${inp}')" --outFile chr${inp}.cmp.h5 out_all.cmp.h5
-
Compute IPDs:
ipdSummary.py $inp --reference ../hg19.fa --outfile ${inp}.idp
cat *pickle > 52X.pickle
python cleanIPD.py 52X.pickle > 52XIPD
-
Compute Depth:
python cleanDepth.py 52X.pickle > 52XDepth
-
Download non-B DNA annotation
-
Filter G4 into G4Plus and G4Minus and format:
grep '+' G_Quadruplex_Motifs | cut -f 1-6 > G4Plus
-
Format unstranded motifs:
cut -f 1-6 Direct_Repeats > DirectRepeats
-
Format Microsat Annotation:
python format.py all.per1.TRs.bed.10 Mono
python line_up_microsats.py Mono > Mono.aligned
python parse_repeats.py Mono.aligned
for a in *n; do cut -f1,2,3,4,5 $a > ../GFF/$a; done;
Repeat for Di, Tri and Tetra
-
Prepare windows for study: #we still have an issue of features bigger than windows overlapping with surrounding windows
python ../prepare_windows.py ../GFF/ > Windows_Ready
-
Sort windows: #/!\ This step is error prone /!\ Make sure sorting was properly performed /!\
cat Windows_Ready | env LC_ALL=C sort -k 1,1d -k 2,2n > Windows_Sorted
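Since this step is flagged as error prone, one way to verify the result is `sort -c`, which exits non-zero at the first out-of-order line. A self-contained sketch with demo data (the real input is Windows_Ready); the check must run under the same LC_ALL=C locale as the sort itself:

```shell
# Demo data standing in for Windows_Ready (three unsorted window rows):
printf 'chr2\t50\nchr1\t300\nchr1\t7\n' > Windows_Ready
# Sort as in the step above, then verify with sort -c, which exits non-zero
# at the first out-of-order line; the check must use the same LC_ALL=C locale.
cat Windows_Ready | env LC_ALL=C sort -k 1,1d -k 2,2n > Windows_Sorted
env LC_ALL=C sort -c -k 1,1d -k 2,2n Windows_Sorted && echo "sorted OK"
```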
-
Collect IPD values in windows: #adjust collect_values_in_windows.py for the forward (F) or reverse (R) strand before each run
python ../collect_values_in_windows.py Windows_Sorted 52XIPD > Windows_Collected_F
python ../collect_values_in_windows.py Windows_Sorted 52XIPD > Windows_Collected_R
-
Split by Features: #/!\ The files created are .mf files; the extension must be added manually. /!\ Windows with no IPD values are not filtered out in this step.
python ../../split_by_feature.py Windows_Collected_F
python ../../split_by_feature.py Windows_Collected_R
-
Filter out windows with no IPD and create feature length files (repeat on all files generated at step 8):
FILES=./*
for file in $FILES
do
awk 'BEGIN {OFS=FS="\t"} {$1="chr"$1; if(gsub(/nan/,"NA")<100) print}' $file > $file"_filtered"
awk 'BEGIN {OFS=FS="\t"} {print $1, $2, $3, $4}' $file"_filtered" > $file"_filtered_length"
sed -i -r 's/(\s+)?\S+//4' $file"_filtered"
done
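The first awk command keeps only windows with at least one usable IPD value: awk's gsub() returns the number of substitutions it made, so a window where all 100 positions are 'nan' is dropped. A reference rendering of the same logic in Python (not a script from this repository):

```python
# Reference rendering (not from the repo) of the awk filter above: prefix
# "chr" to the chromosome, rewrite 'nan' as 'NA', and keep the window only
# if fewer than 100 of its values were missing (gsub() returns the number
# of substitutions it made, hence awk's "< 100" test).
def filter_window(line):
    fields = line.rstrip("\n").split("\t")
    fields[0] = "chr" + fields[0]
    n_missing = sum(1 for v in fields if v == "nan")
    fields = ["NA" if v == "nan" else v for v in fields]
    return "\t".join(fields) if n_missing < 100 else None
```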
-
IPD data are now ready for IWT. Redo steps 7, 8 and 9 with 52XDepth for IWT on Depth.
-
Loading data in R in IWTomics format and creating subsamples for test (see detailed comments inside R script):
load_IPD_data.r
-
Perform IWTomics test and plot results (see detailed comments inside R script):
IWT_IPD.r
-
IPD analysis and IWT test for motifs with different length (see detailed comments inside R script):
feature_length_IPD.r
-
Obtain coordinates of windows:
python bedformatting.py Windows_Collected_F
-
Get sequence inside each window:
bedtools getfasta -s -fi hg19.fa -bed Windows_Collected_F.bed > getFastaF
-
Get sequence composition of each window:
python compo.py Windows_Collected_F.bed getFastaF > Windows_Collected_F.compo
-
Split by Features:
python split_by_feature.py Windows_Collected_F.compo
-
Filter out windows with no IPD and create composition files (repeat on all files generated at step 4):
FILES=./*
for file in $FILES
do
awk 'BEGIN {OFS=FS="\t"} {gsub(/[A]/,"",$1); gsub(/[TCG]/,"\t",$1); $3=$3+1; print $2, $3, $4, $1}' $file > $file"_composition"
intersectBed -wa -f 1 -a $file"_composition" -b $file"_filtered" > "../forward/"$file"_filtered_composition"
awk 'BEGIN {OFS=FS="\t"} {gsub(/[A]/,"",$1); gsub(/[TCG][TCG]/,"\t",$1); gsub(/[TCG]/,"\t",$1); $3=$3+1; print $2, $3, $4, $1}' $file > $file"_composition_di"
intersectBed -wa -f 1 -a $file"_composition_di" -b $file"_filtered" > "../forward/"$file"_filtered_composition_di"
done
-
Redo steps 1 to 5 with Windows_Collected_R
-
Perform sequence composition analysis (see detailed comments inside R script):
sequence_composition_IPD.r
-
Get the 10 most common G4 motifs and collect IPD values. Use the following line to retrieve the sequences of interest.
awk '{if ($4 == "GGGTGGAGGGTGGGAGGAGGG") print $0}' Windows_Collected_F >> intermolecular
-
Analyze delta-epsilon and Tm experimental results (see detailed comments inside R script):
thermostability.r
Requires: .mf files, a RepeatMasker file with the coordinates of repeats in the genome, and a file with high-quality calls contrasting the sequenced individual with the reference.
-
Generate a corresponding .gff file for each .mf file. This conversion is needed because .gff is a suitable format for intersection with other datasets.
mf2gff.sh folder_with_mf_files
This script takes the name of the folder with .mf files as its single parameter. It will generate a corresponding .gff file for each .mf file.
-
Filter out windows according to certain criteria. For example, one might wish to remove windows overlapping with the known repeats in the human genome as identified by RepeatMasker. For the analysis of SMRT errors, it might also be desirable to remove the known variants in the analyzed human individual (HG002). This allows one to analyze sequencing errors without confounding them with the real variants by which the sequenced individual differs from the reference genome.
Create separate folders for RM (repeatmasked), trueVariants (personal variants) and joint (both).
joint_filtering.sh
(removes both repeats and variants; used for SMRT errors) This step generates new files with the suffixes *interspersed_containing.gff and *interspersed_filtered.gff (overlapping with repeats or not) and *varContaining.gff and *varFiltered.gff (overlapping with variants or not), as well as the prefix joint_ (both filters applied).
The following high-quality HG002 calls were used for the filtering step: HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz
The repeat track can be downloaded from http://genome.ucsc.edu/cgi-bin/hgTables as a bed file. We used human interspersed repeats for filtering and named the file Human_interspersed.
joint_filtering.sh calls both filter_out_repeatmasked.sh and filter_out_true_variants.sh; these can be also run independently.
filter_out_repeatmasked.sh
(removes only repeats; suitable for Diversity and Divergence)
filter_out_true_variants.sh
(removes only variants)
-
Use the new coordinates to re-create filtered .mf files. Run on the filtered .gff files; only the output of step 2 will work with the script. Requires gff2mf.py (or gff2mf_regVar.py). These scripts take two input parameters: the original .mf file (before filtering) and the .gff file with the coordinates of the new subset of windows.
To automatically create filtered .mf files for the folders trueVariants, RM and joint (which should contain the filtered .gff files), one can run:
convert_gff_to_mf.sh original_folder_with_mf_files
-
Keep only motifs and remove the flanking regions around motifs inside their 100bp windows. This step is optional, depending on whether you plan to analyze only features or full 100bp windows.
generateFeatures.py mf_file
will generate a new file that includes only the features. (Note that when an original motif is bigger than 100bp, only the 100bp centered around the motif are kept; this is only relevant for DirectRepeats, which can be very long.)
-
Generate 10 random controls for each motif.
generateControls.sh folder_with_mf_files
This script generates 10 sets of control datasets for the .mf files inside the provided folder. Output: 10 folders with matching controls for all the motifs.
The script requires a control file to be present. Typically, a control file contains a large number of windows; from this set, as many windows are used as are present in the motif file. For example, if a motif file contains 2,500 windows, a subset of 2,500 randomly selected control windows will be used. The control file is expected to be named Empty.mf, so it is important to make sure that only one control file is present in the folder before running this script. Since some variation in sampling is expected, 10 independent folders with controls are created.
Requires generateEmptyTrack.py (or generateEmptyTrack_regVar.py). Note that these scripts can be run independently by running:
python generateEmptyTrack.py mf_file control_file output_directory
where mf_file is the .mf file for which the control should be generated control_file is a large file with all the control windows (these would be regions of the genomes that do not contain non-B DNA) output_directory is an optional parameter for the location of an output
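The subsampling described above amounts to something like the following self-contained sketch (the file names and contents here are hypothetical demo stand-ins; the actual logic inside generateEmptyTrack.py may differ):

```shell
# Demo stand-ins: GQuadPlus.mf is a hypothetical motif file with 3 windows,
# Empty.mf the pool of motif-free control windows.
printf 'w1\nw2\nw3\n' > GQuadPlus.mf
printf 'c1\nc2\nc3\nc4\nc5\nc6\n' > Empty.mf
# Draw as many random control windows as there are motif windows (via shuf):
n=$(wc -l < GQuadPlus.mf)
shuf -n "$n" Empty.mf > GQuadPlus_control.mf
```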
Important note: the generated controls can come either from anywhere in the genome or from the close proximity of a motif (such as 0.5 Mb upstream or downstream), in order to account for regional variation. In the latter case, the scripts in the folder controls_regVar can be used. These scripts differ slightly from the scripts described above; for example, generateEmptyTrackRegVar.py instead of generateEmptyTrack.py, or gff2mf_regVar.py instead of gff2mf.py.
generateFeatures.py
This script restricts the intervals to features only.
Input: .mf file
Output: .mf file with FeatureOnly suffix

generateEmptyTrack.py
This script generates matching controls for each motif by subsampling from motif-free windows. Sampling is implemented with the command-line utility shuf.
Input: .mf file, optionally an output directory
Output: .mf file
Dependency: runErrorStatistics_optimized.R, reorder_pacbio.py
-
Compute errors.
generateErrors.sh
-
Merge and reorder. Since many windows are run in parallel, the results need to be merged and re-ordered to match the order of the original .mf files.
merge_and_reorder.sh
-
Concatenate the multiple alignments of all chromosomes:
cat chr*.maf > WG.maf
-
Filter by species of interest (homo - pongo - rhesus outgroup):
python filter_out_maf.py > WG.filtered.maf
-
Parse for SNPs:
mafparser.py WG.filtered.maf > WGSNP.gff
-
With usegalaxy.org, parse for INDELs:
- upload .mf files for features and controls
- use Extract MAF Blocks on .mf files
- use Fetch Indels on extracted MAF blocks
-
Format Indels in GFF and concatenate:
python parse_indels.py indels.fetched > WGINDELS.gff
cat WGSNP.gff WGINDELS.gff > Divergence.gff
-
Intersect variants with features coordinates - use .gff created in TABLE 1:
bedtools intersect -wa -wb -b Divergence.gff -a ${inp}.gff -loj > ${inp}.intersect
python parse_intersect.py ${inp}.intersect > ${inp}.collapsed
python reorder.py ${inp} ${inp}.collapsed
-
Compute number of variants inside features:
python rates.py ${inp}.intersect
-
Extract polymorphisms from 1000G VCF files, split by type and frequency range:
python filterVCFforMAF.py ALL.chr${X}.* chr${X}
-
Concatenate chromosomes:
cat highfreqsnpchr* > highfreqsnp.intervals
cat highfreqindelchr* > highfreqindel.intervals
cat lowfreqsnpchr* > lowfreqsnp.intervals
cat lowfreqindelchr* > lowfreqindel.intervals
-
With usegalaxy.org, get multiple alignments (pan, gorilla, pongo, nomascus around the indels):
- upload the indel.intervals files
- use Extract MAF Blocks
- use MAF to Intervals
-
Polarize indels:
python Join.py highfreqindel.maf.intervals highfreqindel.intervals
python Polaryze.py highfreqindel.maf.intervals.joined
-
Format output in GFF:
python Intervals_to_gff.py highfreqsnp.intervals > highfreqsnp.gff
python Polarize_to_gff.py highfreqindel.polarized > highfreqindel.gff
cat highfreqsnp.gff highfreqindel.gff > highfreq1kG.gff
-
Intersect polymorphisms with features coordinates - use .gff created in TABLE 1:
bedtools intersect -wa -wb -b highfreq.gff -a ${inp}.gff -loj > ${inp}.intersect
python parse_intersect.py ${inp}.intersect > ${inp}.collapsed
python reorder.py ${inp} ${inp}.collapsed
-
Compute number of polymorphisms inside features:
python rates.py ${inp}.intersect
-
Redo steps 3 to 7 for lowfreq (not presented in the manuscript)
Input needed: regression_composition_results_log.RData (from FIGURE 2)
-
Create the dataframe:
python MergeIPDvsErrors.py errors.collapsed .mf divergence.collapsed diversity.collapsed
-
Paste with composition file:
paste .compo _IPD_PbError_Div_1kG > _Compo_IPD_PbError_Div_1kG
-
Select compositions (repeat on all files generated at step 2):
FILES=./*
for file in $FILES
do
tr ' ' \\t < $file > $file"_tab"
awk 'BEGIN {OFS=FS="\t"} {gsub(/[A]/,"",$2); gsub(/[TCG]/,"\t",$2); gsub(/[N]/,"",$1); print $4, $5, $6, $7, $1, $2}' $file"_tab" > $file"_composition"
awk 'BEGIN {OFS=FS="\t"} {gsub(/[A]/,"",$3); gsub(/[TCG][TCG]/,"\t",$3); gsub(/[TCG]/,"\t",$3); gsub(/[N]/,"",$1); print $4, $5, $6, $7, $1, $3}' $file"_tab" > $file"_composition_di"
done
-
Compute residuals from composition regression:
compute_residuals.r
-
Analyze errors vs residual IPD:
mismatches_vs_residualIPD.r
Input needed: GQuadPlus (from FIGURE 2), Genome in a Bottle bam file (only chr21 reported in the manuscript)
-
Create the flanks:
python flanking.py
-
Find the position of read ends:
python readends.py
-
Intersect and return as tab-delimited file:
bedtools intersect -wa -wb -a GQuadPlus2kflanks.gff -b chr21Ends.gff > GQuadPlus2kflanks_chr21Ends.intersect
python Intersect2csv.py
-
Build the histograms:
Rscript plothistograms.r
Input needed: IPD_forward.RData (from FIGURE 2)
- Create autocorrelation plots and fast Fourier transform plots using the script:
IPD_periodicity.r
Input from Figure 4.
- All quantile calculations for Supplementary Note 4 & Table S9 can be found in:
quantiles.r
For this analysis, we retrieved information about the individual molecules during PacBio sequencing: we extracted the MoleculeId, TemplateStart, TemplateEnd, ReadStart, ReadEnd, and Strand fields using cmph5tools. Next, we intersected this information with the coordinates of our motifs, using the script
runPasses.sh bam_folder original_mf_gff_folder
where the first argument (bam_folder) stores the alignments to the human reference in cmp.h5 format, one file per chromosome. The second argument is a directory containing the .gff files to be used for the intersection.
Next, each pass of a molecule that intersects either the G-Quadruplexes or the control windows is analyzed using the following R script:
passes.Rnw
Note that this script uses the R libraries h5r and pbh5. While the pbh5 library is very useful for accessing IPD information from within an R script, it might be difficult to install due to its dependencies.
Finally, statistical analysis is performed using the script
IPD_different_passes.r