- InSiTE is designed to analyze raw sequencing results of integration sites from viral or transposon systems.
- InSiTE provides analysis of integration trends and biases that may affect safety profiles, including counts of integration sites in annotated regions (e.g. intron/exon/transcript) and distances to specified features (e.g. transcriptional start sites).
python3 ./scripts/InSiTE.py -q ./raw_sequences.fastq -z --primer5 'ACTGACTG' --primer3 'GTCAGTCA'
- .fastq or .fasta (also .csv or .sam/.bam, but these may still have bugs). Specified by `-q`/`-a`/`-c`/`-s`.
- Sequence files can be provided with barcodes or primer sequences (`--barcode`, `--primer5`, `--primer3`). Paired reads can also be processed (`-p`).
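To illustrate what removing primer sequences from a read means, here is a minimal sketch that strips an exact 5' prefix and 3' suffix. The `trim_primers` helper is hypothetical and is not InSiTE's own trimming code, which may tolerate mismatches or rely on external tools.

```python
def trim_primers(read, primer5="", primer3=""):
    """Strip an exact 5' primer prefix and 3' primer suffix from a read.

    Hypothetical helper for illustration only; exact matching is an assumption.
    """
    if primer5 and read.startswith(primer5):
        read = read[len(primer5):]
    if primer3 and read.endswith(primer3):
        read = read[:-len(primer3)]
    return read


# Primer sequences taken from the example command; the read itself is made up.
trimmed = trim_primers("ACTGACTGTTTTCCCCGTCAGTCA", "ACTGACTG", "GTCAGTCA")
print(trimmed)
```

A read that does not start or end with the given primer is returned unchanged on that side.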
Reads can be compressed with `-z`. This merges reads mapping to the same position in the genome (+/- 1 nt) and counts the number of reads. The counts and percentages of the most frequent and the top 10 most frequent integration sites are reported in the log file, and a .bam file sorted by read counts is provided. Read counts are stored in the .bam file under the 'IH:i:' tag. If reads are compressed, all stats regarding mapping to genome features weight all identical IS reads (+/- 1 nt) as a single integration. If reads are not compressed, all identical reads are counted when calculating integration into/near genome features. This may cause biased results if duplicate integration site reads are due to clonal outgrowth or PCR bias rather than independent integration events.
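The idea behind read compression can be sketched as follows. This is an illustration only, not InSiTE's implementation: the `compress_sites` helper is hypothetical, strand is ignored, and a site is merged when it lies within the tolerance of the first position of the current cluster.

```python
from collections import Counter


def compress_sites(sites, tolerance=1):
    """Merge (chrom, pos) sites lying within `tolerance` nt of a cluster's first position.

    Returns a dict mapping each representative site to its summed read count.
    Hypothetical helper; strand handling is deliberately left out.
    """
    counts = Counter(sites)
    merged = {}
    last = None
    for chrom, pos in sorted(counts):
        if last is not None and last[0] == chrom and pos - last[1] <= tolerance:
            # Same chromosome and close enough: add reads to the current cluster.
            merged[last] += counts[(chrom, pos)]
        else:
            # Start a new cluster represented by this site.
            last = (chrom, pos)
            merged[last] = counts[(chrom, pos)]
    return merged


# Made-up mapped positions, one tuple per read.
reads = [("chr1", 100), ("chr1", 101), ("chr1", 100), ("chr1", 500), ("chr2", 100)]
compressed = compress_sites(reads)
print(compressed)
```

With compression, the three reads near chr1:100 count as one integration event with three reads; without it, each read would be counted separately in feature statistics.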
As controls, randomized data can also be generated in two ways:
- `-n` generates random nucleotides to replace the reads provided (the number of nt per read and the quality scores will match the input files). This is useful for calibrating non-specific mappings and for tuning `--min` to balance the number of reads retained against the number of likely false alignments.
- `-r` generates random integration sites. The number of sites and the sites per chromosome are matched to the provided real dataset, and the sites are randomly distributed throughout each chromosome. This is useful for comparing the integration in/near genome features in the actual dataset to a truly random distribution so that biases can be calculated.
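A matched random control of the second kind can be sketched like this. The `random_sites` helper, the per-chromosome counts, and the toy chromosome lengths are all assumptions for illustration; InSiTE's own sampling may differ (for example in how it treats strand or unmappable regions).

```python
import random


def random_sites(sites_per_chrom, chrom_lengths, seed=None):
    """Draw uniformly distributed positions, keeping the per-chromosome site counts.

    Hypothetical helper: positions are 1-based and sampled independently.
    """
    rng = random.Random(seed)
    sites = []
    for chrom, n_sites in sites_per_chrom.items():
        length = chrom_lengths[chrom]
        sites.extend((chrom, rng.randint(1, length)) for _ in range(n_sites))
    return sites


# Toy values, not real genome data.
observed_counts = {"chrA": 3, "chrB": 2}
toy_lengths = {"chrA": 1_000_000, "chrB": 250_000}
control = random_sites(observed_counts, toy_lengths, seed=0)
print(control)
```

Comparing feature overlap for the real sites against such a control gives a baseline for what an unbiased distribution would look like.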
A number of output files are generated, and useful information is printed to the terminal (and log file).
- `[root_file]_trimmed.fast[a/q]`: sequences after trimming off adapters/barcodes
- `[root_file]_retrieved_2bit.fasta`: fasta file of the sequences spanning `--lwindow` nt upstream and `--rwindow` nt downstream of each integration site, used for creating the logo plot
- `[root_file]_IS_logo.svg`: image showing the consensus logo of the integration sites
- `[root_file].sam`: sam file of mapped reads after trimming, mapped to a genome location where possible
- `[root_file]_IS.bam`: bam file containing mapped single nt insertion sites, sorted by position in the genome (chromosome and position)
- `[root_file]_abundantsort.bam`: bam file containing single nt insertion sites, sorted by frequency of reads at that site (count indicated by the IH:i:## tag)
- `[root_file]_IS_mappings.csv`: file listing integration locations, matching the format provided by GeneWerk
- `[root_file]_distances.csv`: file containing lists of distances from each read to the features (e.g. TSS) for which distance mapping is indicated
- `[root_file]_IS_annotations.csv`: summary file showing where reads mapped for each indicated genome feature (exon, intron, TSS, etc.)
- `[root_file].log`: log file of the run
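The per-site read count can be pulled out of a text-format SAM record (for example, a line produced by `samtools view` on one of the .bam outputs) by scanning the optional fields for the IH tag. The sketch below is a minimal illustration; the `read_count_from_sam_line` helper and the example record are hypothetical.

```python
def read_count_from_sam_line(line, tag="IH"):
    """Return the integer value of an optional SAM tag, or None if it is absent.

    SAM records have 11 mandatory tab-separated fields; optional TAG:TYPE:VALUE
    fields follow from the 12th column onward.
    """
    fields = line.rstrip("\n").split("\t")
    for field in fields[11:]:
        name, value_type, value = field.split(":", 2)
        if name == tag and value_type == "i":
            return int(value)
    return None


# Made-up single-nt record carrying a read count of 17.
record = "site1\t0\tchr1\t100\t42\t1M\t*\t0\t0\tA\tI\tIH:i:17"
print(read_count_from_sam_line(record))
```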
The parent directory contains the '/scripts', '/refrence_datasets', and '/modules' folders. The main script is /scripts/InSiTE.py, supporting modules are in /modules, and the list of chromosome names is in /refrence_datasets.
- Annotation bed files for intron/exon/etc. (these can be acquired from places like https://genome.ucsc.edu/cgi-bin/hgTables)
- .2bit genome file (available at https://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit). Optional but increases performance; if not present, specify `--remote`.
- bowtie2 genome index file (e.g. ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz)
- bowtie2
- bbtools
- weblogo
- TwoBitToFa
- Python3 with packages:
  - Biopython
  - colorama
  - pysam
  - pybedtools
  - twobitreader
- If python3 is at /btapps/miniconda3/bin/python3, the script can be run directly; otherwise it needs to be run via python3.
usage: InSiTE.py [-h] [-q FASTQ] [-r] [-n] [-c CSV] [-a FASTA] [-s SAM]
[--no_seqs] [-u] [-p PAIRS] [--no_annotate] [--barcode NNNNN]
[--vectors VECTORS] [--lwindow LWINDOW] [--rwindow RWINDOW]
[--samwindow SAMWINDOW] [--remote] [--min MIN]
[--primer5 NNNNN] [--primer3 NNNNN] [--trim5 TRIM5]
[--trim3 TRIM3] [--feature intron/exon/transcript/TSS/etc]
[--dist True/False] [--close CLOSE]
[--chromosome_ids /path/to/chromosomes.csv]
[--bowtielocation /path/to/bowtie2]
[--bowtieindex /path/to/bowtieindex]
[--weblogolocation /path/to/weblogo]
[--twobitlocation /path/to/TwoBitToFa]
[--twobitgenomelocation /path/to/genome.2bit]
[--annotations /path/to/annotation_file.bed/gff/gtf]
[--supress_csv] [--supress_fasta] [--supress_logo]
[--append_summary] [-v]
optional arguments:
-h, --help show this help message and exit
-q FASTQ, --fastq FASTQ
fastq input file
-r, --rand_is use random integration sites matched to given query
set instead of actual query set
-n, --rand_nt use random sequences for mapping
-c CSV, --csv CSV csv input file. WARNING: This feature may not work.
-a FASTA, --fasta FASTA
fasta input file
-s SAM, --sam SAM sam/bam input file
--no_seqs do not get sequences from either entrez or local
TwoBit genome around locations indicated by
genome_location_csv
-u, --uncompress_reads
compress duplicate reads and reads shifted +/- 1nt,
number of reads are compressed in csv, fasta, and
mapping outputs
-p PAIRS, --pairs PAIRS
specify file with paired reads for paired end reads
(used in conjunction with '-q' or '-a')
--no_annotate do not map insertion sites to genome annotations
--barcode NNNNN barcode sequence to trim off of reads
--vectors VECTORS FASTA file containing sequences to exclude from
mapping, for example plasmids used in the experiment
--lwindow LWINDOW number of nucleotides upstream of integration site to
return
--rwindow RWINDOW number of nucleotides downstream of integration site
to return
--samwindow SAMWINDOW
deprecated
--remote get sequences from entrez server instead of 'LOCAL'
TwoBit genome
--min MIN minimum length of a read to try mapping. default (25)
will usually avoid any false positives in read sets of
200k reads
--primer5 NNNNN 5' primer sequence to remove from reads
--primer3 NNNNN 3' primer sequence to remove from reads
--trim5 TRIM5 additional (non-genomic) nts to trim off of 3' end of
reads
--trim3 TRIM3 additional (non-genomic) nts to trim off of 5' end of
reads
--feature intron/exon/transcript/TSS/etc
feature names found in feature files to map reads to,
e.g. "exon"
--dist True/False whether to map distance of each read, or only whether
reads overlap with feature. Same number of distance
variables must be given as features.
--close CLOSE distance in bp to be considered close to feature
--chromosome_ids /path/to/chromosomes.csv
--bowtielocation /path/to/bowtie2
--bowtieindex /path/to/bowtieindex
--weblogolocation /path/to/weblogo
--twobitlocation /path/to/TwoBitToFa
--twobitgenomelocation /path/to/genome.2bit
--annotations /path/to/annotation_file.bed/gff/gtf
location of annotation file(s) (bed/gff/gtf), must be
same number of files as features specified
--supress_csv do not output csv file
--supress_fasta do not output fasta file
--supress_logo do not output logo
--append_summary add summary metrics to summary.csv file
-v, --verbose verbose output and logging
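To make the `--dist` and `--close` options more concrete, the sketch below computes the distance from an integration site to the nearest feature position (e.g. a TSS) on the same chromosome and flags it as close when the distance is within a threshold. The `nearest_distance` helper, the feature coordinates, and the 1000 bp threshold are assumptions for illustration; the distance here is unsigned, and InSiTE's own calculation and default threshold may differ.

```python
from bisect import bisect_left


def nearest_distance(pos, feature_positions):
    """Unsigned distance from `pos` to the nearest entry of a sorted, non-empty list."""
    i = bisect_left(feature_positions, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = feature_positions[max(i - 1, 0):i + 1]
    return min(abs(pos - feature) for feature in candidates)


# Made-up TSS coordinates on one chromosome, sorted ascending.
tss_positions = [1000, 5000, 9000]
close_threshold = 1000  # hypothetical value for --close

for site in (900, 5200, 7000):
    distance = nearest_distance(site, tss_positions)
    print(site, distance, distance <= close_threshold)
```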