SiteGA - binding Sites recognition by Genetic Algorithm

Description

SiteGA implements the algorithm of Levitsky et al. (2007) for de novo motif search in a single ChIP-seq dataset (ChIP-seq peaks) (Tsukanov et al., 2022). SiteGA applies a genetic algorithm that evaluates motifs as patterns of frequencies of interdependent locally located dinucleotides (LPDs). This approach models short- and long-range interactions of nucleotide context within transcription factor binding sites (TFBS). Thus, the SiteGA approach is quite different from that of the traditional position weight matrix (PWM) model, which estimates sites by sums of additive contributions of nucleotide frequencies from all positions. The PWM approach ignores any dependencies between different positions within the TFBS.

For a given set of N foreground sequences (peaks), the fitness function F(X) = D(X) * E(X) estimates the alignment X = {x(1), x(2), ..., x(N)} of N best predicted sites in the the SiteGA model. The first factor D(X) reflects dependencies of positions within an alignment through the linear discriminant analysis approach (Levitsky et al. 2007). The second factor E(X) implies the average enrichment of k-mers within the alignment of the predicted sites (Tsukanov et al., 2022). The program indexes all possible positions of sites in the foreground set bwith weights equal to the logarithms of the enrichment of their intrinsic k-mers when comparing foreground and background sequence sets.

Requirements

SiteGA source code is written in C++ language. To compile exetubables from the source code you need:

In Linux system, C++ compiler, the minimal one is GCC compiler, more prominent is Clang under the management of Spack tool
In Windows system any VC++ package, e.g. Microsoft Visual Studio Community

Input data

The bulk of input data are ChIP-seq peaks in FASTA format. To optimize computation time, the default limit of 3000 bp for the length of any peak is used, although the length of a peak is not restricted by the algorithm. This same reason requires at least a moderate number of peaks to be used in the analysis, usually ~1000 peaks are sufficient to obtain a motif.

Source code

Folder src contains SiteGA source code files that relate to individual pipeline modules and are described below.

Binary files

Folder bin contains SiteGA source code files compiled by Clang under the management of Spack tool. It is recomended to use these binary files instead of those compiled by default g++ compiler due to computation speed. Just copy these binary files to the src folder and ensure that they are functional. For illustrative purpose only all scripts are started from binary files compiled with default g++ compiler, see build.sh.

1. Background sequence set generation

The background set of sequences is required as a complement to the foreground set to select the values of two parameters of the SiteGA model: motif length and the number of LPDs. The main purpose of the background set is to exclude artifact motifs related to a genome-specific sequence content bias, e.g. polyA, from the results of de novo motif search. To prepare genome-specific sets of background sequences, it is recommended to use the AntiNoise package (Raditsa et al., 2024).

2. Set parameters of a model through accuracy estimation

andy0bsn5cell.cpp performs the bootsrap cross-validation test to select parameters of a model providing the best performance for given foreground and background sets. A model has parameters of the motif length and the range for the number of LPDs. The motif length, the minimal value for the number of LPDs and it step are the parameters of command line. Recommended motif lengths are from 8 to 12 nt. The genetic algorithm simultaneously tests multiple values of the number of LPDs. The internal parameters of genetic algorithm are the range of the number of LPDs and its increment, default range consists of sixteen numbers of LPDs, 18, 20, etc. up to 48. The bootsrap cross-validation test denotes the partitioning of the foreground set into subsets of training and control sequence sets, the former is used to train a model and the latter to measure its performance. The maximal partial area under curve (pAUC) and area under precision-recall curve (AUCPR) are used to estimate the accuracy of a model for certain motif length and the number of LPDs. The term partial means that only the part of a ROC curve respecting the criterion FPR < 0.001 is impied for pAUC computation. The receiver operating characteristic (ROC) curve means the dependence between True Positive Rate (TPR) and False Positive Rate (FPR). TPR/FPR (axes Y/X of the ROC curve) are defined as the fraction of sequences from the foreground set containing predicted sites and the frequency of predicted sites in the background set, respectively. FPR = Nb/Wb, here Nb - number of predicted sites in the background set, Wb - total number of checked positions for predicted sites in the background set. The dependence between Precision and Recall represents AUCPR. Recall = TPR = TP/(TP+FN), Precision = TP/(TP+FP), here TP & FP denote counts of predicted sequences from foreground & background sets, FN denote counts of not predicted sequences from the foreground set.

3. Train a model

andy05cell.cpp trains a SiteGA model with selected by the bootstrap cross-validation procedure parameters of the motif length and the number of LPDs for given foreground and background sets. The resulting model is written in a special matrix file

4. Set threshold for a model

sitega_thr_dist_mat.cpp creates table of thresholds for model's hits search in test sequences (Scan test sequences with a model module) based on the distribution of recognition score for the set of whole-genome promoter sequences selected for the respective species. Prepeared sets are stored in genomes folder. The threshold selection implies the estimation of Expected Recognition Rate (ERR) of a model for promoter sequences of whole genome. The dependence of the threshold from ERR is stored in a special file containing a list of two columns: recognition thresholds in descending order and corresponding -Log10(ERR) values

5. Scan test sequences with a model

andy1_mat.cpp scans test sequences with a constructed model and the selected threshold of a model.

6. Scan whole genome with a model

andy1_mat_long.cpp scans whole genome as a set of full-sized chromosomes with a constructed model and the selected threshold of a model.

How to compile

In Linux system:

git clone https://github.com/parthian-sterlet/sitega

cd sitega/src

chmod a+x build.sh

./build.sh

In Windiws system:

separate compilation of all source files in VC++

Integration of modules

Scheme of modules functioning is given below

Modules Set parameters of a model through accuracy estimation and Train a model take the input data of the foreground and background sequence sets, the background set is prepared by the Background set generation module

The module Set parameters of a model through accuracy estimation is required for functionality of the module Train a model and all subsequent modules since the bootstrap procedure correctly selects parameters of a model (see output data block Table FPR vs. TPR, ROC curve & pAUC)

Modules Set threshold for a model and Scan test seauences with a model require file with SiteGA model which should be previously prepared by the module Train a model

The module Set threshold for a model is required to select a correct threshold for Scan test sequences with a model module

How to run separate modules

Lists of command line arguments for all modules are described below

Background set generation

Use background_genome_mono.cpp from the github repositiory AntiNoise

Set parameters of a model through accuracy estimation

andy0bsn5cell.cpp

path to fasta files with sets of foreground and background sequences (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
fasta file, set of foreground sequences
fasta file, set of background sequences
integer value, maximal length of one LPD (default value 6)
integer value, minimal number of LPDs (default value 18)
integer value, step of the number of LPDs (default value 2, i.e. numbers of LPDs 18, 20... etc. up to 48 are tested, the default number of distinct numbers of LPDs is sixteen)
integer value, length of motif in nucleotides, the motif length from 8 to 12 nt is recommended. The computation time increases greatly with the growth of motif length above 12 nt, due to substantion growtch of posible combinations of local nucleiotide contexts within a motif. The best solution is to test various motif lengths from 8 to 12 nt and select among them the one providing the best recognition accuracy.
integer value, cross-validation type specification: positive value below 1 means the ratio of the training subset size to that of control subset for repeated random subsampling validation, default value -1 means equal sizes of training and control subsets, odd/even peaks are used either for training and control subsets)
integer value, number of iterations in bootatrap (default 2 implies two-fold cross-validation)
integer value, k-mer length to take into account the sequence bias between foreground and background sequences (default 6, i.e. hexamer frequencies are involved)
path to output files (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
integer value, maximal peak length (default value is 3000)
output log file

Train a model

andy05cell.cpp

path to fasta files with sets of foreground and background sequences (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
fasta file with set of foreground sequences
fasta file with set of background sequences
integer value, maximal length of one LPD (default value 6)
integer value, length of motif (this value is selected in the bootstrap cross-valiation test, see above)
integer value, size, the number of LPDs (a value is selected in the bootstrap cross-valiation test, see above)
integer value, k-mer length to take into account the sequence bias between foreground and background sequences (default 6, i.e. hexamer frequencies are involved)
path to output files (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
integer value, maximal peak length (default value is 3000)
output log file

Set threshold for a model

sitega_thr_dist_mat.cpp

path to file_profile_fasta (see argument #3 below, the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
sitega_matrix_file = input file SiteGA model from Train a model module
file_profile_fasta = input Whole-genome promoters set in fasta format (archive with example provides one for M. musculus)
output file, text mode, Table Threshold_vs_ERR, table SiteGA model threshold vs. Expected Recognition Rate (ERR)
output file, binary mode, SiteGA model as a matrix and Table Threshold_vs_ERR, table SiteGA model threshold vs. ERR)
double pvalue_large = maximal ERR (default value 0.001)
double score_min = lowest threshold of SiteGA model (default value 0.75)
double dpvalue = granulation value for ERR compaction in Table Threshold_vs_ERR, the default value 0.0000005 implies a moderate compaction

Scan test sequences with a model

andy1_mat.cpp

test file in fasta format with its path
sitega_matrix_file = input file SiteGA model from Train a model module
site_description_mode = 0 or 1. the value 0 means default mode, the value 1 means computation of frequencies of all LPDs for all tested sequences (option is used for the train fasta file to describe a SiteGA model)
ERR threshold = threshold for ERR of SiteGA model is used to select the SiteGA threshold according to input file Threshold vs ERR table from Set threshold for a model module
input file Threshold vs ERR table for SiteGA model threshold selection by ERR threshold
output file Profile with recognized hits

Scan whole genome with a model

andy1_mat_long.cpp

path to chromosome sequences in Fasta format, individual chromosomes are in separate files in fasta format, they are prepared by splitting the whole-genome fasta file by the https://github.com/parthian-sterlet/antinoise/blob/main/src/fasta_muliplefiles.cpp and https://github.com/parthian-sterlet/antinoise/blob/main/src/fasta_to_plain0.cpp from the AntiNoise package.
species and genome release (values hg38, mm10, rn6, zf11, dm6, and ce235; at10, gm21, zm73, and mp61; sc64 and sch294). The animals inludes human Homo sapiens hg38, mouse Mus musculus mm10, rat Rattus norvegicus Rnor_6.0, zebrafish Danio rerio GRCz11, fly Drosophila melanogaster dm6, and roundworm Caenorhabditis elegans WBcel235; the plants are arabidopsis Arabidopsis thaliana TAIR10, soybean Glycine max v2.1, maize Zea mays B73, and liverwort Marchantia polymorpha MpTak v6.1; the fungi are baker's yeast Saccharomyces cerevisiae R64-1-1 and fission yeast Schizosaccharomyces pombe ASM294v2.
sitega_matrix_file = input file SiteGA model from Train a model module
input file Threshold vs ERR table for SiteGA model threshold selection by ERR threshold
double value, ERR threshold = threshold for ERR of SiteGA model is used to select the SiteGA threshold according to input file Table Threshold vs ERR from Set threshold for a model module
output file Profile with recognized hits

Examples scripts:

These scripts implement various pipelines for Linux:

bootstrap test for a model - Estimate accuracy for a model module
training a model - Train a model module
thresholds selection by ERRs - Set threshold for a model module
sites recognition for a test fasta file with a model - Scan test seauences with a model module
sites recognition for a whole genome with a model - Scan whole genome with a model module

Interpretation of results

Background set generation

background_genome.cpp from the AntiNoise package prepares the background set that may be used for training (Train a model module) or performance evaluation (Set parameters of a model through accuracy estimation module)

Set parameters of a model through accuracy estimation

andy0bsn5cell.cpp several times construct distinct SiteGA models (parameter of command line 'number of iterations'), but each time it uses only a part of the foreground set for training, the rest (control) part of set is used to estimate FPR). The output file Table pAUC file reoresenting pAUC values computed by ROC curves for various values of parameters of a model allows selection of the one model among several ones. These several ones come from different numbers of LPDs and lengths L of a motif.

Train a model

andy05cell.cpp constructs one sitega model, with the numbers of locally positioned dinucleotides (LPDs) assigned according the parameter of the command line, this value deduced from the bootstrap cross validation test (see the previous paragraph). The selected model respecting the maximal pAUC in the bootstrap cross validation test, this model is written in output file SiteGA model with .mat extension

Set threshold for a model

sitega_thr_dist_mat.cpp computes the distribition of the recognition scores of SiteGA model, output file Table Threshold vs ERR represents two columns with respective thresholds and ERRs

Scan test sequences with a model

andy1_mat.cpp takes ready sitega model and threshold from input file Table Threshold vs FPR and constructs the profile of hits for tested file in fasta format, main output file is Profile file I.e. after the header of each peak with first '>' symbol from zero to several lines respect to separate hits, for each hit a start position, score, strand and whole sequence are printed. Interpretation of SiteGA model is performed through computation of LPD frequencies for the training fasta set (see fourth argument of the command line). Hence, a computed matrix Number_of_sequences vs. Number_of_LPDs can be used for the correlation analysis, e.g. for revealing the most correlated LPDs in a SiteGA model.

Scan whole genome with a model

andy1_mat_long.cpp takes ready sitega model and threshold from input file Table Threshold vs FPR and constructs the profile of hits for a genome in plain format, main output file is Profile file I.e. after the header of each peak with first '>' symbol from zero to several lines respect to separate hits, for each hit a start position, score, strand and whole sequence are printed. Interpretation of SiteGA model is performed through computation of LPD frequencies for the training fasta set (see fourth argument of the command line). Hence, a computed matrix Number_of_sequences vs. Number_of_LPDs can be used for the correlation analysis, e.g. for revealing the most correlated LPDs in a SiteGA model.