Gibby (v0.1.0)

Gibby is a Python package designed to find the motif of a transcription factor de novo based on ChIP-seq data. Given a genome and peak file, the tool extracts sequences from the genome based on the peak file, and runs Gibbs Sampling on those peak sequences. During the sampling process, kmers across the peak sequences are compared and scored. The output of Gibbs sampling is a list potential motifs of length k; the position frequency and position weight matrices of these potential motifs are created and saved as a text file. The position weight matrix is visualized using seqLogo to observe which motif was most strongly conserved among the peak regions of the transcription factor. Implementation details can be found in Project_Report.pdf and supplementary data used in the report can be found in the benchmarking and analysis folders of the repo.

What's the input?
What's the output?
What is Gibbs Sampling?
All Features
How do I install Gibby?
I'm getting errors! What do I do?
How do I use Gibby?
Examples

Input

Gibby can take HOMER peaks.txt files and .bed files generated from ChIP-seq data as input. In addition, we require the appropriate genome assembly (which was used to generate the peak file) in FASTA format.

Output

Three files are generated: PFM.txt, PWM.txt, and motif.png. These files include the position frequency matrix and position weight matrix generated by gibbs sampling, and a visualization of the sequence motif showing which motif was most strongly conserved among the peak regions.

What is Gibbs Sampling?

Gibbs Sampling is a statistical method used to estimate the distribution of variables when direct sampling is difficult. It is particularly useful in motif finding where we want to identify common patterns (motifs) in a set of sequences.

For example ...

Imagine you are trying to perfect a secret recipe, but you don't have all the ingredients at once. You start by randomly choosing some ingredients and proportions, then you taste the result. Based on how good or bad it tastes, you keep some ingredients, change others, and try again. Each iteration helps you understand what works and what doesn't, gradually leading you to the best recipe.

In the context of genomic sequences, Gibbs Sampling helps us identify common patterns (motifs) by iteratively refining our guesses based on the sequences we have. Each iteration, which involves some randomness, helps to gradually reveal the underlying motifs more accurately.

Process for Gibbs Sampling

The problem we are trying to solve here is:
Given S sequences, find the most mutually similar subsequences of length k from each sequence

In order to tackle this problem it is crucial to look at the entire statistical landscape by sampling every single sequence and seeing if we can converge to a minima that is the optimal or somewhere extremely close to the optimum.

Randomly choose a starting position for the subsequence of length k in each of the S sequences. For each iteration, leave out one sequence, say sequence s'.

Using the remaining S-1 sequences, build a position-specific scoring matrix (PSSM) or profile matrix. This matrix represents the frequency of each nucleotide at each position of the subsequence. USE PSEUDOCOUNTS!!!

Calculate the probability of every possible subsequence of length k in the left-out sequence s' using the profile matrix. This involves calculating the likelihood of the subsequence given the profile and normalizing it to get a probability distribution.

$$Random(\frac{2}{8^4}, \frac{2}{8^4}, \frac{72}{8^4}, \frac{24}{8^4}, \frac{8}{8^4},\frac{4}{8^4}, \frac{1}{8^4}) = Random(\frac{2}{113},\frac{2}{113},\frac{72}{113},\frac{24}{113},\frac{8}{113},\frac{4}{113},\frac{1}{113})$$

Sample a new position for the subsequence in sequence s' according to the probability distribution obtained in the previous step. This new position replaces the old position for sequence s'. In this case we chose ACCT.

After this we will score the motifs. If the score is better than previous iteration we will keep those set of motifs so that we are always progressing in the correct direction. Then we will repeat for multiple iterations!

We have seen that in around 500 - 1000 iterations the positions of the subsequences have stabilized across iterations. However, this may take some testing over 2~5 runs based on your dataset.

Reference: bioinformatics algorithms an active learning approach and BIMM 181

TL;DR: A short explanation of Gibbs Sampling

Steps for Gibbs Sampling:

Another resource you can use is this video from our beloved professor from UCSD, Dr.Pavel.

Features

Utilize Gibbs sampling to identify motifs within the peak regions for a given transcription factor.
Compute Position Frequency/Weight Matrices (PFMs, PWMs)

Installation

You can install the package via pip:

pip install git+https://github.com/kairitanaka/gibby.git

and verify by running:

gibby -h

If you get an error that the command is not found, make sure the directory ~/.local/bin is included in your $PATH environment variable. Or consider adding the directory to your $PATH by running:

export PATH=$PATH:$HOME/.local/bin

This will allow you to simply type gibby to run the tool. You will have to repeat this for every new terminal session.

If you come across the error:

error: cannot find command 'git'

Please make sure to install Git. You can find installation instructions here.

Troubleshooting

While running the tool, you may come across the error:

OSError: Could not find Ghostscript on path. There should be either a gs executable or a gswin32c.exe on your system's path

We have observed that this error occurs for some users but not for others. Ghostscript is an application that the seqLogo package requires as part of its process in visualizing the motif; some systems appear to already have Ghostscript installed, while others do not. If on Windows, please install Ghostscript from their website (we used Ghostscript 10.03.1 for Windows (64 bit), AGPL Release). On macOS, you can use Homebrew to install the application:

brew install ghostscript

For Linux, you can use:

sudo apt-get update
sudo apt-get install ghostscript

If you're receiving other errors, please feel free to report them on Issues!

Basic Usage

gibby, from Gibby (ver 0.1.0), utilizes Gibbs Sampling to find potential motifs that are in peak regions of the genome. The potential motifs are used to generate a position frequency matrix (PFM), a position weight matrix (PWM), and a motif logo based on these matrices. The options appear as below:

gibby [-h] -p PEAK_FILE -t PEAK_FILE_TYPE -g GENOME_FASTA_FILE [-s SCORE_THRESHOLD] [-k KMER_SIZE] [-i ITERATIONS]

Required arguments

Assuming you have successfully installed Gibby, running the tool is a fairly simple task. First, it's a good idea to run gibby -h to see what options are available. You will notice that Gibby will always require three arguments to be passed: the PEAK_FILE, the PEAK_FILE_TYPE ("bed" or "homer"), and the GENOME_FASTA_FILE. In addition, there are several optional arguments such as SCORE_THRESHOLD, KMER_SIZE, and ITERATIONS.

Optional arguments

SCORE_THRESHOLD represents the minimum score that the tool will use to filter out low quality peaks. Generally, if you know the general length of the motif that you are looking for, you can specify KMER_SIZE to the general length + 5 (to add some leeway). ITERATIONS is the number of times you want to run Gibbs Sampling. Running more iterations results in better detection of motifs in exchange for the increasing time it takes to finish the task. We have set default values for these variables which strike a good balance between speed and performance.

Command Line Arguments

-p, --peak_file: Input BED/Homer file with peak information. (required)
-t, --peak_file_type: Specify peak file type: 'bed' or 'homer' file. (required)
-g, --genome_fasta_file: Genome FASTA file. (required)
-s, --score_threshold: Score threshold for filtering peaks. (optional)
-k, --kmer_size: Size of k-mers for motif finding. (optional)
-i, --iterations: Number of iterations for Gibbs sampler. (optional)

Examples

OCT4 (Pou5f1)

In this example, we will be using the same files that were used in Lab 5 of the CSE 185 course. The files you will need are the following: (1) peaks.txt which is the peak file generated by HOMER for OCT4 (which we have also included in our repo) and (2) the GRCm38.fa Mus musculus genome assembly in FASTA format which should be available in the CSE 185 public genomes folder.

In this case, we have a HOMER peak file. In addition, suppose we want to choose 75 as the score threshold to filter out low-quality peaks. Running the tool would look like this:

gibby -p ~/lab5/tagdirs/Oct4/peaks.txt -t homer -g ~/public/genomes/GRCm38.fa -s 75

You will want to make sure that you have the correct paths for the peak file and the genome file. Running the command, the tool will take some time to fully run. After finishing, you will want to take a look at the generated motif logo which visualizes which motif was most conserved among the peak regions. In our case, we got:

and sometimes its reverse complement:

Since Gibbs Sampling is a stochastic process, your graphs will look different from our results. However, what should be similar are the large letters (nucleotides) stacked at the top and their relative positioning to one another. These large sets of letters represent the motif that Gibby discovered; all the other smaller nucleotides represent "noise" or nucleotides that were not as strongly conserved among peak regions for OCT4. You may get the motif or its reverse complement.

Compare your results with the published motifs for OCT4:

Reverse complement:

Feel free to test the tool with the other two transcription factors (KLF4 and SOX2) used in Lab 5. Since they use the same genome assembly, you just need to change the HOMER peak file!

ZNF24

In this example, we will be using a ChIP-seq dataset that is for the transcription factor ZNF24. The bed file we used is from ENCODE: https://www.encodeproject.org/files/ENCFF664TYB/. The GRCh38 genome assembly can be downloaded from the UCSC Genome Browser: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/.

Since we are using a .bed peak file, we can configure the argument accordingly and use default settings for the other arguments:

gibby -p ENCFF664TYB.bed -t bed -g hg38.fa

Again, make sure the paths to the files are correct. Below we share the motif result we got, and the published motif for ZNF24. Gibby can return the motif and sometimes its reverse complement due to its stochasticity.

Gibby Results:

Forward:

Reverse Complement:

Published motif:

Forward:

Reverse Complement:

That's it! Hopefully they look similar!

Authors

Joe Hwang (j8hwang@ucsd.edu) & Kairi Tanaka (ktanaka@ucsd.edu)