Datenanalyse-2021

Student project carried out at Loosolab (Max Planck Institute for Heart and Lung Research, Bad Nauheim) as part of the master's degree program "Bioinformatics and Systems Biology" at Justus-Liebig-University Gießen and University of Applied Sciences Mittelhessen (THM).

The project aims to investigate the landscape of chromatin accessibility in human cells, compute transcription factor activity, and derive transcription factor co-binding and regulatory circuits.

The analysis is based on single-cell ATAC-seq data from the CATlas database, which is processed in several steps by the work packages described below.

Work package 1:

WP1 uses scanpy and episcanpy for processing of anndata files, which includes filtering and clustering of the provided data.

After clustering, peaks are assigned to genes, and unassigned or intergenic peaks are removed. The SCSA script is then used to assign a cell type to each cluster based on its most relevant marker genes.
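The peak-to-gene step described above can be illustrated with a small sketch. This is not the WP1 implementation: it assumes a simple "nearest transcription start site within a fixed window" rule, and the gene list, the function name `assign_peaks` and the `max_dist` cutoff are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical annotation: (chromosome, TSS position, gene name).
GENES = [
    ("chr1", 1_000, "GENE_A"),
    ("chr1", 50_000, "GENE_B"),
    ("chr2", 10_000, "GENE_C"),
]


def assign_peaks(peaks, genes, max_dist=5_000):
    """Map each peak to the nearest TSS within max_dist; drop all other peaks."""
    # Sorted list of (tss, name) per chromosome for binary search.
    by_chrom = {}
    for chrom, tss, name in sorted(genes):
        by_chrom.setdefault(chrom, []).append((tss, name))

    assigned = {}
    for chrom, start, end in peaks:
        candidates = by_chrom.get(chrom, [])
        if not candidates:
            continue  # no annotated gene on this chromosome
        center = (start + end) // 2
        idx = bisect_left(candidates, (center, ""))
        # The nearest TSS is either just left or just right of the peak center.
        neighbours = candidates[max(idx - 1, 0): idx + 1]
        tss, name = min(neighbours, key=lambda g: abs(g[0] - center))
        if abs(tss - center) <= max_dist:
            assigned[(chrom, start, end)] = name
    return assigned


peaks = [
    ("chr1", 800, 1_400),      # close to GENE_A
    ("chr1", 20_000, 20_500),  # intergenic, too far from any TSS
    ("chr2", 9_000, 9_600),    # close to GENE_C
    ("chr3", 100, 600),        # chromosome without annotation
]
result = assign_peaks(peaks, GENES)
print(result)
```

Only the two peaks that lie near a transcription start site are kept. The intergenic peak and the peak on the unannotated chromosome are dropped.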

Work package 2:

WP2 preprocesses snATAC-seq .fastq files using SnapATAC. First, a .snap file is created from the respective .fastq files using SnapTools. This .snap file is read into R as a snap object, which is required for all further processing. The following steps are then carried out with the R package SnapATAC: barcode filtering, bin filtering, dimensionality reduction and clustering, and finally peak calling for each cluster. In addition to the processing steps, we provide a Python script that assigns cell types to the respective clusters using UROPA, PanglaoDB and the created peak files.
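To make the idea behind barcode filtering concrete, here is a minimal Python sketch. The actual WP2 pipeline runs in R with SnapATAC, so this is not its API. The function `filter_barcodes`, the input layout and all thresholds are assumptions chosen for illustration: a barcode is kept if its log10 fragment count and its fraction of promoter-overlapping fragments both fall inside a given range.

```python
import math


def filter_barcodes(stats, min_log_frags=3.0, max_log_frags=5.0,
                    min_promoter_ratio=0.15, max_promoter_ratio=0.6):
    """Keep barcodes whose depth and promoter ratio look like real cells.

    stats maps barcode -> (total fragments, fragments overlapping promoters).
    All thresholds are illustrative and should be tuned per data set.
    """
    kept = []
    for barcode, (n_frags, n_promoter) in stats.items():
        if n_frags <= 0:
            continue  # empty barcode, nothing to evaluate
        log_frags = math.log10(n_frags)
        ratio = n_promoter / n_frags
        if (min_log_frags <= log_frags <= max_log_frags
                and min_promoter_ratio <= ratio <= max_promoter_ratio):
            kept.append(barcode)
    return sorted(kept)


# Hypothetical per-barcode statistics.
barcode_stats = {
    "AAAC": (5_000, 1_500),    # good depth, ratio 0.30
    "CCGT": (200, 60),         # too few fragments
    "GGTA": (20_000, 18_000),  # ratio 0.90, suspiciously promoter-heavy
    "TTAG": (12_000, 2_400),   # good depth, ratio 0.20
}
kept_barcodes = filter_barcodes(barcode_stats)
print(kept_barcodes)
```

Low-depth barcodes are typically empty droplets or debris, while extreme promoter ratios hint at technical artifacts, which is why both criteria are combined here.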

Work package 3:

WP3 addresses cell-type-specific transcription factor activity. The data provided by WP2 is used to analyze transcription factor activity across tissues and their clusters. Cell types and tissues are compared to gain insight into the transcription factors responsible for their identity.

Work package 4:

WP4 takes a closer look at chromatin peak co-accessibility. Peaks describe regions of open chromatin. For a given peak, the Cicero algorithm determines whether distal peaks show the same pattern of open or closed chromatin. Open promoter peaks and their corresponding distal peaks are displayed, indicating whether open chromatin sites exist in the distal genome that could contribute to transcriptional regulation by enhancers or transcription factors.
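The notion of co-accessibility can be illustrated with a deliberately simplified sketch. Cicero itself fits a regularized, distance-penalized model on aggregated cells. The code below swaps this for a plain Pearson correlation between per-cell accessibility vectors and a hard distance cutoff. The function names, the `max_dist` and `min_score` parameters and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation is undefined, treat as 0
    return cov / sqrt(var_x * var_y)


def coaccessible_peaks(promoter, accessibility, positions,
                       max_dist=500_000, min_score=0.5):
    """Return (distal peak, score) pairs that open and close with the promoter.

    accessibility maps peak -> list of 0/1 values (one per cell),
    positions maps peak -> genomic coordinate on the same chromosome.
    """
    hits = []
    for peak, profile in accessibility.items():
        if peak == promoter:
            continue
        if abs(positions[peak] - positions[promoter]) > max_dist:
            continue  # outside the window considered for this promoter
        score = pearson(accessibility[promoter], profile)
        if score >= min_score:
            hits.append((peak, score))
    return sorted(hits, key=lambda hit: -hit[1])


# Toy data: six cells, one promoter peak and three distal peaks.
accessibility = {
    "P1": [1, 1, 0, 0, 1, 0],
    "D1": [1, 1, 0, 0, 1, 0],  # same pattern as the promoter
    "D2": [0, 0, 1, 1, 0, 1],  # opposite pattern
    "D3": [1, 1, 0, 0, 1, 0],  # same pattern, but too far away
}
positions = {"P1": 100_000, "D1": 150_000, "D2": 300_000, "D3": 900_000}
links = coaccessible_peaks("P1", accessibility, positions)
print(links)
```

Only `D1` is reported: `D2` is anti-correlated with the promoter and `D3` lies outside the distance window.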

Work package 5:

The goal of WP5 is to discover new motifs based on the data generated by WP3. The WP provides a pipeline that prepares the input data and runs the motif discovery automatically. It also offers scripts to further analyze the newly found motifs and give first insights into their potential biological meaning.

Work package 6:

WP6 searches for and compares transcription factor co-occurrences based on the data of WP1, WP2, WP3 and WP5 using the Python package TF-COMB. WP6 offers Jupyter notebooks for finding transcription factor co-occurrences, for examining the correlation between binding orientation and binding distance of transcription factors, and for taking a closer look at how the binding distance of the same transcription factor co-occurrence differs between clusters.
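As a rough illustration of what "co-occurrence" means here, the following sketch counts how often binding sites of two different transcription factors lie within a fixed distance of each other and scores each pair with a cosine-style measure. It does not use the TF-COMB API. The function names, the `max_dist` window and the example sites are assumptions made for this example only.

```python
from collections import Counter
from math import sqrt


def count_cooccurrences(sites, max_dist=100):
    """Count TF pairs whose binding sites lie within max_dist of each other.

    sites is a list of (chromosome, position, tf_name) tuples.
    """
    single = Counter(tf for _, _, tf in sites)
    pairs = Counter()
    ordered = sorted(sites)  # by chromosome, then position
    for i, (chrom_a, pos_a, tf_a) in enumerate(ordered):
        for chrom_b, pos_b, tf_b in ordered[i + 1:]:
            # Sites are sorted, so everything after this one is even further away.
            if chrom_b != chrom_a or pos_b - pos_a > max_dist:
                break
            if tf_a != tf_b:
                pairs[tuple(sorted((tf_a, tf_b)))] += 1
    return single, pairs


def cosine_scores(single, pairs):
    """Normalize pair counts by the geometric mean of the single TF counts."""
    return {
        (a, b): count / sqrt(single[a] * single[b])
        for (a, b), count in pairs.items()
    }


# Hypothetical binding sites.
sites = [
    ("chr1", 100, "CTCF"), ("chr1", 150, "SP1"),
    ("chr1", 1_000, "CTCF"), ("chr1", 1_040, "SP1"),
    ("chr1", 5_000, "GATA1"), ("chr2", 120, "SP1"),
]
single, pairs = count_cooccurrences(sites)
scores = cosine_scores(single, pairs)
print(pairs, scores)
```

The normalization keeps very frequent factors from dominating the ranking simply because they have many binding sites.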

[Figure: workflow]

More information about the individual work packages can be found by following the links provided above or in the wiki.