💻 Analysis code and bioinformatics pipelines for the paper "HIV-1-induced host genomic R-loops dictate HIV-1 integration site selection"

hiv1-rloop

This repository provides analysis code and bioinformatics pipelines for the article entitled "HIV-1-induced host genomic R-loops dictate HIV-1 integration site selection."

This repository is organized into several directories:

dripc-seq-processing

Related to Fig. 1, 2, 4 and Supplementary Fig. 1, 2, 4

This directory contains the DRIPc-seq processing pipeline for R-loop identification. Sequencing reads were quality-controlled using FastQC v2.8, and adapters were trimmed using Trim Galore! v0.6.6 based on Cutadapt v2.8. Trimmed reads were aligned to the hg38 reference genome using bwa v0.7.17-r1188. Read deduplication and peak calling were done using MACS v2.2.7.1. Because R-loops vary in length, they appear as both narrow and broad peaks in DRIPc-seq read alignments, so two independent MACS2 callpeak runs were performed: one for narrow peaks and one for broad peaks. Narrow and broad peaks were then merged using Bedtools v2.26.0.
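
The merging step can be illustrated with a minimal pure-Python sketch. The pipeline itself uses Bedtools, and the exact subcommand and options are not stated here, so the function below (`merge_peaks`, a hypothetical name) only assumes that overlapping or directly adjacent narrow and broad peaks are collapsed into a single interval, using BED-style half-open `(chrom, start, end)` coordinates.

```python
from collections import defaultdict


def merge_peaks(narrow, broad):
    """Merge two lists of (chrom, start, end) peaks into non-overlapping intervals.

    Assumption: overlapping or book-ended intervals are collapsed into one.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in list(narrow) + list(broad):
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or book-ended: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged
```

For example, a narrow peak `("chr1", 100, 200)` and a broad peak `("chr1", 150, 400)` would be reported as the single interval `("chr1", 100, 400)`.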

rna-seq-processing

Related to Fig. 2

This directory contains the RNA-seq data processing pipeline for the quantification of protein-coding gene expression levels. RNA-seq reads were quality-controlled and adapter-trimmed as in DRIPc-seq processing. Processed reads were aligned to the hg38 reference genome with GENCODE v38 gene annotation using STAR v2.7.3a. Gene expression was quantified using RSEM v1.3.1.

rna-seq-processing-te

Related to Supplementary Fig. 2

This directory contains the RNA-seq data processing pipeline for the quantification of transposable element (TE) expression levels using TEtranscripts v2.2.1. Processed reads were first aligned to the hg38 reference genome with GENCODE and RepeatMasker TE annotation using STAR v2.7.3a. In this case, STAR options were modified as follows so that multimapping reads could be used in downstream analyses: --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --outMultimapperOrder random --runRNGseed 77 --outSAMmultNmax 1 --outFilterType BySJout --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000. Expression levels of TEs were quantified as read counts using the TEcount command.
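
As a rough illustration, the STAR options listed above could be assembled into a command line from Python as follows. The option values are copied as written in this README and should be checked against the STAR manual for the installed version. The function name, the index directory, the FASTQ paths, the output prefix, and the thread count are all hypothetical placeholders, not taken from the repository's scripts.

```python
# Multimapping-related STAR options, copied as written in the description above.
TE_STAR_OPTIONS = [
    "--outFilterMultimapNmax", "100",
    "--winAnchorMultimapNmax", "100",
    "--outMultimapperOrder", "random",
    "--runRNGseed", "77",
    "--outSAMmultNmax", "1",
    "--outFilterType", "BySJout",
    "--alignSJoverhangMin", "8",
    "--alignSJDBoverhangMin", "1",
    "--alignIntronMin", "20",
    "--alignIntronMax", "1000000",
    "--alignMatesGapMax", "1000000",
]


def build_star_command(genome_dir, fastq_files, out_prefix, threads=8):
    """Return an argument list for a STAR alignment run (hypothetical helper).

    genome_dir, fastq_files, out_prefix and threads are illustrative inputs;
    the list can be passed to subprocess.run(cmd, check=True).
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outFileNamePrefix", out_prefix,
        *TE_STAR_OPTIONS,
    ]
```

Keeping the options in one list makes it easy to compare them with the default settings used for the gene-level RNA-seq pipeline.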

is-rloop-colocalization

Related to Fig. 4

This directory contains the analysis pipeline for testing colocalization of integration sites and R-loops. Enrichment of integration sites near R-loop peaks was tested with a randomized permutation test. Randomized R-loop peaks were generated using bedtools shuffle so that the number and the length distribution of the R-loop peaks were fully preserved during randomization. Integration sites were randomized in the same way using bedtools shuffle. Randomization was repeated 100 times. Note that ENCODE blacklist regions were excluded while shuffling R-loops and integration sites, so that inaccessible genomic regions did not enter the analysis. For each observed (or randomized) integration site, the closest observed (or randomized) R-loop peak and the corresponding genomic distance were identified using bedtools closest. The distribution of genomic distances was plotted to show the local enrichment of integration sites, measured as the increased proportion of integration sites within the 30 kb window centered on R-loops compared with their randomized counterparts.
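
The comparison between observed and randomized data can be sketched in pure Python as below. This is a simplified illustration, not the repository's implementation: it works on a single chromosome, takes already-shuffled site and peak sets as input instead of calling bedtools, and interprets the 30 kb window as a distance of at most 15,000 bp to the closest peak (an assumption). The empirical p-value with a +1 correction is also an assumption, since the text above does not say how significance was summarized. All function names are hypothetical.

```python
def distance_to_closest_peak(pos, peaks):
    """Distance in bp from a site to the closest half-open (start, end) peak.

    Returns 0 when the site lies inside a peak.
    """
    return min(max(start - pos, pos - (end - 1), 0) for start, end in peaks)


def fraction_within(sites, peaks, half_window=15_000):
    """Fraction of sites whose closest peak is at most half_window bp away."""
    near = sum(distance_to_closest_peak(p, peaks) <= half_window for p in sites)
    return near / len(sites)


def permutation_enrichment(obs_sites, obs_peaks, randomized_pairs, half_window=15_000):
    """Compare the observed fraction with fractions from randomized data.

    randomized_pairs: iterable of (shuffled_sites, shuffled_peaks), e.g. 100 pairs.
    Returns (observed fraction, list of null fractions, empirical p-value).
    """
    observed = fraction_within(obs_sites, obs_peaks, half_window)
    null = [fraction_within(s, p, half_window) for s, p in randomized_pairs]
    # One-sided empirical p-value; the +1 correction is an assumption of this sketch.
    p_value = (1 + sum(f >= observed for f in null)) / (1 + len(null))
    return observed, null, p_value
```

The brute-force closest-peak search is quadratic in the worst case; for genome-scale data a sorted-interval search, as bedtools closest performs, is the practical choice.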

is-seq-processing

Related to Fig. 4

This directory contains the processing pipeline for HIV-1 integration site sequencing data. Quality control of HIV-1 integration site sequencing reads was done using FastQC v0.11.9. To discard primers and linkers specific to integration site sequencing from the reads, we used Cutadapt v2.8 with options -u 49 -U 38 --minimum-length 36 --pair-filter any --action trim -q0,0 -a [LINKER] -A TGCTAGAGATTTTCCACACTGACTGGGTCTGAGGG -A GGGTCTGAGGG --no-indels --overlap 12. This allowed the first position of a read alignment to directly represent the genomic position of HIV-1 integration. Processed reads were aligned to the hg38 reference genome using bwa v0.7.17-r1188, and integration sites were identified by an in-house Python script. Genomic positions supported by more than five read alignments were considered HIV-1 integration sites.
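
The support-count filter in the last step can be sketched as follows. This is not the in-house script itself: it assumes the first aligned position of each read has already been extracted as a `(chrom, position, strand)` tuple, and the function name and input format are hypothetical.

```python
from collections import Counter


def call_integration_sites(read_starts, min_reads_exclusive=5):
    """Return positions supported by more than `min_reads_exclusive` alignments.

    read_starts: iterable of (chrom, position, strand) tuples, one per read
    alignment, where position is the first position of the alignment
    (assumed input format for this sketch).
    """
    counts = Counter(read_starts)
    return sorted(site for site, n in counts.items() if n > min_reads_exclusive)
```

A position seen in exactly five alignments is therefore discarded, while one seen in six or more is kept.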

vector-is-seq-processing

Related to Fig. 5

This directory contains the pipeline for vector IS-seq data processing. Pipeline details are the same as in is-seq-processing.

Citation

Park, K., Lee, D., Jeong, J., Lee, S., Kim, S., & Ahn, K. Human immunodeficiency virus-1 induces and targets host genomic R-loops for viral genome integration. bioRxiv (2024)

@article{park2024human,
  title={Human immunodeficiency virus-1 induces and targets host genomic R-loops for viral genome integration},
  author={Park, Kiwon and Lee, Dohoon and Jeong, Jiseok and Lee, Sungwon and Kim, Sun and Ahn, Kwangseog},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}