/GLORI-tools

bioinformatic pipeline for GLORI

Primary LanguagePythonMIT LicenseMIT

GLORI-tools v1.0

GLORI-tools currently works, but is still being optimized for a better user experience.

*🔥NEW🔥 Hi, many friends have encountered various issues while building the STAR index. Please make sure you run this step with enough memory. Additionally, you can use chromosomes 1-22 + X, Y, Z in hg38 as your genome. Please ensure that your genome sequence's name begins with "chr" as we have only tested GLORI on human and mouse genomes.

News

  • As of June 15th, 2023, We have updated the format of the .totalCR.txt file to make it more user-friendly and easier to read. Please see Section 5.3 for details.
  • As of June 23th, 2023, We have added an ciatiaon information for GLORI-tools.
  • As of June 23th, 2023, an additional run parameter has been added to obtain a comprehensive list of all annotated A citations. Please refer to Section 4.5 for details

Table of content

  • Background
  • Pre-processing the raw sequencing data
  • Installation and Requirement
  • Example and Usage
  • Maintainers and Contributing
  • License

Background

GLORI

We developed an absolute m6A quantification method (“GLORI”) that is conceptually similar bisulfite sequencing-based quantification of DNA 5-methylcytosine. GLORI relies on glyoxal and nitrite-mediated deamination of unmethylated adenosines while keeping m6A intact, thereby achieving specific and efficient m6A detection.

Pre-processing the raw sequencing data

Annotations

  • GLORI-tools exclusively accepts single-end sequencing reads. Prior to use, it is critical to ensure that the input reads are A-to-G converted reads. Therefore, it is important to carefully verify the orientation of the reads before processing them.
  • Before inputting data into GLORI-tools, data cleaning is necessary. This involves removing sequencing adapters, low-quality bases, PCR duplicates based on unique molecular identifiers (UMIs), and finally, removing the UMIs themselves.

Installation

  • trim_galore
  • seqkit
  • FASTX-Toolkit (including fastx_trimmer)

Example code

  • trim_galore -q 20 --stringency 1 -e 0.3 --length {length(5’UMI) +25nt} --path_to_cutadapt {cutadapter} --dont_gzip -o {output_dir1} {inputfile}
  • seqkit rmdup -j {Thread} -s -D {output_dupname} {filtered_file} > {filter_file2}
  • fastx_trimmer -Q 33 -f {length(5′UMI) +1nt} -i {filter_file2} -o {filter_file3}

GLORI-tools

GLORI-tools is a bioinformatics pipeline tailored for the analysis of high-throughput sequencing data generated by GLORI. GLORI pipeline

Installation and Requirement

Installation

GLORI-tools is written in Python3 and is executed from the command line. To install GLORI-tools simply download the code and extract the files into a GLORI-tools installation folder.

GLORI-tools needs the following tools to be installed and ideally available in the PATH environment:

  • STAR ≥ v2.7.5c
  • bowtie (bowtie1) version ≥ v1.3.0
  • samtools ≥ v1.10
  • python ≥ v3.8.3

GLORI-tools needs the following python package to be installed:

pysam,pandas,argparse,time,collections,os,sys,re,subprocess,multiprocessing,copy,numpy,scipy,math,sqlite3,Bio,statsmodels,itertools,heapq,glob,signal

Example and Usage:

1. Generate annotation files (required)

1.1 download files for annotation (required, using hg38 as example):

  • wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt
  • wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

1.2 download reference genome and transcriptome

  • wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

  • wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_rna.fna.gz

1.3 Unify chromosome naming in GTF file and genome file:

  • python ./get_anno/change_UCSCgtf.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf -j GCF_000001405.39_GRCh38.p13_assembly_report.txt -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens

2. get reference for reads alignment (required)

2.1 build genome index using STAR

  • python ./pipelines/build_genome_index.py -f $genome_fastafile -p 20 -pre hg38

you will get:

  • $ hg38.rvsCom.fa
  • $ hg38.AG_conversion.fa
  • the corresponding index from STAR

2.2 build transcriptome index using bowtie

2.2.1 get the longest transcript for genes (required)

Get the required annotation table files:

  • python ./get_anno/gtf2anno.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl

  • awk '$3!~/_/&&$3!="na"' GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl | sed '/unknown_transcript/d' > GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2

Get the longest transcript:

  • python ./get_anno/selected_longest_transcrpts_fa.py -anno GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2 -fafile GCF_000001405.39_GRCh38.p13_rna.fa --outname_prx GCF_000001405.39_GRCh38.p13_rna2.fa

2.2.2 build reference with bowtie

  • python ./pipelines/build_transcriptome_index.py -f $ GCF_000001405.39_GRCh38.p13_rna2.fa -pre GCF_000001405.39_GRCh38.p13_rna2.fa

you will get:

  • $ GCF_000001405.39_GRCh38.p13_rna2.fa.AG_conversion.fa
  • the corresponding index from bowtie

3. get_base annotation (optional)

3.1 get annotation at single-base resolution

  • python ./get_anno/anno_to_base.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2 -threads 20 -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.baseanno

3.2 get required annotation file for further removal of duplicated loci

  • python ./get_anno/gtf2genelist.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens -f GCF_000001405.39_GRCh38.p13_rna.fa -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist > output2

  • awk '$6!~/_/&&$6!="na"' GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist > GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist2

3.3 Removal of duplicated loci in the annotation file

  • python ./get_anno/anno_to_base_remove_redundance_v1.0.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.baseanno -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.noredundance.base -g GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist2

Finally, you will get annotation files:

  • GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.noredundance.base

4. alignment and call sites (required)

GLORI-tools takes cleaned reads as input and finally reports files for the conversion rate (A-to-G) of GLORI for each gene and m6A sites at single-base resolution with corresponding A rate representative for modification level.

4.1 Example shell scripts

Used files
Thread=1
genomdir=your_dir
genome=${genomdir}/hg38.AG_conversion.fa
genome2=${genomdir}/hg38.fa
rvsgenome=${genomdir}/hg38.revCom.fa
TfGenome=${genomdir}/GCF_000001405.39_GRCh38.p13_rna2.fa.AG_conversion.fa
annodir=your_dir
baseanno=${annodir}/GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.noredundance.base
anno=${annodir}/GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2
outputdir=your_dir
tooldir=/tool_dir/GLORI -tools
prx=your_prefix
file=your_cleaned_reads

4.2 Call m6A sites annotated the with genes

  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -b $baseanno -pre ${prx} -o $outputdir --combine --rvs_fac

4.3 Call m6A sites without annotated genes.

  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --rvs_fac

  • In this situation, the background for each m6A sites is the overall conversion rate.

  • The site list obtained by the above two methods is basically the similar, and there may be a few differential sites in the list.

4.4 mapping with samples without GLORI treatment

  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --untreated

4.5 call all the annotated A cites

  1. with annotation:
  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -b $baseanno -pre ${prx} -o $outputdir --combine --rvs_fac -c 1 -C 0 -r 0 -p 1.1 -adp 1.1 -s 0
  1. or without annotation
  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --rvs_fac -c 1 -C 0 -r 0 -p 1.1 -adp 1.1 -s 0

5 Resultes:

5.1 Output files of GLORI

Output files Interpretation
${your_prefix}_merged.sorted.bam The overall mapping of reads in .bam format
${your_prefix}_referbase.mpi The text pileup output from .bam files
${your_prefix}.totalCR.txt The text file containing the median value of the overall A-to-G conversion rate for each transcriptome and gene
${your_prefix}.totalm6A.FDR.csv The final list of m6A sites obtained, with the A rate serving as the m6A level

5.2 GLORI sites files

Columns Interpretation
Chr chromosome
Sites genomic loci
Strand strand
Gene annotated gene
CR conversion rate for genes
AGcov reads coverage with A and G
Acov reads coverage with A
Genecov mean coverage for the whole gene
Ratio A rate for the sites/ or methylation level for the sites
Pvalue test for A rate based on the background
P_adjust FDR ajusted P value

5.3 GLORI conversion rate (.totalCR) files

Columns Interpretation
SA chromosomes or genes
A-to-G_ratio A-to-G conversion rate for each chromosome and gene

Maintainers and Contributing

Licences

  • Released under MIT license

Citation

please cite: Liu C., Sun H., Yi Y., Shen W., Li K., Xiao Y., et al. (2022). Absolute quantification of single-base m6A methylation in the mammalian transcriptome using GLORI. Nat. Biotechnol. (https://www.nature.com/articles/s41587-022-01487-9)