This tool takes a k-mer/contig table as input and produces an annotation file with mapping information for each k-mer/contig including position, intron/exon/intergenic location, gene name, CIGAR etc. Should work with any reference genome and annoattion as input. In this pipeline, we use two alignment tools, STAR and Minimap2. STAR is used for aligning sequences that are 200 bases or shorter, while Minimap2 is used for aligning sequences greater than or equal 200 bases.
This tool was initially thought as a less restrictive alternative to DEkupl-annotation. So many of its aspects are similar.
- In order to run the tool, you will need at least 4 specific files
- Input file : A tsv/csv file with at least 2 named columns. One contains the sequences you want to annotate, and each sequence must have a unique identifier. Typical input files are Dekupl-run/Kamrat outputs (example available in data/input_table.tsv).
- Config file : As the pipeline is designed with snakemake, any run requires a configuration file. See below for specifications of available parameters.
- Genome and annotation files : Associated fasta and gtf files of an organism (gz). Typically downloaded from Ensembl or Gencode websites.(examples available in data/)
To download this project, use the following command:
```
git clone https://github.com/Transipedia/Annotate-contigs && cd Annotate-contigs
```
We recommand to use singularity to use the tool, or using the manual installation.
-
Step 1: Upload Singularity Image You can upload the singularity image directly from the link
wget https://zenodo.org/records/13789508/files/annotatecontig.sif?download=1 -O annotatecontig.sif
-
Step 2: Create your configuration file This tool is designed to work with Snakemake, which means that all user inputs must be defined in a configuration file (
config.json
). You can find an example of this configuration file in the repository. A comprehensive list of all parameters is provided in the following section. -
Step 3: Run with mounted volumes It is advised to mount certain volumes (input/output directories). By default, a Singularity image cannot access external data. To fix this, you need to mount your directories as volumes. Using the parameter
-B /store:/store
tells Singularity to reference your store directory when mentioned (notably in your configuration file). It is recommended that all your input files be located in the/store
directory.singularity -v run -B /home:/home annotatecontig.sif -s ./Snakefile --configfile ./config.json --cores $nb_cores
-
Step 1: Install dependancies. Before using the tool, install the dependencies. You can install them manually using the conda environnement file annotatecontig.yml :
conda env create -f annotatecontig.yml conda activate AnnotateContig
-
Step 2: Edit config file & run with Snakemake.
snakemake --configfile config.json --cores $nb_cores
inside the config.json
you find all the parameters:
-
mode: can be either "index" or "table".
"index": Use this when running the pipeline for the first time to build the indexes. "table": Use this if the STAR and Minimap2 indexes already exist, and you only need to generate the table.
-
input_file: Path to the file containing sequences to annotate (supports tsv/csv, gzipped or uncompressed). Example in (data/input_file.tsv)
-
map_to: Name of the organism to which the tool will map your sequences.
-
reference: Path to the fasta.gz file used to build the index for the specified organism. Example for test reference
-
annotation: Path to the gtf.gz file used to build the index for the specified organism. Example for test annotation
-
preset: (Default : "map-ont") Adjusts internal parameters of Minimap2 (e.g., k-mer size, scoring schemes, alignment heuristics) to optimize performance and accuracy for specific data types, you can find other presets here.
-
minimap2_index: Path to the pre-built Minimap2 index for the organism, if previously created. if "index" mode, add ""
-
star_index: Path to the pre-built STAR index for the organism, if previously created. if "index" mode, add ""
The preset used for index building must be consistent. Some presets may not provide information about chimeric reads. In such cases, you may need to build the index again using a different preset.
Only the "exon" features of the GTF file will be used. In order for the program to run properly, the mandatory attributes (column 9) are : "gene_id", "transcript_id", "gene_type".
-
sequence_col: (Default :"contig"). Name of the column in input file containing the sequences to annotate.
-
id_col: (Default:"tag"). Name of the column in input file containing the unique identifier of the sequence.
-
output_dir: (Default:"./output"). Path to where the results will be generated.
-
keep_col: (Default:"all"). Either "all" or a list of column names you want to keep from the input file.
-
library_type: (Default:"rf"). Strandedness: "rf", "fr" or "unstranded".
-
supp_map_to: (Default:[""]). List of supplementary reference names you want to map your sequences to, with no further information (using blast).
-
supp_map_to_fasta: For each reference in
supp_map_to
, path to its fasta sequence. An exemple of a typical fasta file you could use (Human repeats from Dfam) is available in data/.
Any amount of supplementary alignment columns can be added to the output. For each supplementary reference provided, a single column will be added at the end of the output file specifying where the annotated sequence was aligned on this reference.
Example: with a reference of human repeats provided in this repository (data/human_repeat_ref.fasta):
- "supp_map_to":["HumanRepeats"],
- "supp_map_to_fasta" : ["/home/Documents/Annotate-contigs/data/human_repeat_ref.fasta"],
Example with Multiple References: If you have two supplementary references, e.g., human repeats and viral elements, the configuration would look like this:
- "supp_map_to": ["HumanRepeats", "ViralElements"],
- "supp_map_to_fasta": ["/home/Documents/Annotate-contigs/data/human_repeat_ref.fasta","/home/Documents/Annotate-contigs/data/viral_elements_ref.fasta" ]
- Table
merged_annotation.tsv
, summarizing for each contig, its location on the genome (if it's aligned), the sequence alignment informations, and other optionnal alignment informations.
N.B : You will also find some intermediate files in the output folder, specifically query_lt_200.fa and query_gt_200.fa.
- If query_lt_200.fa is empty, it means that all the sequences in your query have a length of less than 200 bases, so you will have empty output files from STAR.
- If query_gt_200.fa is empty, it means that all the sequences in your query have a length greater than 200 bases, so you will have empty output files from Minimap2.
Term | Type | Description |
---|---|---|
mapped_to | Str | Reference to which the sequence was aligned |
chromosome | Str | Chromosome |
start | Int | Beginning of the alignment on the reference |
end | Int | End of the alignment on the reference |
strand | Char | Strand of the alignment (+/-). set to "." in unstranded data. |
cigar | Str | CIGAR string from the SAM alignment. |
nb_insertion | Int | Number of insertions in the alignment (infered from cigar) |
nb_deletion | Int | Number of deletions in the alignment (infered from cigar) |
nb_splice | Int | Number of splices in the alignment (infered from cigar) |
nb_snv | Int | Number of SNV in the contigs (computed as the number of mismatches minus indels) |
clipped_3p | Int | Number of clipped bases (soft/hard) from 3prim contig |
clipped_5p | Int | Number of clipped bases (soft/hard) from 5prim contig |
query_cover | Float | Fraction of the query that have been aligned to the reference |
alignment_identity | Float | Fraction of exact match over the query alignment length (splices do not count) |
nb_hit | Int | Number of alignment given for the contig (NH field) |
nb_mismatches | Int | Number of mismatches in the alignment (NM field) |
gene_id | Str | Overlapping gene ID (from GTF ID field) |
gene_symbol | Str | Overlapping gene symbol (from GTF Name field) |
gene_biotype | Str | Overlapping gene biotype (from GTF biotype field) |
gene_strand | Char | Overlapping gene strand (+/-) |
as_gene_id | Str | Overlapping antisense gene ID (from GFF ID field). Defined only when working with stranded datas. |
as_gene_symbol | Str | Overlapping antisense gene symbol (from GFF Name field). Defined only when working with stranded datas. |
as_gene_strand | Char | Overlapping antisense gene strand (+/-). Defined only when working with stranded datas. |
as_gene_biotype | Str | Overlapping antisense gene biotype (from GFF biotype field). Defined only when working with stranded datas. |
is_exonic | Bool | Overlap between an exon and the contig. Same strand if working with stranded datas, both strand otherwise. |
is_intronic | Bool | Overlap between an intron and the contig. Same strand if working with stranded datas, both strand otherwise. |
is_chimeric | Bool | The contig contains a chimeric junction |
is_circ | Bool | The chimeric junction behaves like a circular RNA. |
seg1_cj | Str | First segment of the chimeric junction |
seg2_cj | Str | Second segment of the chimeric junction |