Annotate-Contigs:

This tool takes a k-mer/contig table as input and produces an annotation file with mapping information for each k-mer/contig including position, intron/exon/intergenic location, gene name, CIGAR etc. Should work with any reference genome and annoattion as input. In this pipeline, we use two alignment tools, STAR and Minimap2. STAR is used for aligning sequences that are 200 bases or shorter, while Minimap2 is used for aligning sequences greater than or equal 200 bases.

This tool was initially thought as a less restrictive alternative to DEkupl-annotation. So many of its aspects are similar.

Usage
Installation
- Option 1: With singularity
- Option 2: From source
Configuration
Output files
Ontology

Usage

In order to run the tool, you will need at least 4 specific files
- Input file : A tsv/csv file with at least 2 named columns. One contains the sequences you want to annotate, and each sequence must have a unique identifier. Typical input files are Dekupl-run/Kamrat outputs (example available in data/input_table.tsv).
- Config file : As the pipeline is designed with snakemake, any run requires a configuration file. See below for specifications of available parameters.
- Genome and annotation files : Associated fasta and gtf files of an organism (gz). Typically downloaded from Ensembl or Gencode websites.(examples available in data/)

Installation

Clone the Repository

To download this project, use the following command:

```
git clone https://github.com/Transipedia/Annotate-contigs && cd Annotate-contigs
```

We recommand to use singularity to use the tool, or using the manual installation.

Option 1: With singularity

Step 1: Upload Singularity Image You can upload the singularity image directly from the link

wget https://zenodo.org/records/13789508/files/annotatecontig.sif?download=1 -O annotatecontig.sif

Step 2: Create your configuration file This tool is designed to work with Snakemake, which means that all user inputs must be defined in a configuration file (config.json). You can find an example of this configuration file in the repository. A comprehensive list of all parameters is provided in the following section.
Step 3: Run with mounted volumes It is advised to mount certain volumes (input/output directories). By default, a Singularity image cannot access external data. To fix this, you need to mount your directories as volumes. Using the parameter -B /store:/store tells Singularity to reference your store directory when mentioned (notably in your configuration file). It is recommended that all your input files be located in the /store directory.
```
singularity -v run -B /home:/home annotatecontig.sif -s ./Snakefile --configfile ./config.json --cores $nb_cores  
```

Option 2: From conda

Step 1: Install dependancies. Before using the tool, install the dependencies. You can install them manually using the conda environnement file annotatecontig.yml :
```
conda env create -f annotatecontig.yml
conda activate AnnotateContig
```

Step 2: Edit config file & run with Snakemake.

snakemake --configfile config.json --cores $nb_cores

Configuration :

inside the config.json you find all the parameters:

Mandatory parameters :

mode: can be either "index" or "table".

"index": Use this when running the pipeline for the first time to build the indexes. "table": Use this if the STAR and Minimap2 indexes already exist, and you only need to generate the table.
input_file: Path to the file containing sequences to annotate (supports tsv/csv, gzipped or uncompressed). Example in (data/input_file.tsv)
map_to: Name of the organism to which the tool will map your sequences.
reference: Path to the fasta.gz file used to build the index for the specified organism. Example for test reference
annotation: Path to the gtf.gz file used to build the index for the specified organism. Example for test annotation
preset: (Default : "map-ont") Adjusts internal parameters of Minimap2 (e.g., k-mer size, scoring schemes, alignment heuristics) to optimize performance and accuracy for specific data types, you can find other presets here.
minimap2_index: Path to the pre-built Minimap2 index for the organism, if previously created. if "index" mode, add ""
star_index: Path to the pre-built STAR index for the organism, if previously created. if "index" mode, add ""

The preset used for index building must be consistent. Some presets may not provide information about chimeric reads. In such cases, you may need to build the index again using a different preset.

*About the GTF

Only the "exon" features of the GTF file will be used. In order for the program to run properly, the mandatory attributes (column 9) are : "gene_id", "transcript_id", "gene_type".

Optionnal parameters (and default values) :

sequence_col: (Default :"contig"). Name of the column in input file containing the sequences to annotate.
id_col: (Default:"tag"). Name of the column in input file containing the unique identifier of the sequence.
output_dir: (Default:"./output"). Path to where the results will be generated.
keep_col: (Default:"all"). Either "all" or a list of column names you want to keep from the input file.
library_type: (Default:"rf"). Strandedness: "rf", "fr" or "unstranded".
supp_map_to: (Default:[""]). List of supplementary reference names you want to map your sequences to, with no further information (using blast).
supp_map_to_fasta: For each reference in supp_map_to, path to its fasta sequence. An exemple of a typical fasta file you could use (Human repeats from Dfam) is available in data/.

*About supplementary alignment

Any amount of supplementary alignment columns can be added to the output. For each supplementary reference provided, a single column will be added at the end of the output file specifying where the annotated sequence was aligned on this reference.

Example: with a reference of human repeats provided in this repository (data/human_repeat_ref.fasta):

"supp_map_to":["HumanRepeats"],
"supp_map_to_fasta" : ["/home/Documents/Annotate-contigs/data/human_repeat_ref.fasta"],

Example with Multiple References: If you have two supplementary references, e.g., human repeats and viral elements, the configuration would look like this:

"supp_map_to": ["HumanRepeats", "ViralElements"],
"supp_map_to_fasta": ["/home/Documents/Annotate-contigs/data/human_repeat_ref.fasta","/home/Documents/Annotate-contigs/data/viral_elements_ref.fasta" ]

Output file

Table merged_annotation.tsv, summarizing for each contig, its location on the genome (if it's aligned), the sequence alignment informations, and other optionnal alignment informations.

N.B : You will also find some intermediate files in the output folder, specifically query_lt_200.fa and query_gt_200.fa.

If query_lt_200.fa is empty, it means that all the sequences in your query have a length of less than 200 bases, so you will have empty output files from STAR.
If query_gt_200.fa is empty, it means that all the sequences in your query have a length greater than 200 bases, so you will have empty output files from Minimap2.

Annotated values

Term	Type	Description
mapped_to	Str	Reference to which the sequence was aligned
chromosome	Str	Chromosome
start	Int	Beginning of the alignment on the reference
end	Int	End of the alignment on the reference
strand	Char	Strand of the alignment (+/-). set to "." in unstranded data.
cigar	Str	CIGAR string from the SAM alignment.
nb_insertion	Int	Number of insertions in the alignment (infered from cigar)
nb_deletion	Int	Number of deletions in the alignment (infered from cigar)
nb_splice	Int	Number of splices in the alignment (infered from cigar)
nb_snv	Int	Number of SNV in the contigs (computed as the number of mismatches minus indels)
clipped_3p	Int	Number of clipped bases (soft/hard) from 3prim contig
clipped_5p	Int	Number of clipped bases (soft/hard) from 5prim contig
query_cover	Float	Fraction of the query that have been aligned to the reference
alignment_identity	Float	Fraction of exact match over the query alignment length (splices do not count)
nb_hit	Int	Number of alignment given for the contig (NH field)
nb_mismatches	Int	Number of mismatches in the alignment (NM field)
gene_id	Str	Overlapping gene ID (from GTF ID field)
gene_symbol	Str	Overlapping gene symbol (from GTF Name field)
gene_biotype	Str	Overlapping gene biotype (from GTF biotype field)
gene_strand	Char	Overlapping gene strand (+/-)
as_gene_id	Str	Overlapping antisense gene ID (from GFF ID field). Defined only when working with stranded datas.
as_gene_symbol	Str	Overlapping antisense gene symbol (from GFF Name field). Defined only when working with stranded datas.
as_gene_strand	Char	Overlapping antisense gene strand (+/-). Defined only when working with stranded datas.
as_gene_biotype	Str	Overlapping antisense gene biotype (from GFF biotype field). Defined only when working with stranded datas.
is_exonic	Bool	Overlap between an exon and the contig. Same strand if working with stranded datas, both strand otherwise.
is_intronic	Bool	Overlap between an intron and the contig. Same strand if working with stranded datas, both strand otherwise.
is_chimeric	Bool	The contig contains a chimeric junction
is_circ	Bool	The chimeric junction behaves like a circular RNA.
seg1_cj	Str	First segment of the chimeric junction
seg2_cj	Str	Second segment of the chimeric junction