wossvc_hiv

Pipeline for detecting clusters of cases in HIV data from WoSSVC

Install

git clone https://github.com/josephhughes/wossvc_hiv.git
cd wossvc_hiv
conda env create -f environment.yml
conda activate wossvc_hiv

The pipeline

Processing the sequences from a run

Query sierrapy to obtain the subtype assignment and drug resistance information.
Add the sequences to an alignment with a reference set of sequences using mafft.
Reconstruct the phylogeny using raxml.
Plot the phylogeny using ete3.
Produce a run report as a Word document.
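The alignment step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: --add and --keeplength are standard mafft options, but whether the pipeline uses exactly these is an assumption, and the function and file names are hypothetical.

```python
import subprocess


def mafft_add_command(new_fasta, reference_alignment):
    """Build a mafft command that adds new sequences to an existing alignment.

    --add inserts the new sequences; --keeplength keeps the reference
    alignment length unchanged. Using these two options is an assumption.
    """
    return ["mafft", "--add", new_fasta, "--keeplength", reference_alignment]


def run_mafft_add(new_fasta, reference_alignment, output_fasta):
    """Run mafft and save the combined alignment, which mafft prints to stdout."""
    command = mafft_add_command(new_fasta, reference_alignment)
    with open(output_fasta, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(mafft_add_command("run_seqs.fa", "reference_alignment.fa")))
```

Keeping the command builder separate from the runner makes it possible to inspect the command without mafft being installed.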

Processing the subtype C sequences through hivtrace [optional]

Filter subtype C sequences.
Run hivtrace.
Obtain cluster assignment from hivnetworkcsv.
Compare the new cluster assignments with the previous ones to identify changes.
Add the clustering information to the report.
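The comparison of cluster assignments between runs could look something like the sketch below. It is an illustration only: the function name, the dictionary inputs and the three change categories are assumptions, and it presumes cluster IDs are comparable between runs, which the real hivtrace output may not guarantee.

```python
def compare_clusters(previous, current):
    """Classify each sequence in `current` by how its cluster changed.

    Both arguments map sequence ID -> cluster ID (None if unclustered).
    Returns a dict with three sorted lists of sequence IDs:
    "new" (absent from the previous run), "unchanged" and "moved".
    """
    changes = {"new": [], "unchanged": [], "moved": []}
    for seq_id in sorted(current):
        if seq_id not in previous:
            changes["new"].append(seq_id)
        elif previous[seq_id] == current[seq_id]:
            changes["unchanged"].append(seq_id)
        else:
            changes["moved"].append(seq_id)
    return changes


if __name__ == "__main__":
    # Hypothetical sequence IDs and cluster numbers.
    previous = {"s1": 1, "s2": 1, "s3": None}
    current = {"s1": 1, "s2": 2, "s3": None, "s4": 2}
    print(compare_clusters(previous, current))
```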

Running the pipeline

run_HIV_analysis_pipeline.sh -- program to assign subtypes, detect DRMs and conduct hivtrace clustering:
        -h      show this help text
        -f      input sequences in fasta format
        -t      date for the output folder, in dd-mm-yyyy format
        -d      [OPTIONAL] path to the database of fasta sequences previously processed by hivtrace; provide it to enable hivtrace clustering
        -r      [OPTIONAL] path to the registry for the sequences in that database; only needed when -d is provided
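Because the -t value is reused in output names, it can help to check it before starting a run. The helper below is a hypothetical pre-flight check, not part of the pipeline; it only assumes that the date must be a real calendar date written as dd-mm-yyyy.

```python
from datetime import datetime


def parse_run_date(text):
    """Return the date for a dd-mm-yyyy string, or raise ValueError.

    strptime rejects impossible dates such as 31-02-2019 and other layouts
    such as yyyy-mm-dd. The round-trip check also rejects unpadded values
    such as 1-1-2019, which strptime alone would accept.
    """
    parsed = datetime.strptime(text, "%d-%m-%Y").date()
    if parsed.strftime("%d-%m-%Y") != text:
        raise ValueError("date must be zero-padded dd-mm-yyyy: " + text)
    return parsed


if __name__ == "__main__":
    print(parse_run_date("01-01-2019"))
```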

Example:

./scripts/run_HIV_analysis_pipeline.sh -f test_data/2018_seqs.fa -r test_data/Pre2018_registry.txt -d test_data/Pre2018_seqs.fa -t 01-01-2019

2018_seqs.fa contains the sequences to be processed, here a set of sequences from 2018.

Pre2018_registry.txt is a tab-delimited text file of previously clustered sequences. This file represents samples that are from before 2018 and have been assigned as subtype C.

Pre2018_seqs.fa is the fasta file containing the sequences listed in the registry.

01-01-2019 is the date used to name the output files and the date added to the registry.

Follow-on example:

./scripts/run_HIV_analysis_pipeline.sh -f test_data/2019_seqs.fa -r Results_01-01-2019/01-01-2019.registry.txt -d Results_01-01-2019/01-01-2019_sequence_db.fa -t 01-01-2020

Here, the registry and compiled sequence file from the previous run are used as input, so that the new run extends both.
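The chaining of runs can be sketched in code. The function below is hypothetical and simply rebuilds the follow-on command from the previous run date, assuming the output naming seen in the example above (Results_<date>/<date>.registry.txt and Results_<date>/<date>_sequence_db.fa).

```python
def follow_on_command(new_fasta, previous_date, new_date,
                      script="./scripts/run_HIV_analysis_pipeline.sh"):
    """Build the argument list for a run that extends the previous run.

    Dates are dd-mm-yyyy strings. The registry and sequence database paths
    are derived from the previous run's results folder.
    """
    results_dir = "Results_{}".format(previous_date)
    registry = "{}/{}.registry.txt".format(results_dir, previous_date)
    sequence_db = "{}/{}_sequence_db.fa".format(results_dir, previous_date)
    return [script, "-f", new_fasta, "-r", registry,
            "-d", sequence_db, "-t", new_date]


if __name__ == "__main__":
    command = follow_on_command("test_data/2019_seqs.fa", "01-01-2019", "01-01-2020")
    print(" ".join(command))
```

The returned list can be passed to subprocess.run if the run is to be launched from Python.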

Outputs

Results_dd-mm-yyyy

dd-mm-yyyy_RunOverview.docx

This is a report for the run, which contains the number of sequences assigned to each subtype, the phylogeny for the sequences in the run, and the results of the clustering.
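As an illustration of the per-subtype counts in the report, the sketch below tallies subtypes from a mapping of sequence ID to subtype. How the pipeline actually stores the sierrapy assignments is not described here, so the input structure, function names and sequence IDs are assumptions.

```python
from collections import Counter


def count_subtypes(assignments):
    """Count sequences per subtype; `assignments` maps sequence ID -> subtype."""
    return Counter(assignments.values())


def format_counts(counts):
    """Return one tab-separated 'subtype<TAB>count' line per subtype, sorted by name."""
    return ["{}\t{}".format(subtype, counts[subtype]) for subtype in sorted(counts)]


if __name__ == "__main__":
    # Hypothetical sequence IDs and subtype labels.
    example = {"seq1": "C", "seq2": "B", "seq3": "C"}
    for line in format_counts(count_subtypes(example)):
        print(line)
```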

dd-mm-yyyy_DRM-overview.txt

Tab-delimited text file with the resistance-associated mutations detected by sierrapy.

dd-mm-yyyy.log

Summary results of the clustering

dd-mm-yyyy.registry.txt

The registry containing both the previous and the new entries, with the date each was added (tab-delimited text). IMPORTANT: this file is the -r input for the next run.

dd-mm-yyyy_sequence_db.fa

The combined dataset of previous and new subtype C sequences. IMPORTANT: this file is the -d input for the next run.

dd-mm-yyyy_reports

Output of sierrapy formatted for Word

hivtrace_dd-mm-yyyy

Outputs from hivtrace

RAxML

Output from the phylogenetic analysis

Dependencies

wossvc_hiv uses the following tools:

mafft
raxml
conda
biopython
python3.6
python-docx
tn93
pyqt
ete3 (https://github.com/etetoolkit/ete)
sierrapy (https://github.com/hivdb/sierra-client/tree/master/python)
hivtrace (https://github.com/veg/hivtrace)