/The-shadow-of-HIV

🦠Research project in Bioinformatics Institute 2023-2024



The shadow of HIV: searching for indirect signs of HIV infection in cell-free DNA samples

Authors

  • Ilia Popov, MD
  • Daria Nekrasova
  • Dorzhi Badmadashiev

Supervisors

  • Alisa Morshneva
  • Polina Kozyulina

Introduction

Overall we had:

  • 39 HIV+ samples (IonTorrent)
  • 754 HIV- samples (IonTorrent)
  • 54 HIV- samples (BGI)

Cell-free DNA is quite exotic data to analyze, especially from a microbiological point of view, which is why the thresholds used are not very strict.
The first two steps of the study, "Unmapped reads extraction" and "Assigning taxonomic labels", were run on the server.
All further steps, which involved data analysis, were performed locally.
Every step was performed in the HIV_shadow conda environment.

Pipeline

Overview

Figure 1. The whole pipeline overview.

Unmapped reads extraction

IonTorrent samples

IonTorrent samples were already mapped to the human genome, and the files were provided in .bam format. Unmapped reads were extracted using samtools v.1.20 [1].
See the Snakefiles/Snakefile_IonTorrent file for details.
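
The extraction step can be sketched as two samtools calls. This is only an illustration and not a copy of the Snakefile: the helper name and the file names are hypothetical, and it assumes that unmapped reads are selected with the SAM flag filter `-f 4` and then converted to FASTQ.

```python
import shlex


def unmapped_extraction_commands(bam_path, unmapped_bam, unmapped_fastq):
    """Return two shell command strings; nothing is executed here."""
    # -b writes BAM output, -f 4 keeps only reads with the "unmapped" flag set
    view_cmd = shlex.join(
        ["samtools", "view", "-b", "-f", "4", "-o", unmapped_bam, bam_path]
    )
    # samtools fastq writes to stdout by default, so the output is redirected
    fastq_cmd = (
        shlex.join(["samtools", "fastq", unmapped_bam])
        + " > "
        + shlex.quote(unmapped_fastq)
    )
    return view_cmd, fastq_cmd


# Hypothetical file names, used only to show the resulting commands
for cmd in unmapped_extraction_commands(
    "sample.bam", "sample_unmapped.bam", "sample_unmapped.fastq"
):
    print(cmd)
```

Building the commands as strings keeps the sketch runnable without samtools installed; in a Snakemake rule the same two commands would sit in the `shell:` section.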

BGI samples

BGI samples were provided in raw .fastq.gz format. They were mapped to the human genome (hg19, NCBI build 37 [2]) using bowtie2 v.2.5.3 [3]. Unmapped reads were then extracted using samtools v.1.20 [1].
See the Snakefiles/Snakefile_BGI file for details.

Assigning taxonomic labels

Taxonomic identification was performed with kraken2 v.2.1.3 [4], using the full PlusPF database (77 GB) [5] with a confidence threshold of 0.6.

Excerpt from the Snakefiles with kraken2 parameters:
rule kraken:
    input:
        fastq="fastq_BGI/{sample}_unmapped.fastq",
        db="/path/to/kraken2_db" #enter path to db
    output:
        report = "kraken_report_BGI/{sample}_kraken_report.txt",
        out = "kraken_output_BGI/{sample}_kraken_output.txt"
    shell:
        """
        kraken2 --db {input.db} --output {output.out} \
        --report {output.report} --confidence 0.60 {input.fastq}
        """

Creating residual virus and microbiome profiles of two datasets

Metadata

All sample names (both IonTorrent and BGI) follow the pattern "YYYYMMDD_ID", and the samples were sorted into separate directories by group (e.g. HIV & CTRL).
metadata.csv was generated with the scripts/create_metadata.py script.

Excerpt from the laboratory journal:
# Usage
# {path_to_script} {path_to_HIV_samples} {path_to_ctrl_samples} {output_file_name}
%run scripts/create_metadata.py HIV/ CTRL/ metadata.csv
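
How a metadata table can be derived from the naming pattern and the directory layout is sketched below. This is not the code of scripts/create_metadata.py: the function names, the column names `sample_id` and `collection_date`, and the group labels are assumptions made for illustration (only `HIV_status` appears elsewhere in this README).

```python
from datetime import datetime
from pathlib import Path


def parse_sample_name(name):
    """Split 'YYYYMMDD_ID' into an ISO date string and the sample ID."""
    date_part, sample_id = name.split("_", 1)
    date = datetime.strptime(date_part, "%Y%m%d").date().isoformat()
    return date, sample_id


def build_metadata(group_dirs):
    """group_dirs maps a group label to a directory holding that group's files."""
    rows = []
    for status, directory in group_dirs.items():
        for path in sorted(Path(directory).iterdir()):
            name = path.name.split(".")[0]  # drop extensions such as .bam
            date, _ = parse_sample_name(name)
            rows.append(
                {"sample_id": name, "collection_date": date, "HIV_status": status}
            )
    return rows


# Hypothetical sample name, used only to show the parsing step
print(parse_sample_name("20220303_S001"))
```

The resulting rows could be written out with `csv.DictWriter`; keeping the collection date as its own column is what makes the date-based contamination filter possible later on.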

Counts

6 counts.csv files (from species to phylum level) were parsed from the kraken2 reports using KrakenTools v.1.2 [6].
Filtering of possible contamination was performed at this step.
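
To show what is being parsed, here is a minimal sketch that pulls per-taxon read counts at one rank out of a standard kraken2 report (six tab-separated columns: percentage, clade reads, directly assigned reads, rank code, NCBI taxid, indented name). It is an illustrative stand-in and not the KrakenTools workflow used in the project; the function name and the toy report, including its taxids, are made up.

```python
def counts_at_rank(report_text, rank_code="S"):
    """Return {taxon name: clade read count} for lines with the given rank code."""
    counts = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        _pct, clade_reads, _direct, rank, _taxid, name = fields[:6]
        if rank == rank_code:
            counts[name.strip()] = int(clade_reads)
    return counts


# Toy report: all numbers and taxids are placeholders, not real data
TOY_REPORT = "\n".join(
    [
        " 40.00\t40\t40\tU\t0\tunclassified",
        " 60.00\t60\t0\tR\t1\troot",
        " 30.00\t30\t0\tG\t1001\t      Ralstonia",
        " 20.00\t20\t20\tS\t1002\t        Ralstonia pickettii",
        " 10.00\t10\t10\tS\t1003\t        Ralstonia insidiosa",
    ]
)

print(counts_at_rank(TOY_REPORT, "S"))
print(counts_at_rank(TOY_REPORT, "G"))
```

Running the same function with rank codes S, G, F, O, C and P would give one count table per taxonomic level.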

Self-written scripts used:

| Script | Purpose |
| --- | --- |
| run_kreport2mpa.sh | to run KrakenTools on ~800 files at once |
| find_line.py | to locate contaminants precisely |
| delete_lines.py | to delete them |
| processing_script.py | to return sample_ids to files |
| convert2csv.py | to convert .txt files to .csv files |
| filter_possible_contaminants.py | to filter contaminants based on the date criteria |

Table 1. Scripts used to parse counts.csv files.

Contamination filtering criteria
The criteria for identifying and removing potential contamination in our data are based on the collection dates of the samples.

When analyzing cell-free DNA from various samples, ideally, the organisms (taxa) detected should be distributed somewhat randomly across different samples, depending on their source, environment, etc. If certain organisms appear only in samples that were collected on the same date, this pattern might suggest that those organisms weren't actually present in the samples originally but were introduced accidentally on that particular day—possibly during sample collection, processing, or handling.

Key Points:

  • Same Date, Same Taxon: If we find that a specific organism (taxon) appears exclusively in samples that were all collected on the same date, and this organism does not appear in samples from other dates, it might indicate contamination.
  • Cross-Verification: Check if this organism appears in other samples that are not from that specific date. If it doesn’t, this supports the contamination theory.
  • Removal of Suspected Data: To ensure the integrity of data analysis, these suspected contaminated data points should be removed before performing further analysis.

Because of its limitations, this filtering is performed only at the species level: we can filter out Klebsiella variicola, which was found only on 2022/03/03, but we cannot remove the whole Klebsiella genus.
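
The date criterion described above can be sketched as follows. This is an assumption-laden illustration rather than the code of filter_possible_contaminants.py: the function names and data layout are hypothetical, and the toy counts are invented.

```python
from collections import defaultdict


def single_date_taxa(counts, sample_dates):
    """Return taxa whose non-zero counts all come from samples of one date.

    counts: {sample_id: {taxon: read_count}}
    sample_dates: {sample_id: collection date string}
    """
    dates_per_taxon = defaultdict(set)
    for sample_id, taxa in counts.items():
        for taxon, n in taxa.items():
            if n > 0:
                dates_per_taxon[taxon].add(sample_dates[sample_id])
    return {t for t, dates in dates_per_taxon.items() if len(dates) == 1}


def drop_taxa(counts, taxa_to_drop):
    """Return a copy of the count table without the flagged taxa."""
    return {
        sample_id: {t: n for t, n in taxa.items() if t not in taxa_to_drop}
        for sample_id, taxa in counts.items()
    }


# Toy data: sample IDs and counts are invented for illustration
dates = {"s1": "2022/03/03", "s2": "2022/03/03", "s3": "2022/04/10"}
counts = {
    "s1": {"Klebsiella variicola": 5, "Ralstonia pickettii": 2},
    "s2": {"Klebsiella variicola": 3, "Ralstonia pickettii": 0},
    "s3": {"Klebsiella variicola": 0, "Ralstonia pickettii": 7},
}
suspects = single_date_taxa(counts, dates)
filtered = drop_taxa(counts, suspects)
print(suspects)
```

Note that a taxon detected in a single sample is flagged as well, because one sample always has exactly one date; whether that is desirable depends on the dataset.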

In addition, the following taxa were weeded out of the data:

  • Cutibacterium acnes
  • All bacteriophages

Finding the differences in exogenous DNA composition between HIV- and HIV+ NIPT samples

Differential abundance

To find associations between clinical metadata and microbial meta-omics features, MaAsLin2 v.1.7.3 [7] was used.
See the scripts/MaAsLin2.R script for details.

MaAsLin2 launch parameters:
fit_data = Maaslin2(input_data     = counts, 
                    input_metadata = metadata, 
                    min_prevalence = 0.01,
                    normalization  = "TSS",
                    output         = "MaAsLin2_results",
                    analysis_method = "LM",
                    max_significance = 0.05,
                    correction = "BH",
                    plot_heatmap = TRUE,
                    plot_scatter = TRUE,
                    fixed_effects  = c("HIV_status"))
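
Two of the parameters above are easy to illustrate in isolation: `normalization = "TSS"` (total sum scaling, each count divided by the sample total) and `correction = "BH"` (Benjamini-Hochberg adjustment of p-values). The sketch below shows the textbook form of both; it is not MaAsLin2's internal code, and the function names and toy values are ours.

```python
def tss_normalize(sample_counts):
    """Total sum scaling: divide each taxon count by the sample's total."""
    total = sum(sample_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in sample_counts}
    return {taxon: n / total for taxon, n in sample_counts.items()}


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy values, invented for illustration
print(tss_normalize({"taxon_a": 3, "taxon_b": 1}))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

With `max_significance = 0.05`, a feature would be reported as significant when its adjusted value is at or below 0.05.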

MaAsLin2 results were visualized as a volcano plot with the Volcano_plot/volcano.R script.

Reasons for choosing a volcano plot instead of a heatmap:

  1. A volcano plot shows 2 metrics at once: log2 fold change (log2fc) and p-value.
  2. We only have 2 groups, HIV+ and HIV-. A heatmap is useful when more groups are displayed, while a volcano plot suits a two-group comparison.
  3. A volcano plot is the classic way of displaying differential abundance data.
  4. Aesthetics: MaAsLin2 found ~40 statistically significant taxa, so a heatmap would be too tall or too wide (depending on configuration).

Relative abundance

Mean relative abundance barplots were drawn to show the relative percentage of each taxon in samples from the HIV+ and HIV- groups. The visualization was made with the scripts/Bar_plot.R script.

Excerpt from the laboratory journal:
# Usage
# {path_to_script} {path_to_metadata} {path_to_counts_species} {path_to_counts_genus} {path_to_counts_family} {path_to_counts_order} {path_to_counts_class} {path_to_counts_phylum}
! Rscript scripts/Bar_plot.R metadata.csv counts/counts_species_filtered.csv counts/counts_genus.csv counts/counts_family.csv counts/counts_order.csv counts/counts_class.csv counts/counts_phylum.csv
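
The quantity shown in such barplots can be sketched as follows: each sample is converted to relative abundances, and these are averaged within each group. This is an illustration of the idea, not the logic of scripts/Bar_plot.R; the function name, data layout, and toy counts are assumptions.

```python
def mean_relative_abundance(counts, groups):
    """Return {group: {taxon: mean relative abundance in percent}}.

    counts: {sample_id: {taxon: read_count}}
    groups: {sample_id: group label}
    """
    sums = {}   # {group: {taxon: summed relative abundance}}
    sizes = {}  # {group: number of samples with at least one read}
    for sample_id, taxa in counts.items():
        total = sum(taxa.values())
        if total == 0:
            continue  # a sample without reads has no defined composition
        group = groups[sample_id]
        sizes[group] = sizes.get(group, 0) + 1
        group_sums = sums.setdefault(group, {})
        for taxon, n in taxa.items():
            group_sums[taxon] = group_sums.get(taxon, 0.0) + n / total
    # Dividing by the group size treats a taxon absent from a sample as 0
    return {
        g: {t: 100 * s / sizes[g] for t, s in taxa.items()}
        for g, taxa in sums.items()
    }


# Toy data, invented for illustration
toy_counts = {
    "s1": {"taxon_a": 3, "taxon_b": 1},
    "s2": {"taxon_a": 1, "taxon_b": 1},
    "s3": {"taxon_b": 4},
}
toy_groups = {"s1": "HIV+", "s2": "HIV+", "s3": "HIV-"}
print(mean_relative_abundance(toy_counts, toy_groups))
```

Averaging per-sample proportions, rather than pooling reads across a group, keeps deeply sequenced samples from dominating the group mean.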

Biodiversity

α-diversity

To measure mean species diversity in the HIV+ and HIV- groups, 3 α-diversity indices were estimated:

  • Shannon index [8]
  • Chao1 index [9]
  • Pielou index [10]

To compare the values of each index between the HIV+ and HIV- groups, the Mann-Whitney U test [11] was used.
See the scripts/Alpha_div_calculations.R & scripts/Alpha.R scripts for details.
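
For reference, the three indices can be written down in a few lines. The sketch assumes the natural logarithm for the Shannon index and the bias-corrected form of Chao1; the R scripts may use other variants, so treat this as an illustration of the definitions rather than a reproduction of the project's calculations.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def pielou(counts):
    """Pielou evenness J = H / ln(S), where S is the observed richness."""
    richness = sum(1 for n in counts if n > 0)
    if richness < 2:
        return 0.0  # evenness is undefined for one taxon; 0 is our convention
    return shannon(counts) / math.log(richness)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for n in counts if n > 0)
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Toy count vector (reads per taxon in one sample), invented for illustration
sample = [1, 1, 2, 5]
print(shannon(sample), pielou(sample), chao1(sample))
```

Chao1 depends on singletons and doubletons, so it is sensitive to the low read counts typical of cell-free DNA samples.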

β-diversity

To measure the extent of differentiation (distribution) of species according to HIV status, β-diversity was estimated with 2 metrics:

  1. Bray-Curtis dissimilarity [12]
  2. Jaccard similarity [13]

To compare the values of each metric between the HIV+ and HIV- groups, PERMANOVA [14] was used.
See the Beta_div/beta_diversity.R script for details.

Rarefaction criteria:

Bray-Curtis dissimilarity

bray <- avgdist(taxon_counts, dmethod="bray", sample=10)%>%
  as.matrix()%>%
  as_tibble(rownames = "sample_id")

Jaccard similarity

jaccard <- avgdist(taxon_counts, dmethod="jaccard", sample=10)%>%
  as.matrix()%>%
  as_tibble(rownames = "sample_id")
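
What the two `avgdist` calls compute can be sketched in plain Python: each sample is repeatedly subsampled to a fixed depth (here 10 reads, matching `sample=10`), a distance is computed on the subsamples, and the distances are averaged. The sketch assumes vegan's documented default of 100 iterations and its quantitative Jaccard form 2B/(1+B), where B is the Bray-Curtis dissimilarity; the function names and the seed are ours, and this is not a reimplementation of vegan.

```python
import random


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def jaccard_from_bray(bc):
    """Quantitative Jaccard dissimilarity derived from Bray-Curtis."""
    return 2 * bc / (1 + bc)


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from a count vector."""
    pool = [i for i, n in enumerate(counts) for _ in range(n)]
    subsample = [0] * len(counts)
    for i in rng.sample(pool, depth):  # raises ValueError if depth > reads
        subsample[i] += 1
    return subsample


def average_bray_curtis(a, b, depth=10, iterations=100, seed=0):
    """Mean Bray-Curtis dissimilarity over repeated rarefactions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        total += bray_curtis(rarefy(a, depth, rng), rarefy(b, depth, rng))
    return total / iterations


print(bray_curtis([3, 1], [1, 3]))
print(jaccard_from_bray(bray_curtis([3, 1], [1, 3])))
```

A depth of 10 reads is very shallow, which is in line with the relaxed thresholds mentioned in the introduction; in this sketch a sample with fewer reads than the depth raises an error instead of being handled silently.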

Core microbiota

The script scripts/core_microbiota_HIV.py was used to draw the core microbiota graphs.

Results

Overview

Figure 2. Main results overview.

Counts distribution

Counts distribution graphs were made with the scripts/describe.py script.

Excerpt from the laboratory journal:
# Usage
# {path_to_script} {path_to_input_file} {taxonomic_level}
%run scripts/describe.py "counts/counts_species_filtered.csv" Species
Species Genus Family Order Class Phylum

Table 2. Counts distribution on every taxonomic level.

It can clearly be seen that the distribution is shifted to the right in all cases.

Differential abundance

Figure 3. Volcano plot with differential bacterial abundance.

Relative abundance

Figure 4. Mean Relative Abundance from species to phylum level.

α-diversity

| Index | M-W p-value |
| --- | --- |
| Shannon | <0.001 |
| Chao1 | <0.001 |
| Pielou | <0.001 |

Table 3. α-diversity metrics.

Figure 5. α-diversity visualization.

β-diversity

| Index | PERMANOVA p-value |
| --- | --- |
| Bray-Curtis dissimilarity | <0.001 |
| Jaccard similarity | <0.001 |

Table 4. β-diversity comparison between HIV+ and HIV- groups.

Figure 6. β-diversity visualization. A - Bray-Curtis dissimilarity. B - Jaccard similarity.

Core microbiota

HIV+ HIV-

Table 5. Core microbiota for HIV+ and HIV- groups.

Summary

| Taxon | Real world data | Reference |
| --- | --- | --- |
| Bradyrhizobium sp. BTAi1 | HIV infection and subsequent antiretroviral therapy can lead to an enrichment of Bradyrhizobium in the oral microbiome | [15], [16], [17] |
| Ralstonia insidiosa | HIV infection is associated with overgrowth of opportunistic pathogens including Ralstonia in the gut | [16], [17], [18] |
| Stenotrophomonas maltophilia | HIV infection is associated with the occurrence of opportunistic infections including Stenotrophomonas maltophilia | [19], [20] |
| Herbaspirillum huttiense | HIV-related immunosuppression can lead to opportunistic infections, including infections by Herbaspirillum | [21], [22] |
| Ralstonia pickettii | HIV-related immunosuppression can lead to infections by unusual pathogens like Ralstonia pickettii | [16], [17], [18], [23] |
| Microbacterium sp. Y-01 | HIV can compromise the immune system, increasing susceptibility to infections by less common bacteria, including Microbacterium | [23] |

Table 6. The Shadow of HIV itself.

Footnotes

  1. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  2. Homo sapiens genome assembly GRCh37. NCBI https://www.ncbi.nlm.nih.gov/data-hub/assembly/GCF_000001405.13/.

  3. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  4. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).

  5. PlusPF. https://genome-idx.s3.amazonaws.com/kraken/pluspf_20240112/inspect.txt.

  6. Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).

  7. Mallick, H. et al. Multivariable association discovery in population-scale meta-omics studies. PLOS Comput. Biol. 17, e1009442 (2021).

  8. Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).

  9. Chao, A. & Bunge, J. Estimating the Number of Species in a Stochastic Abundance Model. Biometrics 58, 531–539 (2002).

  10. Pielou, E. C. The measurement of diversity in different types of biological collections. J. Theor. Biol. 13, 131–144 (1966).

  11. Mann, H. B. & Whitney, D. R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Stat. 18, 50–60 (1947).

  12. Bray, J. R. & Curtis, J. T. An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol. Monogr. 27, 325–349 (1957).

  13. Jaccard, P. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bull. Société Vaudoise Sci. Nat. 37, 547 (1901).

  14. Anderson, M. J. Permutational Multivariate Analysis of Variance (PERMANOVA). in Wiley StatsRef: Statistics Reference Online 1–15 (John Wiley & Sons, Ltd, 2017). doi:10.1002/9781118445112.stat07841.

  15. Li, S. et al. Alteration in Oral Microbiome Among Men Who Have Sex With Men With Acute and Chronic HIV Infection on Antiretroviral Therapy. Front. Cell. Infect. Microbiol. 11, 695515 (2021).

  16. Yang, L. et al. HIV-induced immunosuppression is associated with colonization of the proximal gut by environmental bacteria. AIDS Lond. Engl. 30, 19–29 (2016).

  17. Saxena, D. et al. Modulation of the orodigestive tract microbiome in HIV-infected patients. Oral Dis. 22 Suppl 1, 73–78 (2016).

  18. Lu, X. et al. Gut Microbiome Alterations in Men Who Have Sex with Men-a Preliminary Report. Curr. HIV Res. (2022) doi:10.2174/1570162X20666220908105918.

  19. Saeed, N. K., Farid, E. & Jamsheer, A. E. Prevalence of opportunistic infections in HIV-positive patients in Bahrain: a four-year review (2009-2013). J. Infect. Dev. Ctries. 9, 60–69 (2015).

  20. Brito, L. C. N. et al. Microbiologic profile of endodontic infections from HIV- and HIV+ patients using multiple-displacement amplification and checkerboard DNA-DNA hybridization. Oral Dis. 18, 558–567 (2012).

  21. Özen, S. et al. Catheter-related Infections in Pediatric Patients Due to a Rare Pathogen: Herbaspirillum huttiense. Pediatr. Infect. Dis. J. (2024) doi:10.1097/INF.0000000000004350.

  22. Ruiz de Villa, A., Alok, A., Oyetoran, A. E. & Fabara, S. P. Septic Shock and Bacteremia Secondary to Herbaspirillum huttiense: A Case Report and Review of Literature. Cureus 15, e36155 (2023).

  23. Wang, J., Song, Y., Liu, S., Jang, X. & Zhang, L. Persistent bacteremia caused by Ralstonia pickettii and Microbacterium: a case report. BMC Infect. Dis. 24, 327 (2024).