In silico analysis

Goal: Identify features in the metagenome and 16S data that can aid the design of the interspecific expression system.

Investigate the presence of antibiotic resistance genes
Assess the microbial species abundance
Search for estrogen degradation genes in the MAGs

Useful information:

The manual for using the BioServers is attached to this experiment.

Antibiotic resistance genes (ARGs) identification in WWTP samples:

The Comprehensive Antibiotic Resistance Database (CARD; https://card.mcmaster.ca)

ResFinder.
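
Both of these databases are bundled with abricate (set up further down in this entry), so the same tool can screen the MAGs against either of them. A minimal sketch, assuming the MAG assemblies sit in a local bins/ directory (placeholder path, not a file location from this experiment):

abricate --list                                                          # list the databases that ship with abricate (includes card and resfinder)
abricate --db resfinder --threads 12 bins/*.fna > resfinder_output.tab   # same screen as below, but against the ResFinder database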

Estrogen degradation pathway:

oecA: https://www.genome.jp/dbget-bin/www_bget?spkc:KC8_09390
oecB: https://www.genome.jp/dbget-bin/www_bget?spkc:KC8_16650
oecC: https://www.genome.jp/dbget-bin/www_bget?spkc:KC8_05325

Proteins of 24,000 bacterial genomes in GTDB:

https://data.ace.uq.edu.au/public/misc_downloads/refinem/
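
If a local copy of these protein sets is needed on the server, the directory can be mirrored, for example with wget. This is only a sketch and assumes the page exposes a plain directory index:

wget -r -np -nd -P gtdb_proteins/ https://data.ace.uq.edu.au/public/misc_downloads/refinem/   # recursively fetch all files into gtdb_proteins/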

Tools (see the pipeline sketch after this list for how they fit together):

Diamond: http://www.diamondsearch.org/index.php

MAFFT: https://mafft.cbrc.jp/alignment/software/

trimAl: http://trimal.cgenomics.org/

FastTree: http://www.microbesonline.org/fasttree/

IQ-TREE: http://www.iqtree.org/

GeneTreeTk: https://github.com/dparks1134/GeneTreeTk
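
A minimal sketch of how these tools could be chained to look for oecA/B/C homologues in the MAG (or GTDB) proteins and to build a gene tree. The file names oec_ABC.faa (query proteins) and mag_proteins.faa (predicted MAG proteins) are placeholders, not files from this experiment:

diamond makedb --in mag_proteins.faa -d mag_proteins                                                   # build a Diamond protein database
diamond blastp -d mag_proteins -q oec_ABC.faa -o oec_hits.tsv --outfmt 6 --evalue 1e-10 --threads 12   # search for oec homologues
# extract the hit sequences into oec_hits.faa (e.g. with a small script), then:
mafft --auto oec_hits.faa > oec_hits.aln                              # align the homologues
trimal -in oec_hits.aln -out oec_hits.trim.aln -automated1            # remove poorly aligned columns
FastTree oec_hits.trim.aln > oec_hits.tree                            # quick approximate ML tree (IQ-TREE for a more thorough run)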

Load Miniconda and set the Conda environments directory:

module load Miniconda3
mkdir software
mkdir software/condaenvs
conda config --append envs_dirs /srv/PHN/users/lcv/software/condaenvs

Set up abricate:

conda init bash
conda create --name abricate    # the environment has to exist before it can be activated
conda activate abricate
conda install -c conda-forge -c bioconda -c defaults abricate

Use abricate in a screen session:

screen -S abricate   # start a new screen session called abricate
screen -r            # reattach a detached screen session

Start abricate:

abricate --db card --threads 12 /srv/PHN/users/cms/for_colleagues/Laura/bins/*.fna > output.tab   # screen all MAG .fna files against CARD; the redirect writes the results to output.tab

Using pandas to combine the MAG data from Caitlin with the results from abricate:

Import both the MAG data and the output.tab data from abricate:

import pandas as pd

df = pd.read_csv('MAG.statistics.STABLEX.20200213.tsv', delimiter='\t')
abr = pd.read_csv('output.tab', delimiter='\t')
print(abr)

Trim the MAG filenames in the abr and df datasets:

# remove the directory prefix and the .fna extension (lstrip/rstrip would strip character sets, not substrings)
abr['#FILE'] = abr['#FILE'].str.replace(r'^/srv/PHN/users/cms/for_colleagues/Laura/bins/', '', regex=True)
abr['#FILE'] = abr['#FILE'].str.replace(r'\.fna$', '', regex=True)

df['MAG'] = df['MAG'].str.replace(r'\.fa$', '', regex=True)   # drop the .fa extension

Rename first column of abr to match df:

abr.rename(columns={'#FILE':'MAG'}, inplace=True)

Set the index to the first column (MAG). This allows the use of the join function (for more information see: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html):

df = df.set_index('MAG')
abr = abr.set_index('MAG')

dfabr2 = df.join(abr, how='outer')

Make a summary of the results and filter for the Aalborg West WWTP only:

sumar2 = dfabr2[['max_abund','GTDBTax','MiDAS3_6Tax','RESISTANCE']]

AAV = sumar2.filter(like='AalW', axis=0)

Export data:

dfabr2.to_excel("outputdfabr2.xlsx") sumar2.to_excel("outputsumar2.xlsx") AAV.to_excel("AAV.xlsx")

Use the R script for ampvis2 to obtain abundance data from the WWTPs:

Adjust the folder paths:

setwd("/Users/Maarten/R") usearch_otutable <- amp_import_usearch(otutab = "/Users/Maarten/R/ASVtable.tsv", sintax = "/Users/Maarten/R/ASVs.R1.midas35.sintax") #from Kasper: use this way of loading metadata for PCA to work

library(openxlsx)
metadata <- openxlsx::read.xlsx("/Users/Maarten/R/metadata_BioBank_2019-11-27.xlsx", detectDates = TRUE)

Script to match the species names from the replication origin table with the species list from the ARB tree:

import pandas as pd
import re
import numpy as np
from Bio import AlignIO

# read the GTDB-Tk multiple sequence alignment and the replication origin table
alignment = AlignIO.read("gtdbtk.bac120.msa.fasta", "fasta")
edited = alignment[1, :]
repori = pd.read_csv('rep_origins.csv')
repori['wordcount'] = repori['Species'].str.count(' ') + 1

# collect the description line of every alignment record (starting from the second record)
length = len(alignment)
results = []
for i in range(1, length):
    table = alignment[i, :]
    desc = table.description
    results.append(desc)

#print(edited)
#dir(edited)
#edited.description

df = pd.DataFrame(results, columns=['description'])
df[['ID', 'description']] = df["description"].str.split(" ", n=1, expand=True)

# split off the species name:
df['species'] = df['description'].str.split('s__').str[1]

# make sure there are no NaN values (apparently needed to let the script run):
df = df.fillna('maarten')

#df[df['species'].str.contains('Pseudomonas')]

# match the species names in repori to the list from the MSA:
df['MaartenID'] = '0'
for i in range(0, len(repori)):
    repspec = repori.iloc[i, 1]   # species name
    rep_ID = repori.iloc[i, 0]    # replication origin ID
    if repori.iloc[i, 2] == 1:    # single-word name: substring match
        df['MaartenID'] = np.where(df['species'].str.contains(repspec), rep_ID, df['MaartenID'])
    else:                         # multi-word name: match from the start of the string
        df['MaartenID'] = np.where(df['species'].str.match(repspec), rep_ID, df['MaartenID'])

#df[df['species'].str.contains('Xanthomonas')]

df = df.replace({'0': None})   # turn the remaining '0' placeholders into None