I. Identifying pks and pks/nrps clusters harboring a potential self-resistance gene. ******************************************************************************** STEP1. Blast 8 KSs against 11 databases ******************************************************************************** Blast 8 diverse KSs against 11 databases and parse blast results with e value < 1 Location: TargetMining/Blast/ folder The blast queries are in Manual/ 1. update 11 databases (as of April 25, 2018, except for wgs – Jan-27-2016) cd /mnt/gnpn/gnpn/projects/orphanpks/blast/db/ Databases: nt, wgs, refseq_genomic, other_genomic, env_nt, patnt, htgs, tsa_nt, sts, gss, est_others perl update_blastdb.pl --passive [database] 2. Download Bacterial Assemblies: wget -L --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 -A genomic.gbff.gz https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ -P Assemblies/ 2. Run ./run_blast.py specify on line 49 which query sequences to blast against If you want to run blast agains all sequences: blast_query_files = glob.glob("%s/*.seq" % blast_query_dir) 2. Run make all All files stored in TargetMining/Blast/blast_results_seqs/ folder 2.1 Concatenate all parsed blast files into one: make concat creates blast_results_seqs/blast_results.parsed.KS.9f 2.2 Uniqify the proteins and remove dashes and asterisks and choose longest synonym: make longest (have to rm output files of Step #2.4) creates blast_results.parsed.KS.9f.synonyms blast_results.parsed.KS.9f.longest_synonym 2.3. Compare the Reference set to the found set: make compare creates blast_results.parsed.KS.9f.longest_synonym.comparemibig 2.4. Make fasta file of blast hit sequences to use in HMM search and clean the names: make fasta creates blast_results.KS.fasta blast_results.KS.fasta.cleanName 2.5. Remove redundant sequences (cdhit cutoff = 0.99): make cdhit creates blast_results.KS.fasta.cleanName.cdhit.99 blast_results.KS.fasta.cleanName.cdhit.99.clstr 199 894 proteins; 110174 NCBI nucleotide records/genomes ******************************************************************************** STEP2. Fetch all genbank ids that contain a KS ******************************************************************************** 1. make Genbank dir: mkdir GenBank mkdir Genbank/gbdir cp get_gbids.py from oldTargetMining cp fetch_gbids.py from oldTargetMining 2. Run ./get_gbid.py > gbids.txt cat gbids.txt |wc 162984 162984 1918063 cat gbids.txt |sort -u > gbids.unique.94516.txt cat gbids.unique.94516.txt |wc 94516 94516 1141060 3. Run fetch_gbid.py Genbank/gbdir$ lr |wc 94517 850646 6623003 ******************************************************************************** STEP3. Run Antismash on all 95k genbank ids ******************************************************************************** Antismash_gbids/ Run run_antismash.Agbids.py Step 1: BLAST search for KS homologs. Using 8 diverse KS query sequences, e-value < 1 199 894 proteins; 110174 NCBI nucleotide records/genomes Step 2: Fetch all genbank ids that contain a KS Genbank/gbdir (89449 gbids) Genbank/assembly_gb (21080 gbids) STEP3. Run Antismash on all 89k genbank ids and 21k assembly ids Some assembly gb files split into smaller files (26575 assembly files total) STEP4. Extract gene sequences from all pks clusters found by antismash cluster_genes.21k.fasta KS.21k.fasta cluster_genes.89k.fasta KS.89k.fasta 78 clusters don’t have predicted KS domains (details_data empty) cat cluster_genes.all.fasta | grep ">" |cut -d"|" -f1,2 | sort -u |wc 29987 29987 692154 STEP5. Make blast database from cluster_genes.21k.fasta and cluster_genes.89k.fasta 244 196 sequences 664 336 sequences STEP6. Blast 12 targets genes against antismash database from 110k genbank ids 4404 hits. STEP7. Filter blast hits by evalue <1e-8 and identity > 0.3 (and FabB/F identity > 0.6) 806 hits. STEP8. Require both KS+ target <10kb apart (and 5kb apart) 252 hits (and 136 hits) STEP9. add a FabF and cdhit.90 152 hits. ________________________________________________________________________________ Mine 92 putative targets, start from Step6: STEP6. Blast 92 targets genes against antismash database from 110k genbank ids 6.1. ./run_blastp.py with targets.92.fa 6.2. cat out.targets.92* > out.92 31944 hits. STEP7. Filter blast hits by evalue <1e-8 and identity > 0.3 (and "FAB" identity > 0.6) 5492 hits. STEP8. Require both KS+ target <5kb apart. Run ../Antismash_gbids/target_ks_tandem.py ./target_ks_tandem.py > KS.92.5kb.fasta ./target_ks_tandem.py > KS.92.5kb.fasta ./target_ks_tandem.py > target_ks_tandem.92.5kb.out 501 hits. ********************************************************************************** II. PHYLOGENY Step Build a phylogenetic tree and look for homologs 2. ./get_phyla 3. Add FabF 7. Make a ref set file: mibig_refset.10.mod for targrts.12 and targets.92 (new targets in non-pks or non-bacterial clusters, no fungal KS in blast step) 8. Make descr files 9. Get target names Remove redundant sequences: Run cdhit with cutoff .90 Make Multiple sequence alignment using mafft Make a phylogram using FastTree Visualize tree using APE package in R root tree with E. coli withFabF color tree by targrets ********************************************************************************* III. COEVOLUTION STEP 1. (Step 8 from Blast step) Run ../Antismash_gbids/target_ks_tandem.py Output: target_ks_tandem.5kb.out (408 hits) KS.92.5kb.fasta targets.92.5kb.fasta 2. Make blast databases for KS and targets and move to blastdb/ makeblastdb -in KS.92.5kb.fasta -dbtype prot -out KS.92.5kb makeblastdb -in targets.92.5kb.fasta -dbtype prot -out targets.92.5kb makeblastdb -in KS.92.10kb.fasta -dbtype prot -out KS.92.10kb makeblastdb -in targets.92.10kb.fasta -dbtype prot -out targets.92.10kb 3. Pairwise blastp blastp -db blastdb/KS.92.5kb -query KS.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.5kb.out blastp -db blastdb/targets.92.5kb -query targets.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.5kb.out blastp -db blastdb/KS.92.10kb -query KS.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.10kb.out blastp -db blastdb/targets.92.10kb -query targets.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.10kb.out 4. Parse pairwise blast parse_pairwise_identities.py Output file: pairwise_identities.[number of targets].[distance cutoff]kb.out pairwise_identities.92.5kb.out 5. Plot pairwise identities plot_pairwise_identities.py Output file: 12.5kb.png ________________________________________________________________________________ Mine 609 E. coli essential genes Step6: STEP6. Blast 609 targets genes against antismash database from 110k genbank ids 6.1. ./run_blastp.py with targets.609.fa March-6-2019 4:20pm RUNNING 6.2. cat out.targets.609* > out.609 133958 hits. STEP7. Filter blast hits by evalue <1e-8 and identity > 0.3 (and "FAB" identity > 0.6) 28543 hits. STEP8. Require both KS+ target <5kb apart. Run ../Antismash_gbids/target_ks_tandem.py ./target_ks_tandem.py > KS.609.5kb.fasta ./target_ks_tandem.py > target.609.5kb.fasta ./target_ks_tandem.py > target_ks_tandem.609.5kb.out 3238 hits. ********************************************************************************** II. PHYLOGENY Step Build a phylogenetic tree and look for homologs 2. ./get_phyla 3. Add FabF 7. Make a ref set file: mibig_refset.10.mod for targrts.12 and targets.92 (new targets in non-pks or non-bacterial clusters, no fungal KS in blast step) 8. Make descr files (with species and phyla names) 9. Get target names Remove redundant sequences: Run cdhit with cutoff .90 Make Multiple sequence alignment using mafft Make a phylogram using FastTree There is a redundant KS sequence, because there are two copies of the target in that cluster >LT840184|DEG10180179_Malonyl_CoA-acyl_carrier_protein_transacylase_(EC_2.3.1.39)|2950237|2950660|2948904|2950187|cluster-1|transatpks-nrps|2916063-2977676|50 IAIVGIAVKLPLADTVEQFANNLKTGRDCVRPIPSLRKQDTDLYFKQMGLEPEDLAYGEAAYLDEIDKFDYSFFKLSPREASLLDPNQRLFLETAWRAIEDAGYGGGKLGGSPTGVYVGYGSDADYLKLIRQVEPEAVSMSMAGNVRPIIASRLSYLMDLRGPSFIVDSTCSSSLVAVHLACQAIRNGECDSAIVGGTQLHLIPIREYEVGIESSTSRARTFDDRADGTGTGEGVVAMMLKPLEQAIESRDHIYAVIKSSALNQDGGSVGITAPNAEAQEAVIADAWKRAGIDPETIGYIETHGTGTKLGDPIEVEGLKRAFRRFTDKRQFCAIGALKSNIGHLDNTAGIAGLLKAVLSLKNKCIYPTLHFDRPNRVIDFAESPVYVNDKLMEWKSGPHPRRCGVSSFGISGTNCHVILEEAP >LT840184|DEG10180179_Malonyl_CoA-acyl_carrier_protein_transacylase_(EC_2.3.1.39)|4505173|4505599|4509851|4511098|cluster-1|transatpks-nrps|4480334-4564302|4252 IAIIGIDAKIGSAKNVEELWDYLSNGFDLIRDFPVERWKDANQFYKLKFNRQLPEELTPCSYIDRLDMFDAGFFQLSPAEAELMEPAQRLFLESAWTALEDSGYGKGILNGSKTGVFLGYNNPQNPYNMVVEETDNYMYGVAVSGNVDAIIASRISYFLNLKGPAVNVDTSCSSSLVAVHLACQQIRDGEVSMALAGSVRLKTLPPHQHGKKMGIESSSGRTKTFDDRADGTGIGEGVICIVLKSLRQAVRDRDNIYAVIKGSSINQDGASVGITAPNAAAQEEVIKAAWKDAGINPETITYIEAHGTATNLGDPVEINGIERAFGSFTDKKQFCAIGSVKSNAGHLDCASGLAGLVKAILMLKHKQLLPTLHFEVPNRKINFISSPVYINDKLVPWETESGSRRCGVSSFGMSGTNCHLVLEEAP 1973 Manually remove the more distant one Visualize tree using APE package in R root tree with E. coli withFabF color tree by targrets ********************************************************************************* III. COEVOLUTION STEP 1. (Step 8 from Blast step) Run ../Antismash_gbids/target_ks_tandem.py Output: target_ks_tandem.5kb.out (408 hits) KS.92.5kb.fasta targets.92.5kb.fasta 2. Make blast databases for KS and targets and move to blastdb/ makeblastdb -in KS.92.5kb.fasta -dbtype prot -out KS.92.5kb makeblastdb -in targets.92.5kb.fasta -dbtype prot -out targets.92.5kb makeblastdb -in KS.92.10kb.fasta -dbtype prot -out KS.92.10kb makeblastdb -in targets.92.10kb.fasta -dbtype prot -out targets.92.10kb 3. Pairwise blastp blastp -db blastdb/KS.92.5kb -query KS.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.5kb.out blastp -db blastdb/targets.92.5kb -query targets.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.5kb.out blastp -db blastdb/KS.92.10kb -query KS.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.10kb.out blastp -db blastdb/targets.92.10kb -query targets.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.10kb.out 4. Parse pairwise blast parse_pairwise_identities.py Output file: pairwise_identities.[number of targets].[distance cutoff]kb.out pairwise_identities.92.5kb.out 5. Plot pairwise identities plot_pairwise_identities.py Output file: 12.5kb.png