I. Identifying pks and pks/nrps clusters harboring a potential self-resistance
gene.

********************************************************************************
STEP1. Blast 8 KSs against 11 databases
********************************************************************************


Blast 8 diverse KSs against 11 databases and parse blast results with e value < 1
Location: TargetMining/Blast/ folder
The blast queries are in Manual/

1. update 11 databases (as of April 25, 2018, except for wgs – Jan-27-2016)
    cd /mnt/gnpn/gnpn/projects/orphanpks/blast/db/
    Databases: nt, wgs, refseq_genomic, other_genomic, env_nt, patnt, htgs, tsa_nt, sts, gss, est_others
    perl update_blastdb.pl --passive [database]

2. Download Bacterial Assemblies:
    wget -L --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 -A genomic.gbff.gz https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/  -P Assemblies/


2. Run ./run_blast.py
    specify on line 49 which query sequences to blast against
    If you want to run blast agains all sequences:
    blast_query_files = glob.glob("%s/*.seq" % blast_query_dir)

2. Run make all

    All files stored in TargetMining/Blast/blast_results_seqs/ folder

    2.1 Concatenate all parsed blast files into one: make concat
    creates blast_results_seqs/blast_results.parsed.KS.9f

    2.2 Uniqify the proteins and remove dashes and asterisks and choose longest
    synonym: make longest (have to rm output files of Step #2.4)
    creates blast_results.parsed.KS.9f.synonyms
            blast_results.parsed.KS.9f.longest_synonym

    2.3. Compare the Reference set to the found set: make compare
    creates blast_results.parsed.KS.9f.longest_synonym.comparemibig

    2.4.  Make fasta file of blast hit sequences to use in HMM search and clean the
    names: make fasta
    creates blast_results.KS.fasta
            blast_results.KS.fasta.cleanName

    2.5. Remove redundant sequences (cdhit cutoff = 0.99): make cdhit
    creates blast_results.KS.fasta.cleanName.cdhit.99
            blast_results.KS.fasta.cleanName.cdhit.99.clstr

    199 894 proteins;
 110174 NCBI nucleotide records/genomes
********************************************************************************
STEP2. Fetch all genbank ids that contain a KS
********************************************************************************

1. make Genbank dir:
 mkdir GenBank
 mkdir Genbank/gbdir
 cp get_gbids.py from oldTargetMining
 cp fetch_gbids.py from oldTargetMining

2. Run ./get_gbid.py > gbids.txt

    cat gbids.txt |wc
    162984  162984 1918063

    cat gbids.txt |sort -u > gbids.unique.94516.txt
    cat gbids.unique.94516.txt |wc
    94516   94516 1141060

3. Run fetch_gbid.py
    Genbank/gbdir$ lr |wc
    94517  850646 6623003

********************************************************************************
STEP3. Run Antismash on all 95k genbank ids
********************************************************************************

       Antismash_gbids/
       Run run_antismash.Agbids.py


Step 1: BLAST search for KS homologs.
Using 8 diverse KS query sequences, e-value < 1
199 894 proteins;
 110174 NCBI nucleotide records/genomes

Step 2: Fetch all genbank ids that contain a KS
       Genbank/gbdir (89449 gbids)
       Genbank/assembly_gb (21080 gbids)

STEP3. Run Antismash on all 89k genbank ids and 21k assembly ids
	Some assembly gb files split into smaller files (26575 assembly files total)

STEP4. Extract gene sequences from all pks clusters found by antismash
    cluster_genes.21k.fasta
    KS.21k.fasta
    cluster_genes.89k.fasta
    KS.89k.fasta
78 clusters don’t have predicted KS domains (details_data empty)
cat cluster_genes.all.fasta | grep ">" |cut -d"|" -f1,2 | sort -u |wc
  29987   29987  692154


STEP5. Make blast database from cluster_genes.21k.fasta and cluster_genes.89k.fasta
244 196 sequences
664 336 sequences

STEP6. Blast 12 targets genes against antismash database from 110k genbank ids
4404 hits.

STEP7. Filter blast hits by evalue <1e-8 and identity > 0.3 (and FabB/F identity > 0.6)
806 hits.

STEP8. Require both KS+ target <10kb apart (and 5kb apart)
252 hits (and 136 hits)

STEP9. add a FabF and cdhit.90
152 hits.
________________________________________________________________________________
Mine 92 putative targets, start from Step6:

STEP6. Blast 92 targets genes against antismash database from 110k genbank ids

        6.1. ./run_blastp.py with targets.92.fa
        6.2. cat out.targets.92* > out.92

31944 hits.

STEP7. Filter blast hits by evalue <1e-8 and identity > 0.3 (and "FAB" identity > 0.6)
5492 hits.

STEP8. Require both KS+ target <5kb apart.
       Run ../Antismash_gbids/target_ks_tandem.py
      ./target_ks_tandem.py > KS.92.5kb.fasta
      ./target_ks_tandem.py > KS.92.5kb.fasta
      ./target_ks_tandem.py > target_ks_tandem.92.5kb.out
501 hits.

**********************************************************************************
II. PHYLOGENY Step
Build a phylogenetic tree and look for homologs

2. ./get_phyla
3. Add FabF
7. Make a ref set file: mibig_refset.10.mod for targrts.12 and targets.92 (new targets in non-pks or non-bacterial clusters, no fungal KS in blast step)
8. Make descr files
9. Get target names

Remove redundant sequences: Run cdhit with cutoff .90
Make Multiple sequence alignment using mafft
Make a phylogram using FastTree
Visualize tree using APE package in R
  root tree with E. coli withFabF
  color tree by targrets
*********************************************************************************

III. COEVOLUTION STEP

1. (Step 8 from Blast step) Run ../Antismash_gbids/target_ks_tandem.py
Output: target_ks_tandem.5kb.out (408 hits)
        KS.92.5kb.fasta
        targets.92.5kb.fasta

2. Make blast databases for KS and targets and move to blastdb/
    makeblastdb -in KS.92.5kb.fasta -dbtype prot -out KS.92.5kb
    makeblastdb -in targets.92.5kb.fasta -dbtype prot -out targets.92.5kb

    makeblastdb -in KS.92.10kb.fasta -dbtype prot -out KS.92.10kb
    makeblastdb -in targets.92.10kb.fasta -dbtype prot -out targets.92.10kb

3. Pairwise blastp
blastp -db blastdb/KS.92.5kb -query KS.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.5kb.out
blastp -db blastdb/targets.92.5kb -query targets.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.5kb.out

blastp -db blastdb/KS.92.10kb -query KS.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.10kb.out
blastp -db blastdb/targets.92.10kb -query targets.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.10kb.out


4. Parse pairwise blast
parse_pairwise_identities.py
Output file: pairwise_identities.[number of targets].[distance cutoff]kb.out
             pairwise_identities.92.5kb.out

5. Plot pairwise identities
plot_pairwise_identities.py
Output file: 12.5kb.png



________________________________________________________________________________
Mine 609 E. coli essential genes Step6:

STEP6. Blast 609 targets genes against antismash database from 110k genbank ids

        6.1. ./run_blastp.py with targets.609.fa March-6-2019 4:20pm RUNNING
        6.2. cat out.targets.609* > out.609

133958 hits.

STEP7. Filter blast hits by evalue <1e-8 and identity > 0.3 (and "FAB" identity > 0.6)
28543 hits.

STEP8. Require both KS+ target <5kb apart.
       Run ../Antismash_gbids/target_ks_tandem.py
      ./target_ks_tandem.py > KS.609.5kb.fasta
      ./target_ks_tandem.py > target.609.5kb.fasta
      ./target_ks_tandem.py > target_ks_tandem.609.5kb.out
3238 hits.

**********************************************************************************
II. PHYLOGENY Step
Build a phylogenetic tree and look for homologs

2. ./get_phyla
3. Add FabF
7. Make a ref set file: mibig_refset.10.mod for targrts.12 and targets.92 (new targets in non-pks or non-bacterial clusters, no fungal KS in blast step)
8. Make descr files (with species and phyla names)
9. Get target names

Remove redundant sequences: Run cdhit with cutoff .90
Make Multiple sequence alignment using mafft
Make a phylogram using FastTree

There is a redundant KS sequence, because there are two copies of the target in that cluster
>LT840184|DEG10180179_Malonyl_CoA-acyl_carrier_protein_transacylase_(EC_2.3.1.39)|2950237|2950660|2948904|2950187|cluster-1|transatpks-nrps|2916063-2977676|50
IAIVGIAVKLPLADTVEQFANNLKTGRDCVRPIPSLRKQDTDLYFKQMGLEPEDLAYGEAAYLDEIDKFDYSFFKLSPREASLLDPNQRLFLETAWRAIEDAGYGGGKLGGSPTGVYVGYGSDADYLKLIRQVEPEAVSMSMAGNVRPIIASRLSYLMDLRGPSFIVDSTCSSSLVAVHLACQAIRNGECDSAIVGGTQLHLIPIREYEVGIESSTSRARTFDDRADGTGTGEGVVAMMLKPLEQAIESRDHIYAVIKSSALNQDGGSVGITAPNAEAQEAVIADAWKRAGIDPETIGYIETHGTGTKLGDPIEVEGLKRAFRRFTDKRQFCAIGALKSNIGHLDNTAGIAGLLKAVLSLKNKCIYPTLHFDRPNRVIDFAESPVYVNDKLMEWKSGPHPRRCGVSSFGISGTNCHVILEEAP
>LT840184|DEG10180179_Malonyl_CoA-acyl_carrier_protein_transacylase_(EC_2.3.1.39)|4505173|4505599|4509851|4511098|cluster-1|transatpks-nrps|4480334-4564302|4252
IAIIGIDAKIGSAKNVEELWDYLSNGFDLIRDFPVERWKDANQFYKLKFNRQLPEELTPCSYIDRLDMFDAGFFQLSPAEAELMEPAQRLFLESAWTALEDSGYGKGILNGSKTGVFLGYNNPQNPYNMVVEETDNYMYGVAVSGNVDAIIASRISYFLNLKGPAVNVDTSCSSSLVAVHLACQQIRDGEVSMALAGSVRLKTLPPHQHGKKMGIESSSGRTKTFDDRADGTGIGEGVICIVLKSLRQAVRDRDNIYAVIKGSSINQDGASVGITAPNAAAQEEVIKAAWKDAGINPETITYIEAHGTATNLGDPVEINGIERAFGSFTDKKQFCAIGSVKSNAGHLDCASGLAGLVKAILMLKHKQLLPTLHFEVPNRKINFISSPVYINDKLVPWETESGSRRCGVSSFGMSGTNCHLVLEEAP


1973
Manually remove the more distant one

Visualize tree using APE package in R
  root tree with E. coli withFabF
  color tree by targrets
*********************************************************************************

III. COEVOLUTION STEP

1. (Step 8 from Blast step) Run ../Antismash_gbids/target_ks_tandem.py
Output: target_ks_tandem.5kb.out (408 hits)
        KS.92.5kb.fasta
        targets.92.5kb.fasta

2. Make blast databases for KS and targets and move to blastdb/
    makeblastdb -in KS.92.5kb.fasta -dbtype prot -out KS.92.5kb
    makeblastdb -in targets.92.5kb.fasta -dbtype prot -out targets.92.5kb

    makeblastdb -in KS.92.10kb.fasta -dbtype prot -out KS.92.10kb
    makeblastdb -in targets.92.10kb.fasta -dbtype prot -out targets.92.10kb

3. Pairwise blastp
blastp -db blastdb/KS.92.5kb -query KS.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.5kb.out
blastp -db blastdb/targets.92.5kb -query targets.92.5kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.5kb.out

blastp -db blastdb/KS.92.10kb -query KS.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out KS.92.10kb.out
blastp -db blastdb/targets.92.10kb -query targets.92.10kb.fasta -outfmt "6 qseqid sseqid sstart send nident qlen slen evalue" -evalue 1 -out targets.92.10kb.out


4. Parse pairwise blast
parse_pairwise_identities.py
Output file: pairwise_identities.[number of targets].[distance cutoff]kb.out
             pairwise_identities.92.5kb.out

5. Plot pairwise identities
plot_pairwise_identities.py
Output file: 12.5kb.png