
Create a database (DB) from a custom database.

Good morning! I need your help!
Starting from the leBIBI database "leBIBI IV SSU-rDNA (16S) Automated ProKaryotes Phylogeny," I tried to generate the database files that NanoCLUST needs for its analysis. I used tools such as makeblastdb from BLAST+ 2.13.0, expecting files with the following extensions: .ndb, .nhr, .nin, .nnd, .nni, .nog, .nos, .not, .nsq, .ntf, .nto. Two of them, .nnd and .nni, are never created.
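
One way to confirm which index files a makeblastdb run actually produced is to compare the files on disk against the extension list above. This is only a minimal sketch: the helper name `missing_blast_files` and the example prefix are placeholders, not part of NanoCLUST or BLAST+.

```python
from pathlib import Path

# Extensions listed above for a nucleotide BLAST database.
EXPECTED_EXTENSIONS = [
    ".ndb", ".nhr", ".nin", ".nnd", ".nni", ".nog",
    ".nos", ".not", ".nsq", ".ntf", ".nto",
]

def missing_blast_files(db_prefix, extensions=EXPECTED_EXTENSIONS):
    """Return the extensions for which no '<db_prefix><ext>' file exists."""
    prefix = Path(db_prefix)
    return [
        ext for ext in extensions
        if not prefix.with_name(prefix.name + ext).exists()
    ]

if __name__ == "__main__":
    # Hypothetical prefix; replace it with the value passed to --db.
    print(missing_blast_files("db/16S_ribosomal_RNA"))
```

Running it against the prefix given to `--db` prints the extensions that have no matching file, which makes it easy to compare a custom build with a database that is known to work.
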
When I run the program, I get the following error:

(Nextflow) cnr-strep@cnrstrep-Precision-3660:~/NanoCLUST$ nextflow run -profile docker --reads '/media/cnr-strep/ACC22AE1C22AB00E/FastqHAC-Lactobacillus/FastQ_Bichat/Fastq-HAC-16052022/barcode17/trimming/barcode17.filtered.fastq' --db 'db/16S_ribosomal_RNA' --tax 'db/taxdb/'
N E X T F L O W ~ version 22.10.6
Launching [determined_ampere] DSL1 - revision: 2a51687d92

NanoCLUST v1.0dev

Run Name : determined_ampere
Reads : /media/cnr-strep/ACC22AE1C22AB00E/FastqHAC-Lactobacillus/FastQ_Bichat/Fastq-HAC-16052022/barcode17/trimming/barcode17.filtered.fastq
Max Resources : 128 GB memory, 16 cpus, 10d time per job
Container : docker - [:]
Output dir : ./results
Launch dir : /home/cnr-strep/NanoCLUST
Working dir : /home/cnr-strep/NanoCLUST/work
Script dir : /home/cnr-strep/NanoCLUST
User : cnr-strep
Config Profile : docker

executor > local (23)
[8b/15691c] process > QC (1) [100%] 1 of 1 ✔
[5e/36346c] process > fastqc (1) [100%] 1 of 1 ✔
[3c/3e9715] process > kmer_freqs (1) [100%] 1 of 1 ✔
[26/7c4085] process > read_clustering (1) [100%] 1 of 1 ✔
[3c/d03ee2] process > split_by_cluster (1) [100%] 1 of 1 ✔
[96/2c6b76] process > read_correction (3) [100%] 3 of 3 ✔
[bb/f3035a] process > draft_selection (3) [100%] 3 of 3 ✔
[21/1e894c] process > racon_pass (3) [100%] 3 of 3 ✔
[bf/28314d] process > medaka_pass (3) [100%] 3 of 3 ✔
[90/ad3c87] process > consensus_classification (3) [100%] 3 of 3 ✔
[07/d23aa0] process > join_results (1) [100%] 1 of 1 ✔
[4f/f929af] process > get_abundances (1) [ 0%] 0 of 1
[- ] process > plot_abundances -
[fe/e2e60d] process > output_documentation [100%] 1 of 1 ✔

executor > local (23)
[8b/15691c] process > QC (1) [100%] 1 of 1 ✔
[5e/36346c] process > fastqc (1) [100%] 1 of 1 ✔
[3c/3e9715] process > kmer_freqs (1) [100%] 1 of 1 ✔
[26/7c4085] process > read_clustering (1) [100%] 1 of 1 ✔
[3c/d03ee2] process > split_by_cluster (1) [100%] 1 of 1 ✔
[96/2c6b76] process > read_correction (3) [100%] 3 of 3 ✔
[bb/f3035a] process > draft_selection (3) [100%] 3 of 3 ✔
[21/1e894c] process > racon_pass (3) [100%] 3 of 3 ✔
[bf/28314d] process > medaka_pass (3) [100%] 3 of 3 ✔
[90/ad3c87] process > consensus_classification (3) [100%] 3 of 3 ✔
[07/d23aa0] process > join_results (1) [100%] 1 of 1 ✔
[4f/f929af] process > get_abundances (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > plot_abundances -
[fe/e2e60d] process > output_documentation [100%] 1 of 1 ✔
Execution cancelled -- Finishing pending tasks before exit
[nf-core/nanoclust] Pipeline completed with errors
WARN: Graphviz is required to render the execution DAG in the given format -- See for more info.
Error executing process > 'get_abundances (1)'

Caused by:
Process get_abundances (1) terminated with an error exit status (1)

Command executed [/home/cnr-strep/NanoCLUST/templates/]:

#!/usr/bin/env python

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
import pandas as pd
from functools import reduce
import requests
import json

def get_taxname(tax_id,tax_level):
    tags = {"S": "species_name","G": "genus_name","F": "family_name","O":'order_name', "C": "class_name"}
    tax_level_tag = tags[tax_level]
    #Avoids pipeline crash due to "nan" classification output. Thanks to Qi-Maria from Github
    if str(tax_id) == "nan":
        tax_id = 1

    path = '[]=' + str(int(tax_id)) + '&extra=true&names=true'
    complete_tax = requests.get(path).text

    #Checks for API correct response (field containing the tax name). Thanks to devinbrown from Github
    try:
        name = json.loads(complete_tax)[0][tax_level_tag]
    except:
        name = str(int(tax_id))

    return json.loads(complete_tax)[0][tax_level_tag]

def get_abundance_values(names,paths):
    dfs = []
    for name,path in zip(names,paths):
        data = pd.read_csv(path, index_col=False, sep=';').iloc[:,1:]

        total = sum(data['reads_in_cluster'])
        rel_abundance = []

        for index,row in data.iterrows():
            rel_abundance.append(row['reads_in_cluster'] / total)
        data['rel_abundance'] = rel_abundance
        dfs.append(pd.DataFrame({'taxid': data['taxid'], 'rel_abundance': rel_abundance}))
        data.to_csv("" + name + "_nanoclust_out.txt")

    return dfs

def merge_abundance(dfs,tax_level):
    df_final = reduce(lambda left,right: pd.merge(left,right,on='taxid',how='outer').fillna(0), dfs)
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    df_final_grp = df_final.groupby(["taxid"], as_index=False).sum()
    return df_final_grp

def get_abundance(names,paths,tax_level):
    if(not isinstance(paths, list)):
        paths = [paths]
        names = [names]

    dfs = get_abundance_values(names,paths)
    df_final_grp = merge_abundance(dfs, tax_level)
    df_final_grp.to_csv("rel_abundance_"+ names[0] + "_" + tax_level + ".csv", index = False)

paths = "barcode17.filtered.nanoclust_out.txt"
names = "barcode17.filtered"

get_abundance(names,paths, "G")
get_abundance(names,paths, "S")
get_abundance(names,paths, "O")
get_abundance(names,paths, "F")

Command exit status:

Command output:

Command error:
Traceback (most recent call last):
  File "", line 65, in <module>
    get_abundance(names,paths, "G")
  File "", line 59, in get_abundance
    df_final_grp = merge_abundance(dfs, tax_level)
  File "", line 49, in merge_abundance
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File "", line 49, in <listcomp>
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File "", line 28, in get_taxname
    return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range
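
The traceback points at the final `return` in `get_taxname`, which indexes the parsed response a second time, outside the fallback described by the comment above it. Below is a minimal sketch of a lookup that tolerates an empty response. It assumes, as the `IndexError` suggests but which is not confirmed here, that the taxonomy service returned an empty JSON list for a taxid it does not know. `extract_tax_name` is a hypothetical helper, not part of NanoCLUST.

```python
import json

def extract_tax_name(response_text, tax_id, tax_level_tag):
    """Return the name at the requested rank, or the taxid as a string.

    response_text is the raw JSON body returned by the taxonomy service.
    """
    try:
        return json.loads(response_text)[0][tax_level_tag]
    except (IndexError, KeyError, TypeError, ValueError):
        # Empty list, missing field, or a body that is not the expected
        # JSON list: fall back to the numeric taxid.
        return str(int(tax_id))
```

With this shape, an unknown taxid would show up in the abundance table under its number instead of stopping the pipeline, which would also make it visible which taxids from a custom database cannot be resolved.
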

Work dir:

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out
(Nextflow) cnr-strep@cnrstrep-Precision-3660:~/NanoCLUST$

Please, could you guide me on how to generate a database that can be interpreted by NanoCLUST from a FASTA file containing a list of selected 16S sequences?
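
To make the question concrete, here is a sketch of how such a build could be scripted. It only assembles a makeblastdb command line from options that BLAST+ provides (`-in`, `-dbtype`, `-parse_seqids`, `-taxid_map`, `-out`). The file names are placeholders, and whether the resulting database is enough for NanoCLUST is exactly what is unclear; the taxid map is included on the assumption that NanoCLUST needs a taxid for each sequence.

```python
import shlex

def makeblastdb_command(fasta, out_prefix, taxid_map=None):
    """Build a makeblastdb argument list for a nucleotide database."""
    cmd = [
        "makeblastdb",
        "-in", fasta,
        "-dbtype", "nucl",
        "-parse_seqids",
        "-out", out_prefix,
    ]
    if taxid_map is not None:
        # Text file with one "<sequence id> <taxid>" pair per line.
        cmd += ["-taxid_map", taxid_map]
    return cmd

if __name__ == "__main__":
    # Placeholder file names, not the real ones from this run.
    cmd = makeblastdb_command("selected_16S.fasta", "db/custom_16S", "taxid_map.txt")
    # Print the command for inspection; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```
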

Thank you very much!

Miguel Angel Hernandez