genomicsITER/NanoCLUST

Error executing process > 'get_abundances (1)'

BirgitRijvers opened this issue · 4 comments

Hi, I'm struggling with an error when I try to run NanoCLUST on my own data. The pipeline runs fine when I use the test dataset provided with it. On my own data, the pipeline works until the get_abundances step. This is the same issue as posted issues #32 and #71, and based on those issues the problem seems to have something to do with certain taxa and the database. I'm using the blast 16S ribosomal RNA database, and my reads are obtained from sputum samples.

When I check the work directory the '.command.begin', '.command.out', '.command.trace', and '.exitcode' files are empty, but the 'nanoclust_out.txt' file does contain information (see below the error message).

The error message I get looks like:

Error executing process > 'get_abundances (1)'

Caused by:
Process 'get_abundances (1)' terminated with an error exit status (1)

Command executed [/home/brijvers/programs/NanoCLUST/templates/get_abundance.py]:

#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc
import pandas as pd
from functools import reduce
import requests
import json
#https://unipept.ugent.be/apidocs/taxonomy

def get_taxname(tax_id,tax_level):
    tags = {"S": "species_name","G": "genus_name","F": "family_name","O":'order_name', "C": "class_name"}
    tax_level_tag = tags[tax_level]
    #Avoids pipeline crash due to "nan" classification output. Thanks to Qi-Maria from Github
    if str(tax_id) == "nan":
        tax_id = 1
    
    path = '[http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=](http://api.unipept.ugent.be/api/v1/taxonomy.json?input%5b%5d=)' + str(int(tax_id)) + '&extra=true&names=true'
    complete_tax = requests.get(path).text

    #Checks for API correct response (field containing the tax name). Thanks to devinbrown from Github
    try:
        name = json.loads(complete_tax)[0][tax_level_tag]
    except:
        name = str(int(tax_id))

    return json.loads(complete_tax)[0][tax_level_tag]

def get_abundance_values(names,paths):
    dfs = []
    for name,path in zip(names,paths):
        data = pd.read_csv(path, index_col=False, sep=';').iloc[:,1:]

        total = sum(data['reads_in_cluster'])
        rel_abundance=[]

        for index,row in data.iterrows():
            rel_abundance.append(row['reads_in_cluster'] / total)
            
        data['rel_abundance'] = rel_abundance
        dfs.append(pd.DataFrame({'taxid': data['taxid'], 'rel_abundance': rel_abundance}))
        data.to_csv("" + name + "_nanoclust_out.txt")

    return dfs

def merge_abundance(dfs,tax_level):
    df_final = reduce(lambda left,right: pd.merge(left,right,on='taxid',how='outer').fillna(0), dfs)
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
    df_final_grp = df_final.groupby(["taxid"], as_index=False).sum()
    return df_final_grp

def get_abundance(names,paths,tax_level):
    if(not isinstance(paths, list)):
        paths = [paths]
        names = [names]

    dfs = get_abundance_values(names,paths)
    df_final_grp = merge_abundance(dfs, tax_level)
    df_final_grp.to_csv("rel_abundance_"+ names[0] + "_" + tax_level + ".csv", index = False)

paths = "bc09_10.nanoclust_out.txt"
names = "bc09_10"

get_abundance(names,paths, "G")
get_abundance(names,paths, "S")
get_abundance(names,paths, "O")
get_abundance(names,paths, "F")

Command exit status:
1

Command output:
(empty)

Command error:
Traceback (most recent call last):
  File "/home/brijvers/work/be/9cfc38d891922966c6fca1e2e7f5a4/.command.sh", line 65, in <module>
    get_abundance(names,paths, "G")
  File "/home/brijvers/work/be/9cfc38d891922966c6fca1e2e7f5a4/.command.sh", line 59, in get_abundance
    df_final_grp = merge_abundance(dfs, tax_level)
  File "/home/brijvers/work/be/9cfc38d891922966c6fca1e2e7f5a4/.command.sh", line 49, in merge_abundance
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File "/home/brijvers/work/be/9cfc38d891922966c6fca1e2e7f5a4/.command.sh", line 49, in <listcomp>
    df_final["taxid"] = [get_taxname(row["taxid"], tax_level) for index, row in df_final.iterrows()]
  File "/home/brijvers/work/be/9cfc38d891922966c6fca1e2e7f5a4/.command.sh", line 28, in get_taxname
    return json.loads(complete_tax)[0][tax_level_tag]
IndexError: list index out of range 

The nanoclust_out.txt file does contain information about the classified reads, and looks like this:

,reads_in_cluster,used_for_consensus,reads_after_corr,draft_id,sciname,taxid,length,per_ident,rel_abundance
0,84,100,37,c0ffd863-a5c0-4ac2-893f-b268894dc174 id=49,,1955272,1414,99.081,0.001077309803519212
1,131,100,38,a12734ca-7201-46f8-9ada-d19f000e03a6 id=89,,287,1497,97.662,0.0016800902888216283
2,59,100,38,65e7f2b7-6991-4d14-9dc0-563d0e17db50 id=11,,760570,1498,98.865,0.0007566818858051608
3,1881,100,39,2ea4d0c5-c550-44ed-822c-4b49bf7cf331 id=77,,1165090,1296,99.76899999999999,0.024124044528805212
4,82,100,38,020a425a-03ab-4c80-aa5c-9eb00d9ed896 id=25,,137732,1513,98.744,0.0010516595701020879
5,62,100,38,2865b0ff-b7de-4b82-886e-2a855e055fdc id=19,,735,1466,98.568,0.000795157235930847
6,149,100,37,f334faa5-1ba8-4935-8237-81b04c411726 id=77,,1955272,1420,92.465,0.001910942389575745
7,86,100,37,d99b1281-35bd-4889-81b8-d5360eadfb6f id=57,,2604832,1535,90.03299999999999,0.0011029600369363362
8,58,100,42,1d1be1a7-bb17-40bb-b30f-d9cc17dccdda id=44,,2912308,1355,99.557,0.0007438567690965988
9,71,100,38,1da2de7d-8165-40f9-a99c-2b2f169f8292 id=9,,29466,1525,98.885,0.0009105832863079054
10,118,100,38,90b63e84-e872-4380-bdc4-a5524dcd22ba id=33,,287,1499,92.995,0.0015133637716103216
11,233,100,38,7f200f5b-414f-42c4-91d3-c70372e42749 id=69,,33053,1483,99.52799999999999,0.002988252193094957
12,90,100,38,bb2e6c4d-9cbb-4dd7-aee0-360b4973b141 id=10,,2912308,1381,95.51,0.0011542605037705843
13,213,100,38,f470cb17-7ae7-461a-8be1-0d89d6cf9326 id=87,,287,1497,95.324,0.0027317498589237163
14,72,100,38,90272b8d-c98e-4ca4-abe0-01434687a5b4 id=28,,39778,1498,99.59899999999999,0.0009234084030164674
15,75,100,38,c32db3f2-516d-44d7-a256-e7a4ad38569b id=15,,1379,1512,99.669,0.0009618837531421536
16,71,100,38,07158be8-7267-44cd-aa27-bd2a39789ee1 id=23,,2912308,1385,91.625,0.0009105832863079054
17,139,100,41,d09a99a7-04cc-4952-bd7b-5df9b741cb81 id=12,,287,1409,99.85799999999999,0.0017826912224901247
18,192,100,38,f33c8e76-006b-41d6-aa15-bd53f2cb9e67 id=38,,28037,1504,99.335,0.002462422408043913
19,129,100,38,1be58b08-9694-4d75-a6a5-26050b61d6cd id=52,,726,1488,98.723,0.0016544400554045042
20,129,100,38,e35c290b-3f86-4b1a-aa57-7cfeb11988d0 id=3,,1458253,1510,99.47,0.0016544400554045042
21,137,100,37,f058adb5-33cb-427e-bd9c-696646895a4e id=42,,1955272,1468,99.046,0.0017570409890730007
22,47308,100,38,249452ea-38bf-4b2b-956b-1a514c18f3bf id=26,,287,1496,99.866,0.6067306212486534
23,160,100,38,5ce746f7-afdf-4523-8376-2a1e974f7e76 id=47,,2819619,1508,99.13799999999999,0.0020520186733699276
24,4268,100,37,fcb00fb4-d60b-4b04-9991-d7731b48eabd id=4,,1955272,1468,98.84200000000001,0.05473759811214282
25,389,100,37,32edc207-4f2a-4400-a1eb-7aeda7b9a275 id=49,,287,1496,99.866,0.004988970399630636
26,4531,100,37,a18cdba5-321f-4a80-81ab-f531732cbbbd id=51,,1955272,1469,98.57,0.05811060380649464
27,2994,100,37,debde32e-1ed1-4924-99fa-487809d29f62 id=77,,1955272,1468,99.387,0.03839839942543477
28,397,100,38,a4d92378-2c19-40bd-833a-4aab0bccf55f id=10,,2819619,1508,99.602,0.005091571333299133
29,131,100,38,a590f244-5c48-4775-b969-c3d3812cb95e id=11,,888828,1503,99.53399999999999,0.0016800902888216283
30,118,100,38,42ec3bdc-d1a0-4b39-a01f-bc3abaac6575 id=55,,287,1501,91.206,0.0015133637716103216
31,12796,100,37,67d2f564-2cff-4882-b930-27ccf520d26d id=30,,1955272,1468,99.046,0.16411019340275995
32,102,100,37,ac369ea8-30c1-43fc-bd0e-a0b03852c0b3 id=15,,287,1530,93.399,0.001308161904273329
33,130,100,38,2d116f20-0a1d-4899-b29f-4a29dc0f366c id=28,,137732,1515,97.162,0.0016672651721130662
34,241,100,37,3e4cbd12-dc2c-4bef-b2ae-ab42f2a63158 id=43,,1955272,1414,96.605,0.0030908531267634536
35,146,100,37,82bf9d2d-b895-4e85-beee-3f60bd595745 id=3,,1955272,1417,94.28399999999999,0.001872467039450059

I would be grateful for any assistance anyone could offer.

i meet same issues

Same issue here. Really hope the devs respond.

@LynnGoodnight @iaposto I have figured out how to fix the error! I changed the 'get_abundances.py' file in the templates directory of NanoCLUST. I removed this on line 22+:

try:
        name = json.loads(complete_tax)[0][tax_level_tag]
    except:
        name = str(int(tax_id))
    return json.loads(complete_tax)[0][tax_level_tag]

And I replaced it with these lines:

path = '[http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]='](http://api.unipept.ugent.be/api/v1/taxonomy.json?input%5b%5d=%27) + str(int(tax_id)) + '&extra=true&names=true'
    complete_tax = requests.get(path).text
    # Check if the list returned by json.loads() is not empty
    tax_list = json.loads(complete_tax)
    if len(tax_list) > 0:
        name = tax_list[0][tax_level_tag]
    else:
        name = str(int(tax_id))
    return name

The error has something to do with certain taxids from NCBI not being in the UniPept database and thus giving an empty response that makes the whole pipeline crash. The new code prevents this crash and gives the taxid(numerical) as classification output. If you want to know which classification belongs to the taxid, just search up the taxid in the NCBI database.

Please let me know if this fix also works for you!

Hey @BirgitRijvers, many thanks for the update and for trying to resolve the issue.

I tried your solution and getting this error, probably need to reformat the path string:

Command error: File ".command.sh", line 23 path = '[http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]='](http://api.unipept.ugent.be/api/v1/taxonomy.json?input%5b%5d=%27) + str(int(tax_id)) + '&extra=true&names=true' ^ SyntaxError: unmatched ']'

EDIT1:
Found it! This worked for me:
path = 'http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=(http://api.unipept.ugent.be/api/v1/taxonomy.json?input%5b%5d=%27)' + str(int(tax_id)) + '&extra=true&names=true'

I wonder if the script can be modified to return the taxids only when the original get_taxname function fails. Will try it out.
Thanks again!

EDIT2:
Just added a try/except to return the taxid only when the script fails. Here's the modified get_taxname() function:

def get_taxname(tax_id,tax_level):
    tags = {"S": "species_name","G": "genus_name","F": "family_name","O":'order_name', "C": "class_name"}
    tax_level_tag = tags[tax_level]
    #Avoids pipeline crash due to "nan" classification output. Thanks to Qi-Maria from Github
    if str(tax_id) == "nan":
        tax_id = 1
    
    try:    
        path = 'http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=' + str(int(tax_id)) + '&extra=true&names=true'
        complete_tax = requests.get(path).text

        #Checks for API correct response (field containing the tax name). Thanks to devinbrown from Github
        try:
            name = json.loads(complete_tax)[0][tax_level_tag]
        except:
            name = str(int(tax_id))

        return json.loads(complete_tax)[0][tax_level_tag]
    except:
        path = 'http://api.unipept.ugent.be/api/v1/taxonomy.json?input[]=(http://api.unipept.ugent.be/api/v1/taxonomy.json?input%5b%5d=%27)' + str(int(tax_id)) + '&extra=true&names=true'
        complete_tax = requests.get(path).text
        # Check if the list returned by json.loads() is not empty
        tax_list = json.loads(complete_tax)
        if len(tax_list) > 0:
            name = tax_list[0][tax_level_tag]
        else:
            name = str(int(tax_id))
        return name