Error: AttributeError: 'DiGraph' object has no attribute 'node'
nextgenusfs opened this issue · 3 comments
Hi @ksahlin. Thanks a ton for all of your tools for noisy reads. I'm looking for a solution for de novo clustering of ONT amplicon reads from environmental sequencing, i.e., fungal rRNA amplicons. The data I'm trying this on is from a mock community of mixed species. The region is the ITS-LSU region of rRNA in fungi -- we typically define species with a 97% pident cutoff for this region. The data has been pre-processed by re-orienting reads into the same direction and finding/trimming both forward and reverse primer sequences.
I've tried isONclust and at first it seemed like it might be working great (and quite fast), but on further inspection it was a little too liberal in clustering the data that I have access to at the moment, effectively combining too many reads into the same "gene family". I ran a parameter search by varying k and w to see if I could get it to give me the proper results, but never found a set of parameters that could delineate the clusters properly. My goal is to find a method to identify the "centroid", as then it is relatively straightforward to use spoa and racon/medaka for error correction. I tried to clean up the clustering a little bit by invoking a "sub clustering": plotting read lengths (as fungal ITS-LSU sequences are variable in length) and then pulling out "peaks" from the lengths of reads -- this seemed to work okay, but it is still not quite what I'm looking for.
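For illustration, the length-based sub-clustering described above could be sketched roughly as follows. This is only a minimal sketch, not the code that was actually run: the function names, the bin width, the minimum count, and the maximum offset are all assumptions chosen for the example.

```python
from collections import Counter


def length_peaks(lengths, bin_width=10, min_count=5):
    """Return the start positions of histogram bins that are local maxima.

    bin_width and min_count are illustrative defaults, not tuned values.
    """
    # Histogram of read lengths, keyed by the start of each bin.
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    peaks = []
    for start, count in sorted(bins.items()):
        left = bins.get(start - bin_width, 0)
        right = bins.get(start + bin_width, 0)
        # A peak is a well-supported bin that is taller than its left
        # neighbour and at least as tall as its right neighbour.
        if count >= min_count and count > left and count >= right:
            peaks.append(start)
    return peaks


def assign_to_peak(length, peaks, bin_width=10, max_offset=30):
    """Return the nearest peak within max_offset bases, or None."""
    if not peaks:
        return None

    def centre(peak):
        return peak + bin_width / 2

    best = min(peaks, key=lambda p: abs(length - centre(p)))
    return best if abs(length - centre(best)) <= max_offset else None
```

Reads assigned to the same peak would form one sub-cluster; reads that fall far from every peak return None and can be set aside.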
Based on some of the other issues in your tool repositories, I then tried IsoCon, which you had indicated seemed to be a more general approach. IsoCon has a much longer runtime and eventually crashed with the error below (note that I initially ran it without --prefilter_candidates --min_candidate_support 2 and it crashed with the same error).
If you have any other suggestions on an appropriate workflow I'd be grateful to hear your opinions.
Thanks,
Jon
$ IsoCon pipeline -fl_reads reads.oriented.proper-primers.yacrd.fastq -outfolder isocon_test2 --verbose --prefilter_candidates --min_candidate_support 8 --nr_cores 7
fl_reads: reads.oriented.proper-primers.yacrd.fastq
outfolder: isocon_test2
ccs: None
nr_cores: 7
verbose: True
neighbor_search_depth: 4294967296
min_exon_diff: 20
min_candidate_support: 8
p_value_threshold: 0.01
min_test_ratio: 5
max_phred_q_trusted: 43
ignore_ends_len: 15
cleanup: False
prefilter_candidates: True
which: pipeline
is_fastq: True
ITERATION: 1
Max transcript length:2694, Min transcript length:806
Non-converged (unique) sequences left: 67501
[0, 964, 1928, 2892, 3856, 4820, 5784, 6748, 7712, 8676, 9640, 10604, 11568, 12532, 13496, 14460, 15424, 16388, 17352, 18316, 19280, 20244, 21208, 22172, 23136, 24100, 25064, 26028, 26992, 27956, 28920, 29884, 30848, 31812, 32776, 33740, 34704, 35668, 36632, 37596, 38560, 39524, 40488, 41452, 42416, 43380, 44344, 45308, 46272, 47236, 48200, 49164, 50128, 51092, 52056, 53020, 53984, 54948, 55912, 56876, 57840, 58804, 59768, 60732, 61696, 62660, 63624, 64588, 65552, 66516, 67480]
query chunks: [964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 21]
processing 0
processing 14500
processing 3000
processing 17500
processing 6000
processing 500
processing 9000
processing 12000
processing 15000
processing 3500
processing 18000
processing 6500
processing 1000
processing 9500
processing 12500
processing 15500
processing 4000
processing 7000
processing 18500
processing 10000
processing 1500
processing 13000
processing 4500
processing 16000
processing 7500
processing 10500
processing 2000
processing 19000
processing 13500
processing 5000
processing 16500
processing 8000
processing 11000
processing 2500
processing 19500
processing 14000
processing 5500
processing 17000
processing 8500
processing 11500
processing 29000
processing 20000
processing 20500
processing 32000
processing 23500
processing 35000
processing 26500
processing 21000
processing 29500
processing 32500
processing 38000
processing 24000
processing 35500
processing 27000
processing 21500
processing 30000
processing 33000
processing 24500
processing 38500
processing 27500
processing 36000
processing 22000
processing 30500
processing 33500
processing 25000
processing 39000
processing 28000
processing 36500
processing 22500
processing 31000
processing 25500
processing 34000
processing 28500
processing 39500
processing 37000
processing 23000
processing 31500
processing 40500
processing 26000
processing 34500
processing 43500
processing 40000
processing 46500
processing 37500
processing 41000
processing 55000
processing 49500
processing 52500
processing 44000
processing 47000
processing 55500
processing 58000
processing 50000
processing 41500
processing 53000
processing 44500
processing 58500
processing 56000
processing 47500
processing 50500
processing 42000
processing 53500
processing 59000
processing 56500
processing 45000
processing 48000
processing 51000
processing 54000
processing 42500
processing 59500
processing 57000
processing 45500
processing 51500
processing 48500
processing 54500
processing 60000
processing 57500
processing 43000
processing 60500
processing 52000
processing 46000
processing 49000
processing 67000
processing 67500
processing 61000
processing 64000
processing 61500
processing 64500
processing 62000
processing 65000
processing 65500
processing 62500
processing 66000
processing 63000
processing 66500
processing 63500
isolated: 0
Number of edges: 76499
Total edit distance: 14654968
Avg ed (ed/edges): 191.57071334265808
Traceback (most recent call last):
  File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 292, in <module>
    run_pipeline(params)
  File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 159, in run_pipeline
    candidate_file, read_partition, to_realign = isocon_get_candidates.find_candidate_transcripts(params.read_file, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/isocon_get_candidates.py", line 129, in find_candidate_transcripts
    G_star, graph_partition, M, converged = partitions.partition_strings(S, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/partitions.py", line 420, in partition_strings
    G_star, converged = graphs.construct_exact_nearest_neighbor_graph(S, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/graphs.py", line 63, in construct_exact_nearest_neighbor_graph
    if G.node[s1]["degree"] > 1:
AttributeError: 'DiGraph' object has no attribute 'node'
Hi @nextgenusfs, thank you, I appreciate it!
Regarding the runtime bug: I actually fixed it today (see #6). So you can either remove IsoCon and reinstall it from scratch (version 0.3.3), or simply downgrade networkx to 2.3 as follows:
pip uninstall networkx
pip install networkx==2.3
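For context, the traceback is consistent with the singular `G.node` attribute having been removed in networkx 2.4, while the plural `G.nodes` view is available throughout the 2.x series. The sketch below shows the same degree check written against `G.nodes`. It is only an illustration of the attribute change and not necessarily how the fix in IsoCon was implemented; the helper names `node_attrs` and `has_high_degree` are made up for this example.

```python
def node_attrs(G, n):
    """Return the attribute dict of node n.

    Assumes networkx 2.x or later, where G.nodes is a subscriptable view.
    The singular G.node was removed in networkx 2.4, which is why
    G.node[s1]["degree"] raises AttributeError there.
    """
    return G.nodes[n]


def has_high_degree(G, n, threshold=1):
    """Same comparison as the failing line, written against G.nodes."""
    return node_attrs(G, n)["degree"] > threshold
```

With networkx installed, building a graph with `G = networkx.DiGraph()` and `G.add_node("s1", degree=2)` and then calling `has_high_degree(G, "s1")` should return True on both 2.3 and later releases.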
As for the strategy, I will get back to you tomorrow (it's pretty late in my timezone). Here are some general comments on IsoCon for ONT sequencing:
- IsoCon does not handle reverse complements. If you have reverse complemented sequences the predictions will contain the sequence and its reverse complement, but maybe that’s easy to post-filter? Another strategy is to identify the primers beforehand and re-orient the reverse complements (to speed up runtime of IsoCon even more).
- There will be some redundant consensus sequences due to the different ONT error profile compared to IsoSeq data. Therefore, the parameters to specify would be --max_phred_q_trusted 20 (default is 43 for higher quality CCS reads) and --p_value_threshold 0.00001 (instead of the default 0.01). This could, however, also be post-filtered by simply removing consensus sequences with a p-value larger than, e.g., 0.00001 (the p-value is printed to the accession of the consensus sequence).
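The re-orientation strategy from the first point above could be sketched as follows. This is a simplified illustration under stated assumptions: it uses exact prefix matching of the forward primer, whereas noisy ONT reads would in practice need approximate matching, and the function names are hypothetical.

```python
# Translation table for complementing DNA; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(read, forward_primer):
    """Return the read in forward orientation, or None if the forward
    primer is not found at the start of either strand.

    Exact matching is an assumption made to keep the sketch short.
    """
    if read.startswith(forward_primer):
        return read
    rc = reverse_complement(read)
    if rc.startswith(forward_primer):
        return rc
    return None
```

Reads for which `orient` returns None carry the primer on neither strand under this exact-match rule and could be dropped or inspected separately.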
Great thanks. I downgraded networkx and I'll give it a re-try right now with your suggested ONT parameters.
The data I'm using is published, but I've already oriented and trimmed primers so I'm only trying to feed IsoCon the "cleaned up" data in hopes of being able to pick cluster centroids.
Also, regarding isONclust: you could increase the cluster thresholds, e.g., --mapped_threshold 0.9 --aligned_threshold 0.7 (and perhaps -k 12 -w 15 if runtime allows it). This will be more stringent (more clusters).
Also note that isONclust has an experimental --consensus parameter that performs what you described: spoa then medaka on each cluster. It may be convenient.