Requires Python 3.9 and a MEGAN map database to perform a lowest common ancestor (LCA) analysis on metagenomic sequences from alignment data. Refer to "LCA analysis of metagenomic sequences in Python" or read the docstrings for further information.
Import the pygan module and call its run API to perform an analysis.
import pygan

pygan.run(tre_file='resources/ncbi.tre',
          map_file='resources/ncbi.map',
          megan_map_file='resources/megan-map-Jan2021.db',
          blast_file='resources/Alice01-1mio-Jan-2021.txt',
          blast_map={'qseqid': 0, 'sseqid': 1, 'bitscore': 2},
          top_score_percent=0.1,
          db_segment_size=10000, db_key='Taxonomy',
          ignore_ancestors=False, min_support=100, only_major=False,
          exclude=['genus'], project_mode='accession', project_rank='genus',
          cluster_degree=1, out_file='lca_analysis.txt',
          prefix_rank=True, show_path=False, list_reads=False)
tre_file: Path to file containing taxonomy tree. NCBI or GTDB taxonomy recommended.
map_file: Path to file containing mapping of taxonomy ID to scientific name and rank. Must correspond to tre_file.
megan_map_file: Path to file containing megan_map.db. It is highly recommended to store the database on a medium with fast reading (SSD).
blast_file: Path to file containing alignment data.
blast_map: Mapping that states which column of the alignment data holds qseqid, sseqid and bitscore. E.g. {'qseqid': 0, 'sseqid': 1, 'bitscore': 2}. Required to parse the alignment data.
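As a rough illustration of how such a column mapping can be used, the sketch below pulls the three fields out of a tab-separated alignment line. The helper parse_hit and the sample lines are hypothetical and not part of pygan; the second call assumes the standard 12-column BLAST tabular layout, in which the bit score is the last column (index 11).

```python
def parse_hit(line, blast_map):
    """Extract query ID, subject ID and bit score from one tab-separated line.

    Hypothetical helper for illustration only; not part of the pygan API.
    """
    fields = line.rstrip("\n").split("\t")
    return (fields[blast_map["qseqid"]],
            fields[blast_map["sseqid"]],
            float(fields[blast_map["bitscore"]]))


# Made-up three-column line matching the mapping shown above.
hit = parse_hit("read_1\tWP_000000001.1\t87.4\n",
                {"qseqid": 0, "sseqid": 1, "bitscore": 2})

# Made-up line in the standard 12-column BLAST tabular layout.
std_line = "read_1\tWP_000000001.1\t98.5\t100\t1\t0\t1\t100\t5\t104\t1e-50\t87.4"
std_hit = parse_hit(std_line, {"qseqid": 0, "sseqid": 1, "bitscore": 11})

print(hit)
print(std_hit)
```

Only the mapping changes between the two calls, which is why the parser needs it.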
top_score_percent: Parameter used in the top score filter. Value must be between 0 and 1. A hit whose bit score lies within this fraction of the read's top score is kept for that read. E.g. with a top score of 50 and top_score_percent = 0.1, a hit scoring 47 remains and a hit scoring 43 is discarded. Prunes the alignment data.
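The arithmetic of the example above can be sketched as follows. This is a minimal illustration, not pygan's implementation: the function name filter_by_top_score is hypothetical, and treating a score exactly at the cutoff as kept is an assumption.

```python
def filter_by_top_score(bitscores, top_score_percent):
    """Keep the scores that lie within top_score_percent of the best score.

    Hypothetical helper; the inclusive cutoff (>=) is an assumption.
    """
    if not bitscores:
        return []
    cutoff = max(bitscores) * (1 - top_score_percent)
    return [score for score in bitscores if score >= cutoff]


# Top score 50 with top_score_percent 0.1 gives a cutoff of 45.
kept = filter_by_top_score([50, 47, 43], 0.1)
print(kept)  # [50, 47]
```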
db_segment_size: Tunes the performance of the accession-to-taxon-ID mapping. Different hardware may work better with different values. Values between 5,000 and 25,000 are recommended.
db_key: Taxonomy to map accessions to. Use 'Taxonomy' for NCBI and 'gtdb' for GTDB.
ignore_ancestors: Whether to ignore ancestors in the LCA algorithm. When True, no ancestor of another accession can be the resulting lowest common ancestor.
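One possible reading of this option, sketched on root-to-node paths: hits that are proper ancestors of another hit are dropped before the common prefix is taken. This is an interpretation of the description above, not pygan's code; lca_path and the toy lineages are hypothetical.

```python
def lca_path(paths, ignore_ancestors=False):
    """Return the longest common prefix of root-to-node paths.

    With ignore_ancestors=True, hits that are proper ancestors of another
    hit are dropped first (an assumed reading of the option).
    """
    if ignore_ancestors:
        paths = [p for p in paths
                 if not any(len(q) > len(p) and q[:len(p)] == p for q in paths)]
    prefix = paths[0]
    for path in paths[1:]:
        length = 0
        for a, b in zip(prefix, path):
            if a != b:
                break
            length += 1
        prefix = prefix[:length]
    return prefix


# Toy lineages: one hit on Bacteria itself, one on a genus below it.
hits = [("root", "Bacteria"), ("root", "Bacteria", "Escherichia")]
default_lca = lca_path(hits)                        # ends at Bacteria
strict_lca = lca_path(hits, ignore_ancestors=True)  # ends at Escherichia
print(default_lca[-1], strict_lca[-1])
```

Under this reading, the option keeps a read from being pulled up to a higher node only because one of its hits already sits on that node.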
min_support: Limit for the minimum support filter. All nodes with fewer reads than this limit forfeit their reads to their parent nodes.
only_major: When True, reads are forcibly pushed upwards to nodes of major ranks.
exclude: List of ranks to be exempted from the minimum support filter.
project_mode: Method of read projection to a specific rank. An attempt is made to contain all reads only in nodes of the target rank. Available methods are 'accession', 'proportional', and 'mixed'; alternatively, leave it empty ('').
project_rank: Target rank for read projection.
cluster_degree: When choosing the 'accession' method, specify a cluster degree of at least 1. Potential hits are clustered, which reduces false positives.
out_file: Path to output file. Output is generated as plain text.
prefix_rank: When True, add an abbreviation of a node's rank to its name when printing.
show_path: When True, print the entire path to the node instead of only its name.
list_reads: When True, print the list of a node's mapped read IDs. When False, print the number of a node's mapped reads.
After familiarizing yourself with the parameters and docstrings, script the analysis yourself or perform it in a REPL.
tree = parse_tree(tre_file, map_file)
id2address, address2id = compute_lca_addresses(tree)
reads, read_ids = parse_blast_filter(blast_file, top_score_percent, blast_map)
mapped_reads = map_accessions(reads, megan_map_file, db_segment_size, db_key)
map_lcas(tree, id2address, address2id, mapped_reads, read_ids, ignore_ancestors)
project_reads_to_rank(project_mode, tree, project_rank, mapped_reads, read_ids, cluster_degree)
apply_min_sup_filter(tree, min_support, exclude, only_major)
write_results(tree, out_file, prefix_rank, show_path, list_reads)
parse_tree: Takes the taxonomy data and produces an empty taxonomy tree.
compute_lca_addresses: Computes the address of each node in the phylogenetic tree in both directions (ID to address and address to ID). The addresses are later used in the LCA algorithm.
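To illustrate why addresses are useful, the sketch below assigns every node the tuple of child indices along its path from the root, so that the LCA of several nodes is the node at their longest common address prefix. The address encoding, the helper names and the toy tree are assumptions made for illustration; pygan's actual encoding may differ.

```python
def compute_addresses(children, root):
    """Map each node to an address (child indices from the root) and back."""
    id2address, address2id = {}, {}
    stack = [(root, ())]
    while stack:
        node, address = stack.pop()
        id2address[node] = address
        address2id[address] = node
        for index, child in enumerate(children.get(node, [])):
            stack.append((child, address + (index,)))
    return id2address, address2id


def lca(taxon_ids, id2address, address2id):
    """Return the node at the longest common prefix of the given addresses."""
    addresses = [id2address[taxon] for taxon in taxon_ids]
    prefix = addresses[0]
    for address in addresses[1:]:
        common = 0
        for a, b in zip(prefix, address):
            if a != b:
                break
            common += 1
        prefix = prefix[:common]
    return address2id[prefix]


# Toy taxonomy (hypothetical, heavily simplified).
children = {
    "root": ["Bacteria", "Archaea"],
    "Bacteria": ["Enterobacteriaceae"],
    "Enterobacteriaceae": ["Escherichia", "Salmonella"],
}
id2address, address2id = compute_addresses(children, "root")
print(lca(["Escherichia", "Salmonella"], id2address, address2id))
print(lca(["Escherichia", "Archaea"], id2address, address2id))
```

With both lookup directions precomputed, each LCA query reduces to a prefix comparison and one dictionary lookup, without walking the tree.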
parse_blast_filter: Parses alignment data from a tabular text file and applies the top score filter while parsing. To apply the top score filter separately after parsing, refer to parse_blast_with_score. Manually apply the top score filter with filter_reads_by_top_score.
map_accessions: Maps accessions to taxa of the phylogenetic tree. The key must correspond to the specified taxonomy. Adjust db_segment_size as necessary. Use map_accessions_with_scores to map reads that have not been filtered yet.
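The effect of a segment size can be sketched with Python's built-in sqlite3 module: accessions are looked up in batches instead of one query per accession. The table name mappings, its columns and the toy rows are assumptions made for this example (built in memory here), and map_in_segments is a hypothetical helper, not the pygan function.

```python
import sqlite3


def map_in_segments(conn, accessions, segment_size, db_key="Taxonomy"):
    """Look up taxon IDs for accessions, segment_size accessions per query."""
    # Column names cannot be bound as parameters, so check against a whitelist.
    if db_key not in ("Taxonomy", "gtdb"):
        raise ValueError(f"unsupported db_key: {db_key!r}")
    accession2taxon = {}
    for start in range(0, len(accessions), segment_size):
        segment = accessions[start:start + segment_size]
        placeholders = ",".join("?" * len(segment))
        query = (f"SELECT Accession, {db_key} FROM mappings "
                 f"WHERE Accession IN ({placeholders})")
        accession2taxon.update(conn.execute(query, segment).fetchall())
    return accession2taxon


# Toy in-memory database with an assumed schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mappings "
             "(Accession TEXT PRIMARY KEY, Taxonomy INTEGER, gtdb INTEGER)")
conn.executemany("INSERT INTO mappings VALUES (?, ?, ?)",
                 [("ACC_A", 562, None), ("ACC_B", 590, None), ("ACC_C", 2157, None)])

mapped = map_in_segments(conn, ["ACC_A", "ACC_B", "ACC_C", "ACC_X"], segment_size=2)
print(mapped)
```

Larger segments mean fewer round trips to the database but longer statements. SQLite also limits the number of bound parameters per statement, which is one reason the best value can differ between setups.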
map_lcas: Populates the taxonomy with reads by applying the LCA algorithm.
project_reads_to_rank: Attempts to contain all reads only in nodes of a specified rank. Available methods are 'accession', 'proportional', and 'mixed'.
apply_min_sup_filter: Applies the minimum support filter to the tree. Nodes that have fewer reads than the minimum support limit forfeit their reads to their parent nodes.
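The forfeiting step can be sketched on a toy tree as follows. This is a simplified illustration under stated assumptions: it compares each node's own read count with the limit, visits the deepest nodes first so that forfeited reads can cascade upwards, and leaves out the exclude and only_major options. The function apply_min_support and the data are hypothetical.

```python
def apply_min_support(parent, reads, min_support):
    """Move the reads of under-supported nodes to their parents, leaves first.

    parent maps node -> parent (the root maps to None);
    reads maps node -> number of reads assigned to that node.
    """
    def depth(node):
        steps = 0
        while parent[node] is not None:
            node = parent[node]
            steps += 1
        return steps

    result = dict(reads)
    # Deepest nodes first, so a parent is judged after its children forfeit.
    for node in sorted(parent, key=depth, reverse=True):
        count = result.get(node, 0)
        if parent[node] is not None and 0 < count < min_support:
            result[parent[node]] = result.get(parent[node], 0) + count
            result[node] = 0
    return result


parent = {"root": None, "Bacteria": "root",
          "Escherichia": "Bacteria", "Salmonella": "Bacteria"}
reads = {"Escherichia": 150, "Salmonella": 40, "Bacteria": 70}
filtered = apply_min_support(parent, reads, min_support=100)
print(filtered)
```

In this toy run Salmonella's 40 reads move up to Bacteria, which then holds 110 reads and passes the limit itself. The total number of reads is unchanged.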
write_results: Generates plain text output of the taxonomy. Can prefix an abbreviation of the rank, show the entire path to the node, and either list all read IDs or just show their number.
save_to_bin and load_from_bin allow for (de)serialization of data. This may be useful to avoid repeating accession mappings or to store partial results of the analysis. Use timer to time your analysis.
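The idea of storing a partial result and reusing it can be sketched with the standard library alone. The example below uses pickle and time.perf_counter as stand-ins; it does not call save_to_bin, load_from_bin or timer, whose signatures and file format are not described here, and the helper cached is hypothetical.

```python
import os
import pickle
import tempfile
import time


def cached(path, compute):
    """Load a pickled result from path, or compute it and store it first."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive_mapping():
    """Stand-in for a slow step such as the accession mapping."""
    calls.append(1)
    return {"ACC_A": 562}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "mapped_reads.bin")
    start = time.perf_counter()
    first = cached(cache_file, expensive_mapping)   # computed and stored
    second = cached(cache_file, expensive_mapping)  # loaded from disk
    elapsed = time.perf_counter() - start

print(first == second, len(calls), f"{elapsed:.4f}s")
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.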