ggavelis
Greetings! Here you'll find sample code that I'm migrating from my institutional GitHub accounts.
Bigelow Laboratory for Ocean Science, Boothbay, Maine
Pinned Repositories
eukaryotesV4
Curated database of the V4 region of 18S rDNA
gorg-classifier-v2
Recruit, classify, and annotate reads against the Global Ocean Reference Genome dataset of SAGs
GORG-HGT-bioinfo
automated_phylogenies
cc2022-cowsay
condense_InterProScan_annots
InterProScan is useful, but its annotations are multiline and redundant. This script collapses them into a single, human-readable line for each annotated sequence.
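As an illustration of the collapsing idea, here is a minimal sketch, not the repository's actual code. It assumes InterProScan's tab-separated output, where column 1 is the sequence ID, column 4 the analysis name, column 5 the signature accession, and column 6 the signature description. The function name `condense_annotations` and the `; `-joined output layout are hypothetical choices made for this example.

```python
from collections import OrderedDict


def condense_annotations(tsv_lines):
    """Collapse InterProScan TSV rows into one summary line per sequence.

    Hypothetical helper: the output layout (ID, tab, '; '-joined labels)
    is an assumption for illustration, not the repository's format.
    """
    per_seq = OrderedDict()  # keeps sequences in first-seen order
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip malformed rows
        seq_id, analysis, sig_acc, sig_desc = fields[0], fields[3], fields[4], fields[5]
        label = f"{analysis}:{sig_acc}"
        if sig_desc and sig_desc != "-":
            label += f" ({sig_desc})"
        labels = per_seq.setdefault(seq_id, [])
        if label not in labels:  # drop repeated hits of the same signature
            labels.append(label)
    return [f"{seq_id}\t{'; '.join(labels)}" for seq_id, labels in per_seq.items()]
```

Repeated hits of the same signature (for example, one Pfam domain matched at two positions) collapse to a single label, which is where most of the redundancy goes.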
HGT_v_Contamination_assessor
How can we discriminate contaminants from HGT? Alien indices (AI) are often used to screen out foreign sequences, but can 'overclean' by removing bona fide HGT. This script leverages metadata about each DNA/AA sequence (i.e. whether it is spliced, or has a polyA tail or spliced leader) to assess the extent to which alien-index-based cleaning is removing legitimate HGT.
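One way to picture the assessment is sketched below. This is an assumption-laden illustration, not the script's actual logic: the record keys (`ai_flagged`, `spliced`, `polya_tail`, `spliced_leader`) and the function name `overcleaning_rate` are hypothetical. It reports the fraction of alien-index-flagged sequences that nonetheless carry at least one eukaryotic transcript feature.

```python
def overcleaning_rate(records):
    """Fraction of AI-flagged sequences that show eukaryotic transcript features.

    Each record is a dict with boolean keys (all hypothetical names):
    'ai_flagged', 'spliced', 'polya_tail', 'spliced_leader'.
    """
    flagged = [r for r in records if r["ai_flagged"]]
    if not flagged:
        return 0.0  # nothing was flagged, so nothing could be overcleaned
    with_evidence = [
        r for r in flagged
        if r["spliced"] or r["polya_tail"] or r["spliced_leader"]
    ]
    return len(with_evidence) / len(flagged)
```

A high value would suggest that the alien-index threshold is discarding sequences that look like genuine host transcripts, and so may be removing real HGT candidates.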
infer_splice_variants_from_Trinity
Did you know that Trinity predicts splice variants? (Chrysalis works even for de novo transcriptomes, and its input, though heuristic, is valuable.) Likewise, TransDecoder can predict multiple ORFs per transcript, potentially capturing alternative splicing. These predictions are usually lost once we rename our proteins to shorter seqids. This script stores and abbreviates potential splicing info for later use, e.g. for discerning prokaryotic from eukaryotic transcripts.
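To show where this information lives, the sketch below parses the usual Trinity identifier layout (`TRINITY_DN<n>_c<n>_g<n>_i<n>`) together with the `.p<n>` suffix that TransDecoder appends to predicted ORFs. It is an illustration under those assumptions, not the repository's code; the function names and the returned dictionary layout are hypothetical.

```python
import re

# Trinity transcript ID, optionally followed by a TransDecoder ORF suffix (.p1, .p2, ...)
TRINITY_ID = re.compile(r"TRINITY_(DN\d+)_(c\d+)_(g\d+)_(i\d+)(?:\.(p\d+))?")


def abbreviate_splice_info(seqid):
    """Return gene, isoform, and ORF labels from a Trinity/TransDecoder ID, or None."""
    match = TRINITY_ID.search(seqid)
    if match is None:
        return None
    cluster, component, gene, isoform, orf = match.groups()
    return {"gene": f"{cluster}_{component}_{gene}", "isoform": isoform, "orf": orf}


def count_isoforms_per_gene(seqids):
    """Count distinct isoforms seen for each Trinity 'gene'."""
    isoforms = {}
    for seqid in seqids:
        info = abbreviate_splice_info(seqid)
        if info is not None:
            isoforms.setdefault(info["gene"], set()).add(info["isoform"])
    return {gene: len(found) for gene, found in isoforms.items()}
```

Genes with more than one isoform are candidates for alternative splicing, which is the kind of signal that disappears if seqids are shortened without recording it first.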
Protein_renamer
Tools to add phylogeny-ready names (including accession, genus, species, lineage & taxid) to protein fastas from any of: (A) GenBank, (B) SRA, or (C) Genome_paper_supp_data
tblastn_exon_stitcher
Need an ORF from an unannotated genome? This script exploits NCBI's ability to tBLASTn against genome assemblies to get provisional exon sets from BLAST query-hit alignments.
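The stitching step could look roughly like the sketch below. It is a simplified illustration, not the repository's algorithm: it assumes the tBLASTn hits (HSPs) for one query on one contig have already been parsed into coordinates and bit scores, and it greedily keeps the best-scoring HSPs that do not substantially overlap in query coordinates. The `Hsp` type, `stitch_exons`, and the `max_query_overlap` parameter are hypothetical names; strand and subject-order checks are deliberately left out.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    qstart: int      # start on the protein query (1-based, inclusive)
    qend: int        # end on the protein query
    sstart: int      # start on the genomic subject
    send: int        # end on the genomic subject
    bitscore: float


def stitch_exons(hsps, max_query_overlap=5):
    """Pick a provisional exon set: best-scoring HSPs that tile the query."""
    chosen = []
    for hsp in sorted(hsps, key=lambda h: h.bitscore, reverse=True):
        # Overlap length in query residues between this HSP and each kept HSP
        clashes = any(
            min(hsp.qend, kept.qend) - max(hsp.qstart, kept.qstart) + 1 > max_query_overlap
            for kept in chosen
        )
        if not clashes:
            chosen.append(hsp)
    # Report exons in query order, i.e. the order they contribute to the ORF
    return sorted(chosen, key=lambda h: h.qstart)
```

A small overlap is tolerated because tBLASTn alignments often run a few residues past the true exon boundary; the result is only provisional and would still need splice-site refinement.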
ggavelis's Repositories
ggavelis/automated_phylogenies
ggavelis/cc2022-cowsay
ggavelis/condense_InterProScan_annots
ggavelis/HGT_v_Contamination_assessor
ggavelis/infer_splice_variants_from_Trinity
ggavelis/Protein_renamer
ggavelis/tblastn_exon_stitcher