Pinned Repositories
10genomes
These are the scripts from my phylogenetics paper. I don't think they will work as is; I just wanted to back them up. Please don't judge me for their inelegance: I made them when the only programming language I had any experience with was AWK, and not much of that.
Interactome
maize
Machine-learning-for-proteins
Listing of papers about machine learning for proteins.
merlin-p
A prior-based integrative framework for functional transcriptional regulatory network inference (Fotuhi Siahpirani & Roy, Nucleic Acids Research 2017)
personal-website
Code that'll help you kickstart a personal website that showcases your work as a software developer.
SequenceClusterScripts
A set of scripts to analyze the output of a clustering program, such as orthoMCL. In their current version, most of these scripts are adapted to our local environment and use JGI genome annotation.
SingleSplice
Algorithm for detecting alternative splicing in a population of single cells. See details in Welch et al., Nucleic Acids Research 2016: http://nar.oxfordjournals.org/content/early/2016/01/05/nar.gkv1525.full
SRW-PPINetworks
Supervised Random Walk in PPI Networks
maggishaggy's Repositories
maggishaggy/2018-11-EMBORome
Training materials from the EMBO course Computational analysis of protein-protein interactions: Sequences, networks and diseases, taking place from 05 to 10 November 2018 in Rome, Italy.
maggishaggy/ChemoGeneDetection
Pipeline for annotating chemosensory genes
maggishaggy/clusterDbAnalysis
ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes
maggishaggy/common_scripts
my bin directory
maggishaggy/cookbook-2nd
IPython Cookbook, Second Edition, by Cyrille Rossant, Packt Publishing 2018
maggishaggy/crank
Prioritizing network communities
maggishaggy/dash-network
D3 force-layout network diagram for Dash
maggishaggy/deeplearning-biology
A list of deep learning implementations in biology
maggishaggy/evoppi-backend
Backend for the EvoPPI application
maggishaggy/evoppi-frontend
Front-end for the EvoPPI application
maggishaggy/fly-interactome
Fly interactome compiled from a set of publicly-available sources.
maggishaggy/Integrative_network_analysis
The python scripts for the integrative framework that uses PPI and anatomy ontology data to study evolutionary phenotypic transitions
maggishaggy/iSAFE
Pinpoints the mutation favored by selection
maggishaggy/lgraph
Another graph library. Social networks, path-finding algorithms, graph generation, and more.
maggishaggy/Meta_QTL-LA-Mapping-Paper
This repository houses the data and scripts used in the paper by Dzievit et al. (Submitted to The Plant Genome Journal 5/5/2018). Additionally, it also houses the updatable version of the LA QTL identified in maize for other researchers to view and update.
maggishaggy/notebook_to_web
Generate a web-page like output from an exported HTML notebook
maggishaggy/Overlapping_Clustering_Testing
Detangling PPI Network for Overlapping Clusters
maggishaggy/pancanatlas_code_public
Public repository containing research code for the TCGA PanCanAtlas Splicing project
maggishaggy/phyphy
Python HyPhy: Facilitating HyPhy execution and parsing
maggishaggy/ppi-network-alternative
Embedding Alternative Conformations of Proteins in Protein-protein interaction networks
maggishaggy/PPI-Network-Analysis-3
Protein-protein interaction network constructed with STRING database
maggishaggy/ppi-network-annotation
maggishaggy/PPIN
Protein-Protein interaction network analysis pipeline
maggishaggy/pyvis
Python package for creating and visualizing interactive network graphs.
maggishaggy/RNA-seq-analysis
RNAseq analysis notes from Ming Tang
maggishaggy/snap
Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library.
maggishaggy/SteinerNet
R Package: Steiner Tree Approach for Graph Analysis
maggishaggy/SURFY2
This repository constitutes SURFY2 and corresponds to the bioRxiv preprint 'Updating the in silico human surfaceome with meta-ensemble learning and feature engineering' by Daniel Bojar.

SURFY2 is a machine learning classifier that predicts whether a human transmembrane protein is located at the surface of a cell (the plasma membrane) or in one of the intracellular membranes, based on the sequence characteristics of the protein. Making use of the data described in the recent publication from Bausch-Fluck et al. (https://doi.org/10.1073/pnas.1808790115), SURFY2 considerably improves on their reported classifier SURFY in terms of accuracy (95.5%), precision (94.3%), recall (97.6%) and area under ROC curve (0.954) when using a test set never seen by the classifier before.

SURFY2 consists of a layer of 12 base estimators generating 24 new engineered features (class probabilities for both classes), which are appended to the original 253 features. Then, a soft voting classifier with three optimized base estimators (Random Forest, Gradient Boosting and Logistic Regression) and optimized voting weights is trained on this expanded dataset, resulting in the final prediction.

The motivation of SURFY2 is to provide an updated and better version of the in silico human surfaceome to facilitate research and drug development on human surface-exposed transmembrane proteins. Additionally, SURFY2 enabled insights into biological properties of these proteins and generated several new hypotheses / ideas for experiments.

The workflow is as follows:
1) dataPrep: Gets training data from data.xlsx, labels it according to surface class and outputs 'train_data.csv'.
2) split: Gets train_data.csv, splits it into training, validation and test data and outputs 'train.csv', 'val.csv', 'test.csv'.
3) main_val: Was used for optimizing hyperparameters of base estimators and estimators & weights of voting classifier. Stores all estimators. Evaluates meta-ensemble classifier SURFY2 on validation set.
4) classifier_selection: All base estimators and meta-ensemble approaches are tested on the initial dataset as well as the expanded dataset including the engineered features and compared in terms of their cross-validation score.
5) main_test: Evaluates SURFY2 on the separate test set (trained on training + validation set).
6) testing_SURFY: Evaluates the original SURFY through cross-validation and on validation as well as test set.
7) pred_unlabeled: Uses SURFY2 to predict the surface label (+ prediction score) for unlabeled proteins in data.xlsx. Also gets the feature importances of the voting classifier estimators.
8) getting_discrepancies: Compares predictions with those made by SURFY ('surfy.xlsx') and stores mismatches. Also stores the 10 most confident mismatches (by SURFY2 classification score) from each class.
9) feature_importances: Plots the 10 most important features for the voting classifier estimators (Random Forest, Gradient Boosting, Logistic Regression) to interpret predictions.
10) base_estimator_importances: Plots the 10 most important features for the two most important base estimators (XGBClassifier and Gradient Boosting).
11) comparing_mismatches: Separates datasets into shared & discrepant predictions (between SURFY and SURFY2). Compares feature means and selects features with the highest class feature mean differences between prediction datasets. Statistically analyzes differences in feature means between classes in both prediction datasets. Plots 9 representative features with their means grouped according to class and prediction dataset to rationalize discrepant predictions.
12) tSNE_surfy2: Performs nonlinear dimensionality reduction using t-SNE on proteins with predictions from both SURFY and SURFY2. Plots the two t-SNE dimensions and labels the proteins according to their prediction class in order to see where discrepant predictions reside in the landscape. Plots surface proteins with the most prevalent annotated functional subclasses and labels them according to their subclass to enable comparison to class predictions. Functional annotations came from 'surfy.xlsx'.
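The two mechanical steps of the SURFY2 description (appending base-estimator class probabilities as engineered features, then combining classifier outputs by weighted soft voting) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the repository's code: the function names expand_features and soft_vote are hypothetical, random numbers stand in for the probabilities that trained base estimators would produce, and the voting weights are made up rather than the optimized ones.

```python
import numpy as np


def expand_features(X, base_probas):
    """Append each base estimator's class probabilities to the original features."""
    X = np.asarray(X, dtype=float)
    blocks = [np.asarray(p, dtype=float) for p in base_probas]
    for p in blocks:
        if p.shape[0] != X.shape[0]:
            raise ValueError("each probability block needs one row per sample")
    return np.hstack([X] + blocks)


def soft_vote(probas, weights):
    """Weighted average of class probabilities, then the most probable class per sample."""
    # Shape after stacking: (n_estimators, n_samples, n_classes)
    stacked = np.stack([np.asarray(p, dtype=float) for p in probas])
    avg = np.average(stacked, axis=0, weights=weights)
    return avg, avg.argmax(axis=1)


# Toy stand-in data: 5 proteins with 253 original features (values are random).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 253))

# 12 hypothetical base estimators, each giving 2 class probabilities per protein.
base_probas = []
for _ in range(12):
    p_surface = rng.uniform(size=5)
    base_probas.append(np.column_stack([1 - p_surface, p_surface]))

X_expanded = expand_features(X, base_probas)
print(X_expanded.shape)  # (5, 277): 253 original + 24 engineered features

# Soft voting over three classifiers for two proteins (illustrative numbers).
rf = [[0.8, 0.2], [0.4, 0.6]]
gb = [[0.6, 0.4], [0.3, 0.7]]
lr = [[0.3, 0.7], [0.6, 0.4]]
avg, labels = soft_vote([rf, gb, lr], weights=[2, 1, 1])
print(labels.tolist())  # [0, 1]
```

With real models, the probability blocks would typically come from each fitted estimator's predicted class probabilities, and the voting stage would be trained on the expanded matrix rather than fed fixed numbers as here.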
maggishaggy/sysSVM
Patient-specific cancer driver prediction using support vector machines and systems biology
maggishaggy/wgd
Python package and CLI for whole genome duplication analysis