dinovski

New York Genome Center, New York, NY

dinovski's Stars

formbio/laava
LAAVA: Long-read AAV Analysis
Language:Python61
karpathy/minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
Language:Python9.1k838
odlp/bluesnooze
Sleeping Mac = Bluetooth off
Language:Swift2.1k57
mims-harvard/PrimeKG
Precision Medicine Knowledge Graph (PrimeKG)
Language:Jupyter Notebook38485
nadavbra/protein_bert
Language:Jupyter Notebook47998
Bribak/SURFY2
This repository constitutes SURFY2 and corresponds to the bioRxiv preprint 'Updating the in silico human surfaceome with meta-ensemble learning and feature engineering' by Daniel Bojar. SURFY2 is a machine learning classifier that predicts, from the sequence characteristics of a human transmembrane protein, whether the protein is located at the surface of a cell (the plasma membrane) or in one of the intracellular membranes. Making use of the data described in the recent publication from Bausch-Fluck et al. (https://doi.org/10.1073/pnas.1808790115), SURFY2 considerably improves on their reported classifier SURFY in terms of accuracy (95.5%), precision (94.3%), recall (97.6%) and area under ROC curve (0.954) on a test set never seen by the classifier before. SURFY2 consists of a layer of 12 base estimators generating 24 new engineered features (class probabilities for both classes), which are appended to the original 253 features. Then, a soft voting classifier with three optimized base estimators (Random Forest, Gradient Boosting and Logistic Regression) and optimized voting weights is trained on this expanded dataset, resulting in the final prediction. The motivation of SURFY2 is to provide an updated and better version of the in silico human surfaceome to facilitate research and drug development on human surface-exposed transmembrane proteins. Additionally, SURFY2 enabled insights into biological properties of these proteins and generated several new hypotheses and ideas for experiments. The workflow is as follows:
1) dataPrep: Gets training data from data.xlsx, labels it according to surface class and outputs 'train_data.csv'.
2) split: Gets train_data.csv, splits it into training, validation and test data and outputs 'train.csv', 'val.csv', 'test.csv'.
3) main_val: Was used for optimizing hyperparameters of base estimators and estimators & weights of voting classifier. Stores all estimators. Evaluates meta-ensemble classifier SURFY2 on validation set.
4) classifier_selection: All base estimators and meta-ensemble approaches are tested on the initial dataset as well as the expanded dataset including the engineered features and compared in terms of their cross-validation score.
5) main_test: Evaluates SURFY2 on the separate test set (trained on training + validation set).
6) testing_SURFY: Evaluates the original SURFY through cross-validation and on validation as well as test set.
7) pred_unlabeled: Uses SURFY2 to predict the surface label (+ prediction score) for unlabeled proteins in data.xlsx. Also gets the feature importances of the voting classifier estimators.
8) getting_discrepancies: Compares predictions with those made by SURFY ('surfy.xlsx') and stores mismatches. Also stores the 10 most confident mismatches (by SURFY2 classification score) from each class.
9) feature_importances: Plots the 10 most important features for the voting classifier estimators (Random Forest, Gradient Boosting, Logistic Regression) to interpret predictions.
10) base_estimator_importances: Plots the 10 most important features for the two most important base estimators (XGBClassifier and Gradient Boosting).
11) comparing_mismatches: Separates datasets into shared & discrepant predictions (between SURFY and SURFY2). Compares feature means and selects features with the highest class feature mean differences between prediction datasets. Statistically analyzes differences in feature means between classes in both prediction datasets. Plots 9 representative features with their means grouped according to class and prediction dataset to rationalize discrepant predictions.
12) tSNE_surfy2: Performs nonlinear dimensionality reduction using t-SNE on proteins with predictions from both SURFY and SURFY2. Plots the two t-SNE dimensions and labels the proteins according to their prediction class in order to see where discrepant predictions reside in the landscape. Plots surface proteins with most prevalent annotated functional subclasses and labels them according to their subclass to enable comparison to class predictions. Functional annotations came from 'surfy.xlsx'.
Language:Python54
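The SURFY2 description above mentions two ideas that a small example may make concrete: appending base-estimator class probabilities to the original features, and soft voting with weights. The following is a minimal pure-Python sketch of those two steps only; it is not the repository's code. The function names `expand_features` and `soft_vote`, and all probabilities and weights used with them, are hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple


def expand_features(features: Sequence[float],
                    base_probas: Sequence[Sequence[float]]) -> List[float]:
    """Append each base estimator's class probabilities to the original features.

    Hypothetical helper: with 253 original features and 12 base estimators
    that each emit 2 class probabilities, the result has 253 + 24 = 277 values.
    """
    expanded = list(features)
    for proba in base_probas:
        expanded.extend(proba)
    return expanded


def soft_vote(probas: Sequence[Sequence[float]],
              weights: Sequence[float]) -> Tuple[int, List[float]]:
    """Weighted soft voting: average the class probabilities, pick the argmax.

    `probas` holds one probability vector per estimator; `weights` holds one
    non-negative weight per estimator. Returns (predicted_class, averaged).
    """
    if len(probas) != len(weights):
        raise ValueError("need exactly one weight per estimator")
    total = sum(weights)
    n_classes = len(probas[0])
    averaged = [
        sum(w * p[c] for p, w in zip(probas, weights)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Illustrative (made-up) probabilities from three estimators for one protein.
example_probas = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
equal_label, equal_avg = soft_vote(example_probas, [1, 1, 1])
weighted_label, weighted_avg = soft_vote(example_probas, [1, 2, 2])
```

With equal weights the first estimator's confident vote carries the average, while the assumed weights `[1, 2, 2]` shift the decision to the other class, which is why the voting weights are worth tuning alongside the estimators.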
rochacbruno/python-project-template
DO NOT FORK, CLICK ON "Use this template" - A github template to start a Python Project - this uses github actions to generate your project based on the template.
Language:Makefile1.1k167
BilkentCompGen/hercules
Profile HMM-based hybrid error correction algorithm for long reads
Language:C++204
jdbrody/dna-fountain
DNA-Fountain
Language:Python7
a-slide/pycoQC
pycoQC computes metrics and generates interactive QC plots from the sequencing summary report generated by Oxford Nanopore Technologies basecallers (Albacore/Guppy)
Language:Python26041
WangLabTHU/DeSP
DNA-D2S: a systematic error simulation Model for DNA Data Storage channel
Language:Jupyter Notebook102
huw-morris-lab/PDD_GWSS
Language:R2
gsneha26/urWGS
Ultra rapid nanopore whole genome sequencing pipeline, published in https://www.nature.com/articles/s41587-022-01221-5
Language:Shell182
evidentlyai/evidently
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Language:Jupyter Notebook5.2k587
patrickloeber/ml-study-plan
The Ultimate FREE Machine Learning Study Plan
2.8k395
facebookresearch/esm
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
Language:Python3.2k627
gencorefacility/r-notebooks
Gene Set Enrichment Analysis and Over Representation Analysis using R
1725
parrt/msds621
Course notes for MSDS621 at Univ of San Francisco, introduction to machine learning
Language:Jupyter Notebook348175
brentp/vcfanno
annotate a VCF with other VCFs/BEDs/tabixed files
Language:Go36056
arvados/arvados
An open source platform for managing and analyzing biomedical big data
Language:Go378114
pgxcentre/genipe
Genome-wide imputation pipeline
Language:Python307
sysbio-curie/PROFILE
Repository of "Personalization of Logical Models With Multi-Omics Data Allows Clinical Stratification of Patients" paper
Language:HTML101
rstudio/cheatsheets
Posit Cheat Sheets - Can also be found at https://posit.co/resources/cheatsheets/.
Language:TeX5.8k1.8k
reinhardh/dna_rs_coding
Error correction scheme for storing information on DNA using Reed Solomon codes
Language:C++3011
Data4Democracy/ethics-resources
16532
ulfaslak/py_pcha
Python package that implements the PCHA algorithm for Archetypal Analysis by Mørup et al.
Language:Python3611
humiaozuzu/awesome-flask
A curated list of awesome Flask resources and plugins
12.2k1.6k
nyukat/breast_cancer_classifier
Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening
Language:Jupyter Notebook836268
rpsychologist/PubMed
DEPRECATED: PubMed datamining in R
Language:R4258
rstudio/reticulate
R Interface to Python
Language:R1.7k327
