🧬 Clustering Proteins 🧬

Project for AI Saturdays Murcia

🎯 Goal

Cluster the sequences that contain the Macro domain HIERARCHICALLY, as well as possible. Start with the Macro family, then repeat for other families.

The result should look like a phylogenetic tree.
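As a rough illustration of the goal, here is a minimal sketch of average-linkage agglomerative clustering that builds a tree from a handful of sequences. The feature used (amino acid composition) and all function names are assumptions made for this example, not the project's chosen method; the pairwise search is far too slow for the dataset sizes listed below.

```python
from collections import Counter
from itertools import combinations

def composition(seq):
    """Amino acid frequency vector as a dict (an assumed, very simple feature)."""
    counts = Counter(seq)
    return {aa: c / len(seq) for aa, c in counts.items()}

def distance(a, b):
    """Euclidean distance between two sparse frequency vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys) ** 0.5

def hierarchical_clustering(seqs):
    """Average-linkage agglomerative clustering.

    Returns a tree of nested tuples whose leaves are sequence indices.
    """
    feats = [composition(s) for s in seqs]
    # Each cluster is (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(seqs))]
    while len(clusters) > 1:
        best = None
        for (i, (_, mi)), (j, (_, mj)) in combinations(enumerate(clusters), 2):
            # Average distance over all cross-cluster pairs.
            d = sum(distance(feats[x], feats[y]) for x in mi for y in mj)
            d /= len(mi) * len(mj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

tree = hierarchical_clustering(["AAAA", "AAAL", "KKKK", "KKRR"])
print(tree)
```

For real data, a library routine such as `scipy.cluster.hierarchy.linkage` would likely be a better fit than this hand-written loop.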

💾 Data

Input data:

Amino acids only
Amino acids annotated with the domain they belong to, if any.

Proteins containing the Macro domain

Dataset	Num. sequences	Link
Pfam	8,832	https://pfam.xfam.org/family/Macro
Uniprot	39,133	https://www.uniprot.org/uniprot/?query=macro

All proteins

Dataset (26 Feb 2020)	Num. sequences	Compressed	Uncompressed	Description
UniProtKB Varsplic	40,255	8 MB	28 MB	For small tests
UniProtKB Swissprot	561,911	85 MB	264 MB	Manually annotated and reviewed
UniProtKB TrEMBL	177,754,527	39.2 GB		Automatically annotated and not reviewed
UniRef50	39,178,216	7.3 GB		Clustered at 50% sequence identity.
UniRef90	107,153,647	23.1 GB		Clustered at 90% sequence identity.
UniRef100	216,491,817	51.1 GB		All sequences.
UniParc	310,472,414	62.3 GB		Everything, from all sources.
Pfam (release 27.0)	21,827,419			Sequence + domain. 16,479 families.
Protein Data Bank (PDB)	160,000			Sequences + 3D structure

Links
- UniProtKB: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/
- UniRef: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/
- UniParc: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/
- Pfam: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release
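A minimal sketch of reading sequences from one of these downloads, assuming the file is a gzip-compressed FASTA file (the format is an assumption here; the list above does not state it). The function names and the example file name are hypothetical.

```python
import gzip

def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def read_fasta_gz(path):
    """Stream records from a gzip-compressed FASTA file without loading it all."""
    with gzip.open(path, "rt") as handle:
        yield from parse_fasta(handle)

# Usage (hypothetical local file name):
# for header, seq in read_fasta_gz("uniprot_sprot.fasta.gz"):
#     print(header, len(seq))
```

Streaming record by record matters for the larger datasets, which are tens of gigabytes even when compressed.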

🖥️ Methods

Non-Deep Learning

CountVectorizer: count how many times each letter (amino acid) appears.
Term Frequency (TF): count how many times each letter appears and divide by the sequence length.
TF-IDF: Term Frequency times Inverse Document Frequency.
N-grams
N-grams with noise
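The counting-based features above can be sketched in plain Python as follows. The function names are made up for this example, and the IDF uses the plain `log(N / df)` form, which is one of several common variants; treat it as an illustration rather than the project's implementation.

```python
from collections import Counter
import math

def ngrams(seq, n=1):
    """Overlapping n-grams of a sequence; n=1 gives single amino acids."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def count_vector(seq, n=1):
    """Count how many times each n-gram appears."""
    return Counter(ngrams(seq, n))

def term_frequency(seq, n=1):
    """Counts divided by the number of n-grams (the sequence length when n=1)."""
    counts = count_vector(seq, n)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def tf_idf(seqs, n=1):
    """TF times log(N / df), where df is the number of sequences containing the n-gram."""
    tfs = [term_frequency(s, n) for s in seqs]
    df = Counter(g for tf in tfs for g in tf)
    idf = {g: math.log(len(seqs) / d) for g, d in df.items()}
    return [{g: v * idf[g] for g, v in tf.items()} for tf in tfs]

print(count_vector("MKKL"))
print(term_frequency("MKKL"))
print(ngrams("MKKL", 2))
```

scikit-learn's `CountVectorizer` and `TfidfVectorizer` with `analyzer="char"` and a suitable `ngram_range` cover the same ground; note that they lowercase input by default and that `TfidfVectorizer` uses a smoothed IDF, so its values differ from this sketch.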

Deep Learning

Convolutional:
Recurrent:
- Simple RNN
- GRU
- LSTM
  - With DropConnect
- mLSTM
- AWD-LSTM: Regular LSTM with extra dropouts. Used in ULMFiT
- QRNN: Quasi-Recurrent Neural Networks. Used in MultiFiT
Transformers:
- Attention: paper, video (Jun 2017)
- BERT: paper, video (Oct 2018)
- Transformer-XL: paper (Jan 2019)
- XLNet: paper, video (Jun 2019)
- RoBERTa: paper, video (Jul 2019)
- BART: paper (Oct 2019)
- Reformer: paper, video, video 2 (Jan 2020)
- ELECTRA: paper, video (Feb 2020)

Unsupervised learning (deep learning only)

Recurrent networks -> Language Model (LM) -> Predict the next amino acid (see ULMFiT)
Transformers -> Masked Language Model (MLM) -> Predict the masked amino acid (see BERT)
Transformers -> Next Sentence Prediction (NSP) -> Predict whether two subsequences are consecutive (see BERT)
Transformers -> Replaced Token Detection (RTD) -> Predict whether each amino acid is real or replaced (see ELECTRA)
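To make these objectives concrete, the sketch below builds training examples for LM, MLM, and RTD from a raw sequence (NSP is left out). It is a simplification under stated assumptions: the mask symbol `#`, the 15% rate as a default, and the function names are chosen for this example, and real BERT and ELECTRA pipelines are more involved (BERT does not always substitute the mask token, and ELECTRA samples replacements from a small generator model rather than uniformly).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # hypothetical mask symbol, not a real amino acid code

def next_token_pairs(seq):
    """LM: each prefix is an input, the residue that follows is the target."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

def mask_sequence(seq, rate=0.15, rng=random):
    """MLM: hide some residues; targets map position -> original residue."""
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < rate:
            masked.append(MASK)
            targets[i] = aa
        else:
            masked.append(aa)
    return "".join(masked), targets

def corrupt_sequence(seq, rate=0.15, rng=random):
    """RTD: swap some residues for a different one; label 1 marks a replaced position."""
    out, labels = [], []
    for aa in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in AMINO_ACIDS if a != aa]))
            labels.append(1)
        else:
            out.append(aa)
            labels.append(0)
    return "".join(out), labels

print(next_token_pairs("MKL"))
print(mask_sequence("MKLVAAGGMKLVAAGG", rng=random.Random(0)))
print(corrupt_sequence("MKLVAAGGMKLVAAGG", rng=random.Random(0)))
```

None of the three needs labels beyond the sequence itself, which is why they suit the large unannotated datasets listed earlier.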

Biology Papers

More papers on https://github.com/yangkky/Machine-learning-for-proteins

3D shape of the protein
- Ultra-Deep Learning Model (2016)
- Protein secondary structure detection (Jul 2019)
- AlphaFold: predict the 3D shape from the sequence
  - Paper in Nature (Jan 2020)
  - Paper in Proteins (Sep 2019)
Amino acid sequence of the protein
- DeepDom: Predict protein domain boundaries (Jan 2019) BiLSTM
  - Paper
- Learning protein sequence embeddings using information from structure (Feb 2019) BiLSTM Unsupervised
  - Paper
- UniRep: Detect protein properties (Mar 2019) mLSTM Unsupervised
  - Twitter summary
  - Blog summary
  - Paper
  - Code (Tensorflow)
- Biological Structure and Function Emerge ... (Apr 2019, FAIR) Transformer Unsupervised ⭐
- TAPE: Evaluating Protein Transfer Learning (Jun 2019) ⭐
  - Blog summary
  - Paper
  - Code (Pytorch)
  - Models available as a library: $ pip install tape_proteins
- UDSMProt: Detect protein properties (Sep 2019) AWD-LSTM Unsupervised Fast.ai ⭐
  - Paper
  - Github code
- Rosetta: Improved protein structure prediction using predicted inter-residue orientations (Nov 2019)
  - Paper
- PLUS: Pre-Training with Structural Information (Feb 2020) BiLSTM Transformer Unsupervised
  - Paper
  - Github code
  - Webpage: Data and models for download. ⬇️
- ProGen: Generate viable proteins based on user specs. (Mar 2020) Transformer Unsupervised ⭐
GAN:
- ProteinGAN (Oct 2019)

javiabellan/aisaturdays-proteins