Goal: cluster the sequences (those containing the Macro domain) HIERARCHICALLY in an optimal way. Start with the Macro family, then repeat for other families.
The result should look like a phylogenetic tree.
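
To make the goal concrete, here is a minimal sketch of bottom-up (agglomerative) hierarchical clustering that produces a tree-like grouping. It is not the method these notes settle on: the 2-mer frequency profile, the Manhattan distance, the average linkage, and the four toy sequences are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def kmer_profile(seq, k=2):
    """Relative frequency of each overlapping k-mer in the sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def distance(p, q):
    """Manhattan distance between two sparse k-mer profiles."""
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in set(p) | set(q))


def average_linkage(names, profiles):
    """Agglomerative clustering; returns the tree as nested tuples."""
    # Each cluster is (subtree, list of member indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    dist = {(i, j): distance(profiles[i], profiles[j])
            for i, j in combinations(range(len(names)), 2)}

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(min(i, j), max(i, j)) for i in a[1] for j in b[1]]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical toy sequences, not taken from any of the datasets.
names = ["seqA", "seqB", "seqC", "seqD"]
sequences = ["MKVLAAGIVK", "MKVLAAGIVR", "GSSGSSGPPG", "GSSGSSGPPA"]
tree = average_linkage(names, [kmer_profile(s) for s in sequences])
print(tree)
```

On the toy input, seqA ends up grouped with seqB and seqC with seqD before the two pairs are joined at the root. This naive version recomputes linkages at every step, so it only suits small tests; a library implementation would be needed for datasets of the sizes involved here.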
Input data:
- Amino acids only
- Amino acids annotated with the domain they belong to, if any.
Dataset | Num. sequences | Link |
---|---|---|
Pfam | 8,832 | https://pfam.xfam.org/family/Macro |
Uniprot | 39,133 | https://www.uniprot.org/uniprot/?query=macro |
Dataset (26 Feb 2020) | Num. sequences | Compressed | Uncompressed | Description |
---|---|---|---|---|
UniProtKB Varsplic | 40,255 | 8 MB | 28 MB | For small tests |
UniProtKB Swissprot | 561,911 | 85 MB | 264 MB | Manually annotated and reviewed |
UniProtKB TrEMBL | 177,754,527 | 39.2 GB | | Automatically annotated and not reviewed |
UniRef50 | 39,178,216 | 7.3 GB | | Up to 50% similarity. |
UniRef90 | 107,153,647 | 23.1 GB | | Up to 90% similarity. |
UniRef100 | 216,491,817 | 51.1 GB | | All sequences. |
UniParc | 310,472,414 | 62.3 GB | | Everything. |
Pfam (release 27.0) | 21,827,419 | | | Sequence + domain. 16,479 families. |
Protein Data Bank (PDB) | 160,000 | | | Sequences + 3D structure |
- Links
  - UniProtKB: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/
  - UniRef: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/
  - UniParc: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/
  - Pfam: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release
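
The sequence datasets above are typically distributed as gzip-compressed FASTA files. A minimal sketch of streaming such a file without loading it into memory follows. The file name `uniprot_sprot.fasta.gz` is an assumption about what the UniProtKB directory contains, and `read_fasta` and `count_sequences` are hypothetical helper names.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(path):
    """Count the records in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in read_fasta(handle))


# Small in-memory demo with made-up records.
demo = io.StringIO(">a example\nMKV\nLA\n>b\nGSS\n")
records = list(read_fasta(demo))
print(records)

# Assumed file name; download it first from the UniProtKB link.
# print(count_sequences("uniprot_sprot.fasta.gz"))
```

Because `read_fasta` is a generator, the same loop should work for the multi-gigabyte files, at the cost of a single sequential pass.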
- CountVectorizer: count how many times each letter (amino acid) appears.
- Term Frequency (TF): count how many times each letter appears and divide by the length of the sequence.
- TF-IDF: Term-Frequency times Inverse Document-Frequency.
- N-grams
- N-grams with noise
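
The first four encodings above can be sketched in plain Python as follows. This is an illustration rather than the implementation used in these notes: the 20-letter alphabet, the function names, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` (one common variant) are assumptions. N-grams with noise are left out because the notes do not define the noise.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues


def count_vector(seq):
    """Count how many times each amino acid appears."""
    counts = Counter(seq)
    return [counts[aa] for aa in AMINO_ACIDS]


def term_frequency(seq):
    """Counts divided by the sequence length."""
    return [c / len(seq) for c in count_vector(seq)]


def tf_idf(seqs):
    """TF times a smoothed inverse document frequency."""
    n = len(seqs)
    # Document frequency: number of sequences containing each residue.
    df = [sum(1 for s in seqs if aa in s) for aa in AMINO_ACIDS]
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    return [[tf * w for tf, w in zip(term_frequency(s), idf)] for s in seqs]


def ngrams(seq, n=3):
    """Overlapping n-grams (k-mers) of the sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


# Made-up toy sequences.
seqs = ["AAC", "AAG"]
print(count_vector(seqs[0])[:3])
print(term_frequency(seqs[0])[:3])
print(tf_idf(seqs)[0][:3])
print(ngrams("MKVLA"))
```

Residues that occur in every sequence get an IDF of 1 under this formula, so their TF-IDF equals their TF, while rarer residues are weighted up.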
- Convolutional:
- Recurrent:
  - Simple RNN
  - GRU
  - LSTM
  - With DropConnect
  - mLSTM
  - AWD-LSTM: Regular LSTM with extra dropouts. Used in ULMFiT
  - QRNN: Quasi-Recurrent Neural Networks. Used in MultiFiT
- Transformers:
- Recurrent networks -> Language Model (LM) -> Predict the next amino acid (see ULMFiT)
- Transformers -> Masked Language Model (MLM) -> Predict the hidden amino acid (see BERT)
- Transformers -> Next Sentence Prediction (NSP) -> Predict whether two subsequences are consecutive or not (see BERT)
- Transformers -> Replaced Token Detection (RTD) -> Predict whether each amino acid is real or replaced (see ELECTRA)
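
To illustrate how training examples for the first two objectives could be built from a raw sequence, here is a simplified sketch. The mask symbol `#`, the helper names, and the plain 15% masking rate are assumptions. BERT's actual recipe also sometimes swaps in a random token or keeps the original, which is omitted here.

```python
import random

MASK = "#"  # hypothetical mask symbol, not part of the amino acid alphabet


def next_token_pairs(seq):
    """LM objective: (prefix, next amino acid) training pairs."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


def mask_sequence(seq, mask_prob=0.15, seed=0):
    """MLM objective: hide random positions and remember the answers."""
    rng = random.Random(seed)
    tokens = list(seq)
    targets = {}
    for i, aa in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = aa   # label the model must recover
            tokens[i] = MASK  # what the model actually sees
    return "".join(tokens), targets


# Made-up toy sequence.
sequence = "MKVLAAGIVK"
print(next_token_pairs(sequence)[:3])
masked, targets = mask_sequence(sequence, mask_prob=0.5, seed=1)
print(masked, targets)
```

In both cases the labels come from the sequence itself, which is why these objectives can be trained on unannotated datasets.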
More papers on https://github.com/yangkky/Machine-learning-for-proteins
- 3D shape of the protein
- Ultra-Deep Learning Model (2016)
- Protein secondary structure detection (Jul 2019)
- AlphaFold: Predict the 3D shape from the sequence
- Paper in Nature (Jan 2020)
- Paper in Proteins (Sep 2019)
- Amino acid sequence of the protein
- DeepDom: Predict protein domain boundaries (Jan 2019)
BiLSTM
- Learning protein sequence embeddings using information from structure (Feb 2019)
BiLSTM
Unsupervised
- UniRep: Detect protein properties (Mar 2019)
mLSTM
Unsupervised
- Twitter summary
- Blog summary
- Paper
- Code (TensorFlow)
- Biological Structure and Function Emerge ... (Apr 2019, FAIR)
Transformer
Unsupervised
⭐ - TAPE: Evaluating Protein Transfer Learning (Jun 2019) ⭐
- Blog summary
- Paper
- Code (PyTorch)
- Models on library
$ pip install tape_proteins
- UDSMProt: Detect protein properties (Sep 2019)
AWD-LSTM
Unsupervised
Fast.ai
⭐ - Rosetta: Improved protein structure prediction using predicted inter-residue orientations (Nov 2019)
- PLUS: Pre-Training with Structural Information (Feb 2020)
BiLSTM
Transformer
Unsupervised
- Paper
- Github code
- Webpage: Data and models for download. ⬇️
- ProGen: Generate viable proteins based on user specs. (Mar 2020)
Transformer
Unsupervised
⭐
- DeepDom: Predict protein domain boundaries (Jan 2019)
- GAN:
- ProteinGAN (Oct 2019)