Distributed Representation of Protein Sequences (Word2Vec)

RA2Vec

RA2Vec (Reduced Alphabet Embeddings) is a method for obtaining distributed representations of amino acid sequences, which can then be used in downstream machine learning tasks. We presented this work at the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics.

This repository contains:

  • source code for generating RA2Vec embeddings with different parameters,
  • a few sample models that can be used directly to convert sequences to embeddings, and
  • a demo notebook that shows how to use RA2Vec models to transform a protein sequence dataset into embeddings.
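
As a rough illustration of the preprocessing that a reduced-alphabet embedding relies on, the sketch below maps each amino acid to a group label and then splits the reduced sequence into shifted, non-overlapping k-mers, in the style described by Asgari and Mofrad. It is not the code from this repository: the grouping in `REDUCED_GROUPS` and the names `reduce_sequence` and `split_kmers` are hypothetical and chosen only for the example. The resulting k-mer lists are the kind of "sentences" that could then be passed to a Word2Vec implementation.

```python
# Hypothetical amino acid grouping, for illustration only; it is not
# necessarily one of the reduced alphabets used by RA2Vec.
REDUCED_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine and proline
}

# Invert the grouping: residue letter -> group label.
RESIDUE_TO_GROUP = {
    residue: label
    for label, members in REDUCED_GROUPS.items()
    for residue in members
}


def reduce_sequence(seq, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids map to `unknown`.
    """
    return "".join(RESIDUE_TO_GROUP.get(r, unknown) for r in seq.upper())


def split_kmers(seq, k=3):
    """Return k lists of non-overlapping k-mers, one per starting offset."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


reduced = reduce_sequence("MKVLDE")
sentences = split_kmers(reduced, k=3)
```

Each inner list in `sentences` plays the role of one sentence of "words" for embedding training. A smaller alphabet shrinks the k-mer vocabulary, which is the main reason to reduce the alphabet before training.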

Slides from my session on 'Key takeaways for Data Scientists from RA2Vec' are available here.

References:

  • Wijesekara, R.Y., Lahorkar, A., Rathore, K. and Valadi, J., 2020, September. RA2Vec: Distributed Representation of Protein Sequences with Reduced Alphabet Embeddings. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (pp. 1-1), DOI.
  • Asgari, E. and Mofrad, M.R., 2015. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one, 10(11), p.e0141287, DOI.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J., 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, DOI.