- Extract features from amino acid sequences for machine learning
- Use features to predict protein family and other structural properties
- anaconda3
- Python 3.4
- TensorFlow
- Keras
- joblib, for multiprocessing: install with `pip install joblib`
This project attempts to reproduce the results of Asgari (2015) and to extend them to phage sequences and their protein families. Currently, Asgari's protein family classification can be reproduced using his pretrained embedding. However, his results cannot yet be reproduced when the embedding is trained from scratch with the skip-gram negative sampling method detailed in this tutorial. Training has been attempted with samples from the Swiss-Prot database.
Predicting protein function with machine learning requires informative features extracted from the data. Word2Vec, a natural language processing (NLP) technique, represents a word by its context: each word is assigned a vector that encodes the probability of contexts occurring around that word. These vectors are effective at representing meaning because words with similar meanings appear in similar contexts. For example, "cat" and "kitten" are used in very similar contexts, so they end up with very similar vectors.
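A toy illustration of this idea using gensim's Word2Vec (gensim 4 is assumed here and is not part of this project; the two-sentence corpus is far too small to learn meaningful vectors and only shows the API shape):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "kitten", "sat", "on", "the", "rug"]]

# sg=1, negative=5 -> skip-gram with negative sampling, as used below
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, negative=5)
print(model.wv.similarity("cat", "kitten"))  # cosine similarity of the vectors
```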
- Preprocessing
- Load dataset containing protein amino acid sequences and Asgari's embedding
- Convert each sequence into three shifted lists of non-overlapping 3-mer words
- Convert 3-mers to a numerical encoding using the k-mer indices from Asgari's embedding (its row dimension)
- Generate skip-grams with the Keras `skipgrams` function (see the sketch after this list)
  - Output: (target word, context word, label) triples
  - The label marks whether a target/context pair is a true pair (1) or a pair generated for the negative sampling technique (0)
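A minimal sketch of these preprocessing steps, assuming a recent TensorFlow/Keras. The example sequence and the toy k-mer index are placeholders for the project's Swiss-Prot data and the k-mer-to-row mapping of Asgari's embedding:

```python
from tensorflow.keras.preprocessing.sequence import skipgrams

def to_kmer_lists(seq, k=3):
    """Split a sequence into k shifted lists of non-overlapping k-mers."""
    return [[seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
            for offset in range(k)]

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example sequence

# Toy index built from this one sequence; in the project this would come
# from Asgari's embedding. Index 0 is reserved because Keras' skipgrams
# treats 0 as a non-word.
vocab = sorted({seq[i:i + 3] for i in range(len(seq) - 2)})
kmer_index = {km: i + 1 for i, km in enumerate(vocab)}

for kmers in to_kmer_lists(seq):
    encoded = [kmer_index[km] for km in kmers if km in kmer_index]
    # pairs: (target, context) index pairs; labels: 1 = observed pair,
    # 0 = randomly drawn negative sample
    pairs, labels = skipgrams(encoded, vocabulary_size=len(kmer_index) + 1,
                              window_size=5, negative_samples=1.0)
```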
- Training embedding
- Create a skip-gram negative sampling model with Keras, using the technique from this tutorial (a minimal sketch follows)
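A minimal sketch of such a model, in the spirit of the linked tutorial. The shared embedding layer, dot-product score, and hyperparameters here are illustrative assumptions rather than the project's exact code:

```python
from tensorflow.keras.layers import Input, Embedding, Dot, Reshape, Dense
from tensorflow.keras.models import Model

vocab_size = 8000 + 1   # 20**3 possible 3-mers plus a reserved 0 index
embedding_dim = 100     # ProtVec uses 100-dimensional vectors

target_in = Input(shape=(1,))
context_in = Input(shape=(1,))
embedding = Embedding(vocab_size, embedding_dim, name="embedding")

target_vec = embedding(target_in)               # (batch, 1, embedding_dim)
context_vec = embedding(context_in)
score = Dot(axes=2)([target_vec, context_vec])  # similarity of the two vectors
score = Reshape((1,))(score)
output = Dense(1, activation="sigmoid")(score)  # P(pair is a true pair)

model = Model(inputs=[target_in, context_in], outputs=output)
model.compile(loss="binary_crossentropy", optimizer="adam")

# Train on the (target, context, label) triples from the skipgram step:
# model.fit([targets, contexts], labels, epochs=..., batch_size=...)
```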
- Generate ProtVecs from the embedding for a given protein sequence
- Break the protein sequence into lists of k-mers
- Convert each k-mer to a vector by taking the dot product of its one-hot vector with the embedding matrix (equivalent to selecting the k-mer's row)
- Sum the vectors of all k-mers to obtain a single length-100 vector representation of the protein (see the sketch below)
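A sketch of this summation, assuming `embedding_matrix` is a NumPy array of shape `(vocab_size, 100)` (e.g. the trained embedding layer's weights) and `kmer_index` is the same k-mer-to-row mapping as above; both names are placeholders:

```python
import numpy as np

def protvec(seq, embedding_matrix, kmer_index, k=3):
    """Sum the embedding vectors of the non-overlapping k-mers in all k frames."""
    vec = np.zeros(embedding_matrix.shape[1])
    for offset in range(k):                           # three reading frames
        for i in range(offset, len(seq) - k + 1, k):  # non-overlapping k-mers
            km = seq[i:i + k]
            if km in kmer_index:
                # One-hot dot product == selecting the k-mer's embedding row
                vec += embedding_matrix[kmer_index[km]]
    return vec                                        # length-100 ProtVec
```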
- Classify protein function with ProtVec features (results currently not reproducing; refer to the R script)
- Use ProtVecs as training features
- Use Pfam families as labels
- For a given Pfam family, perform binary classification using all of its positive samples and an equal number of randomly sampled negative samples
- Train an SVM model (a sketch follows)
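A sketch of this balanced binary classification using scikit-learn (an assumption; it is not listed in the requirements above). `X` is the `(n_proteins, 100)` ProtVec matrix and `pfam_labels` holds each protein's family; both names are placeholders:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def family_classification_score(X, pfam_labels, family, seed=0):
    """Balanced one-vs-rest SVM for one Pfam family; returns mean CV accuracy."""
    pfam_labels = np.asarray(pfam_labels)
    rng = np.random.default_rng(seed)
    pos = np.where(pfam_labels == family)[0]
    neg = rng.choice(np.where(pfam_labels != family)[0],
                     size=len(pos), replace=False)    # balance the classes
    idx = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    scores = cross_val_score(SVC(kernel="rbf"), X[idx], y, cv=5)
    return scores.mean()
```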
- Intuition behind Word2Vec http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
- Tutorial followed for implementation of skip-gram negative sampling (includes code) http://adventuresinmachinelearning.com/word2vec-keras-tutorial/
- Introduction to protein function prediction http://biofunctionprediction.org/cafa-targets/Introduction_to_protein_prediction.pdf
Mike Huang
huangjmike@gmail.com