CBoW word embeddings (or an MSE-based variant), with the context vector computed by an attention mechanism
- Modify the model architecture for higher efficiency (possibly switch to torch.nn.Transformer in the future)
- Add GPU support (done)
- Fine-tuning (dropped: no GPU available anymore)
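The attention-computed context vector mentioned above can be sketched roughly as follows. This is a minimal NumPy illustration, not the project's actual code: the embedding table `E`, the dot-product scoring, and the choice of the mean context embedding as the attention query are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 10, 4
E = rng.normal(size=(vocab_size, dim))  # hypothetical embedding table

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_context(context_ids, query):
    """Context vector as an attention-weighted sum of context embeddings,
    instead of CBoW's plain average."""
    C = E[context_ids]        # (n_ctx, dim) context word embeddings
    scores = C @ query        # (n_ctx,) dot-product attention scores
    weights = softmax(scores) # (n_ctx,) weights summing to 1
    return weights @ C        # (dim,) weighted context vector

context = [1, 3, 5, 7]
q = E[context].mean(axis=0)   # assumed query: mean of context embeddings
ctx_vec = attention_context(context, q)
```

The resulting `ctx_vec` would then feed the CBoW prediction head (or an MSE objective against the target word's embedding); in the uniform-attention limit it reduces to standard CBoW averaging.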