This is an implementation of the sentence embedding algorithm from the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings", written in Python 3 and applied to a Chinese corpus.
$ pip install -r requirements.txt
To get started, you need:
- A corpus for training the word2vec model and computing word frequencies.
- A corpus of sentences (the example here is a set of questions about tea, in Chinese).
Then:
- Configure the data paths in `process_data.py`.
- Run `process_data.py` to build a `dict` mapping each word to its frequency.
- Run `main.py` to run a test on a similarity task.
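
The word-frequency step can be pictured with a minimal sketch like the one below. It is not the code in `process_data.py`. The function name `build_word_freq` and the sample sentences are hypothetical, and the sketch assumes the corpus has already been segmented into tokens (raw Chinese text needs a word segmenter first).

```python
from collections import Counter


def build_word_freq(sentences):
    """Count how often each token occurs across tokenized sentences.

    `sentences` is an iterable of token lists (hypothetical input format).
    """
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return dict(counts)


# Hypothetical, already-segmented sample sentences about tea.
corpus = [["绿茶", "怎么", "泡"], ["红茶", "怎么", "保存"]]
word_freq = build_word_freq(corpus)
```

Dividing each count by the total number of tokens gives the word probability that the SIF weighting uses.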
- `process_data.py` provides the function that builds the `dict` from word to frequency for a corpus.
- `params.py` provides a class `Params` that packs the parameters into a single object.
- `sif_embedding.py` provides the functions that compute the weighted embedding and the SIF embedding for sentences, plus a demo of the similarity task.
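
For orientation, here is a rough sketch of the two stages described in the paper: a frequency-weighted average of word vectors, followed by removal of the first principal component. It is an illustration under stated assumptions, not the code in `sif_embedding.py`. All function names, the random toy vectors, and the value `a=1e-3` are assumptions made for the example, and NumPy is assumed to be installed.

```python
import numpy as np


def sif_weights(tokens, word_freq, total_count, a=1e-3):
    """SIF weight a / (a + p(w)) for each token; rarer words get larger weights."""
    return np.array([a / (a + word_freq.get(w, 0) / total_count) for w in tokens])


def weighted_average(sentences, vectors, word_freq, a=1e-3):
    """Weighted average of word vectors, one row per sentence."""
    total = sum(word_freq.values())
    dim = len(next(iter(vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        known = [w for w in tokens if w in vectors]  # skip out-of-vocabulary words
        if not known:
            continue
        weights = sif_weights(known, word_freq, total, a)
        matrix = np.array([vectors[w] for w in known])
        emb[i] = weights @ matrix / len(known)
    return emb


def remove_first_component(emb):
    """Subtract each row's projection onto the first right singular vector."""
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy data: random stand-ins for word2vec vectors (hypothetical values).
rng = np.random.default_rng(0)
word_freq = {"怎么": 2, "绿茶": 1, "红茶": 1, "泡": 1, "保存": 1}
vectors = {w: rng.normal(size=3) for w in word_freq}
sentences = [["绿茶", "怎么", "泡"], ["红茶", "怎么", "保存"], ["绿茶", "保存"]]

emb = weighted_average(sentences, vectors, word_freq)
sif = remove_first_component(emb)
similarity = cosine(sif[0], sif[1])
```

In a real run the vectors would come from the trained word2vec model and `word_freq` from the dictionary produced by `process_data.py`. The smoothing parameter `a` is a tunable value, and the one used here is only a placeholder.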