TFIDFTextEncoder is a class that wraps the text embedding functionality of a TF-IDF model.
The TF-IDF model is a classic vector representation for information retrieval.
TFIDFTextEncoder encodes the data from a DocumentArray and updates the doc.embedding attribute of each doc in the DocumentArray with a scipy.sparse.csr_matrix of floating point values.
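To make the representation concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It is not the encoder's implementation: the function name tfidf_vectors is hypothetical, and whitespace tokenization is a simplification. The weighting follows the smoothed IDF formula and L2 normalization that scikit-learn's TfidfVectorizer applies by default.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """Return (vocabulary, vectors) for a list of texts.

    Hypothetical helper for illustration only. Uses smoothed IDF,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, then L2-normalizes each row.
    """
    # Simplification: lowercase and split on whitespace.
    docs = [text.lower().split() for text in corpus]
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(
    ["Han eats pizza", "Han eats pasta", "pizza pizza"]
)
```

Each vector has one entry per vocabulary term, and terms that occur in fewer documents (here "pasta") receive a higher weight than terms shared across documents. Most entries are zero for a realistic vocabulary, which is why a sparse matrix is a natural container for the result.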
You need a pretrained TF-IDF vectorizer.
The TFIDFTextEncoder uses a sklearn.feature_extraction.text.TfidfVectorizer object that must be fitted and stored as a pickle file, which the TFIDFTextEncoder loads from path_vectorizer. By default, path_vectorizer='model/tfidf_vectorizer.pickle'.
The following snippet can be used to fit a TfidfVectorizer with a toy corpus. To achieve better performance or to adapt the encoder to other languages, you can change the load_data function below to load any other user-specific dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle


def load_data():
    # Fetch the training split of the 20 newsgroups dataset.
    from sklearn.datasets import fetch_20newsgroups
    newsgroups_train = fetch_20newsgroups(subset='train')
    return newsgroups_train.data


if __name__ == '__main__':
    X = load_data()
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit(X)
    # Use a context manager so the file is closed after writing.
    with open('tfidf_vectorizer.pickle', 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
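Before starting the encoder, it can help to confirm that the pickled vectorizer is readable from the location the encoder expects. The sketch below is an assumption-laden convenience, not part of the executor: load_vectorizer is a hypothetical helper, and only its default argument mirrors the documented default of path_vectorizer.

```python
import pickle
from pathlib import Path


def load_vectorizer(path_vectorizer="model/tfidf_vectorizer.pickle"):
    """Unpickle a fitted vectorizer, failing early with a clear message.

    Hypothetical helper for illustration; the executor has its own loader.
    """
    path = Path(path_vectorizer)
    if not path.is_file():
        raise FileNotFoundError(
            f"No vectorizer found at {path}; fit one and save it there first."
        )
    with path.open("rb") as f:
        return pickle.load(f)


# Example (assumes the file exists):
# vectorizer = load_vectorizer("tfidf_vectorizer.pickle")
```

Note that the fitting script writes tfidf_vectorizer.pickle to the current working directory, so the file has to be moved to model/ or its location passed as path_vectorizer. Only unpickle files you trust, since loading a pickle can execute arbitrary code.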
Use the prebuilt images from JinaHub in your Python code:
from jina import Flow
f = Flow().add(uses='jinahub+docker://TFIDFTextEncoder')
or in the .yml config:
jtype: Flow
pods:
  - name: encoder
    uses: 'jinahub+docker://TFIDFTextEncoder'
Use the source code from JinaHub in your Python code:
from jina import Flow
f = Flow().add(uses='jinahub://TFIDFTextEncoder')
or in the .yml config:
jtype: Flow
pods:
  - name: encoder
    uses: 'jinahub://TFIDFTextEncoder'
- Install the jinahub-executor-text-tfidfencoder package.

  pip install git+https://github.com/jina-ai/executor-text-tfidfencoder.git

- Use jinahub-executor-text-tfidfencoder in your code

  from jina import Flow
  from jinahub.encoder.executor-text-tfidfencoder import TFIDFTextEncoder

  f = Flow().add(uses=TFIDFTextEncoder)
- Clone the repo and build the docker image

  git clone https://github.com/jina-ai/executor-text-tfidfencoder.git
  cd executor-text-tfidfencoder
  docker build -t executor-text-tfidfencoder-image .

- Use executor-text-tfidfencoder-image in your code

  from jina import Flow

  f = Flow().add(uses='docker://executor-text-tfidfencoder-image:latest')
from jina import Flow, Document

f = Flow().add(uses='jinahub+docker://TFIDFTextEncoder')

with f:
    resp = f.post(inputs=Document(text='Han eats pizza'), return_results=True)
    print(f'{resp}')
Documents with text. By default, the input text must be a unicode string.
Documents with embedding fields filled with a scipy.sparse.csr_matrix whose number of columns is n_vocabulary, the size of the fitted vocabulary.
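As a rough illustration of what such an embedding looks like, the sketch below calls scikit-learn's TfidfVectorizer directly on a made-up three-sentence corpus. It bypasses the executor entirely, so it shows only the shape and sparsity of the vectorizer's output, under the assumption that the encoder stores that output unchanged.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus used only for illustration.
corpus = ["Han eats pizza", "Han eats pasta", "Leia drinks tea"]
vectorizer = TfidfVectorizer().fit(corpus)

# Encoding a single text yields one sparse row.
embedding = vectorizer.transform(["Han eats pizza"])
n_vocabulary = len(vectorizer.vocabulary_)

print(issparse(embedding), embedding.format)  # sparse container and its format
print(embedding.shape, n_vocabulary)          # one row, n_vocabulary columns
print(embedding.nnz)                          # number of non-zero weights
```

Only the terms present in the input text have non-zero weights. If a downstream step needs a dense array, embedding.toarray() converts the row at the cost of storing every zero explicitly.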