TFIDFTextEncoder

✨ TFIDFTextEncoder

TFIDFTextEncoder is a class that wraps the text embedding functionality of a TFIDF model.

TF-IDF is a classic vector representation of text used in information retrieval.

TFIDFTextEncoder encodes the text of each Document in a DocumentArray and sets its doc.embedding attribute to a scipy.sparse.csr_matrix of floating point values.
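To make the weighting concrete, here is a minimal pure-Python sketch of the textbook formula tf * log(N / df). It is for illustration only and is not the encoder's implementation: the tfidf helper and the toy corpus are assumptions, and sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Toy corpus (hypothetical): "han" occurs in every document, "pizza" in only one.
corpus = ["han eats pizza", "han eats pasta", "han drinks tea"]
weights = tfidf(corpus)
print(weights[0])
```

A term that appears in every document ("han") gets weight zero, while a term unique to one document ("pizza") gets the highest weight in that document.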

🌱 Prerequisites

You need a pretrained TF-IDF vectorizer.

Pretraining a TF-IDF Vectorizer

TFIDFTextEncoder uses a sklearn.feature_extraction.text.TfidfVectorizer object that must be fitted and stored as a pickle file, which TFIDFTextEncoder loads from path_vectorizer. By default, path_vectorizer='model/tfidf_vectorizer.pickle'.

The following snippet fits a TfidfVectorizer on a toy corpus. To improve performance or adapt the encoder to other languages, change the load_data function below to load your own dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

def load_data():
    from sklearn.datasets import fetch_20newsgroups
    newsgroups_train = fetch_20newsgroups(subset='train')
    return newsgroups_train.data

if __name__ == '__main__':
    X = load_data()
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit(X)
    # Save the fitted vectorizer; point path_vectorizer at this file
    with open('tfidf_vectorizer.pickle', 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
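Before wiring the pickle into a Flow, it can help to check that it loads and produces sparse output. The sketch below is one way to do that with plain sklearn. It assumes a tiny in-memory corpus and a temporary directory instead of the 20 Newsgroups data, and it does not call TFIDFTextEncoder itself.

```python
import os
import pickle
import tempfile

from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (an assumption for illustration only).
corpus = ["han eats pizza", "han eats pasta", "pizza is popular"]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Round-trip the fitted vectorizer through a pickle file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf_vectorizer.pickle")
    with open(path, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# transform() takes a list of strings and returns one sparse row per string.
embedding = loaded.transform(["han eats pizza"])
n_vocabulary = len(loaded.vocabulary_)
print(type(embedding).__name__, embedding.shape, n_vocabulary)
```

The row width equals the size of the fitted vocabulary, so a vectorizer fitted on a different corpus produces embeddings of a different width.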

🚀 Usage

🚚 Via JinaHub

Using Docker images

Use the prebuilt image from JinaHub in your Python code:

from jina import Flow

f = Flow().add(uses='jinahub+docker://TFIDFTextEncoder')

or in the .yml config:

jtype: Flow
pods:
  - name: encoder
    uses: 'jinahub+docker://TFIDFTextEncoder'

Using source code

Use the source code from JinaHub in your Python code:

from jina import Flow

f = Flow().add(uses='jinahub://TFIDFTextEncoder')

or in the .yml config:

jtype: Flow
pods:
  - name: encoder
    uses: 'jinahub://TFIDFTextEncoder'

📦️ Via PyPI

  1. Install the jinahub-executor-text-tfidfencoder package:

    pip install git+https://github.com/jina-ai/executor-text-tfidfencoder.git
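  2. Use jinahub-executor-text-tfidfencoder in your code:

    from jina import Flow
    # Hyphens are not valid in a Python module path, so the import below will not
    # parse as written; check the installed package for the actual module name.
    from jinahub.encoder.executor-text-tfidfencoder import TFIDFTextEncoder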
    
    f = Flow().add(uses=TFIDFTextEncoder)

🐳 Via Docker

  1. Clone the repo and build the Docker image:

    git clone https://github.com/jina-ai/executor-text-tfidfencoder.git
    cd executor-text-tfidfencoder
    docker build -t executor-text-tfidfencoder-image .
  2. Use executor-text-tfidfencoder in your code:

    from jina import Flow
    
    # The image name must match the tag passed to `docker build -t` in step 1
    f = Flow().add(uses='docker://executor-text-tfidfencoder-image:latest')

🎉️ Example

from jina import Flow, Document

f = Flow().add(uses='jinahub+docker://TFIDFTextEncoder')

with f:
    resp = f.post(inputs=Document(text='Han eats pizza'), return_results=True)
    print(f'{resp}')

Inputs

Documents with text. By default, the input text must be a unicode string.

Returns

Documents with embedding fields filled with a scipy.sparse.csr_matrix of shape n_vocabulary, the size of the fitted vectorizer's vocabulary.
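Because the embeddings are sparse, downstream code should avoid converting them to dense arrays where possible. As a rough sketch, two embeddings can be compared by cosine similarity using sparse operations only. The cosine_similarity helper and the 5-term vectors below are hypothetical and stand in for real doc.embedding values.

```python
import numpy as np
from scipy.sparse import csr_matrix


def cosine_similarity(a, b):
    """Cosine similarity of two 1 x n sparse row vectors."""
    dot = a.multiply(b).sum()
    norm_a = np.sqrt(a.multiply(a).sum())
    norm_b = np.sqrt(b.multiply(b).sum())
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector (no known terms) is treated as dissimilar to everything.
        return 0.0
    return float(dot / (norm_a * norm_b))


# Hypothetical embeddings over a 5-term vocabulary.
query = csr_matrix(np.array([[0.0, 0.6, 0.0, 0.8, 0.0]]))
doc_a = csr_matrix(np.array([[0.0, 0.6, 0.0, 0.8, 0.0]]))  # same terms as the query
doc_b = csr_matrix(np.array([[1.0, 0.0, 0.0, 0.0, 0.0]]))  # no terms in common

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Only the non-zero entries are stored and multiplied, which matters when n_vocabulary runs to tens of thousands of terms.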

🔍️ Reference

https://en.wikipedia.org/wiki/Tf-idf