Using Jina to search through sentences from English-language Wikipedia

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Search Wikipedia Sentences with Jina

This is an example of using Jina's neural search framework to search through a selection of individual Wikipedia sentences downloaded from Kaggle. It's based on code generated by jina hub new --type app.

Run in Docker

To test this example you can run a Docker image with 30,000 pre-indexed sentences:

docker run -p 65481:65481 jinahub/app.app.jina-wikipedia-sentences-30k

You can then query by running:

curl --request POST -d '{"top_k": 10, "mode": "search", "data": ["text:hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:65481/api/search'

Setup

  1. pip install -r requirements.txt
  2. Set up Kaggle
  3. sh ./get_data.sh
  4. export JINA_DATA_PATH='data/input.txt'

Index

python app.py index

To cap the number of documents that get indexed, set MAX_DOCS before running the index command, for example: export MAX_DOCS=500
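
As a rough illustration of how a cap like MAX_DOCS can be applied, here is a minimal sketch. It is not the code in app.py: the helper names take_sentences and max_docs_from_env, and the fallback value of 500, are assumptions made for this example. Only the MAX_DOCS variable name comes from this README.

```python
import os
from itertools import islice


def max_docs_from_env(default=500):
    """Read the document cap from MAX_DOCS.

    The default of 500 is an assumption for this sketch, not necessarily
    what app.py uses.
    """
    return int(os.environ.get("MAX_DOCS", default))


def take_sentences(lines, max_docs):
    """Yield at most max_docs non-empty, stripped lines (hypothetical helper)."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    # islice stops reading as soon as the cap is reached, so a large
    # input file is never loaded into memory in full.
    yield from islice(non_empty, max_docs)
```

With a file, this could be used as take_sentences(open(os.environ["JINA_DATA_PATH"], encoding="utf-8"), max_docs_from_env()).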

Search

python app.py search

Then:

curl --request POST -d '{"top_k": 10, "mode": "search",  "data": ["text:hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:65481/api/search'

Or use Jinabox with endpoint http://127.0.0.1:65481/api/search
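
If you would rather query from Python than curl, the sketch below builds the same POST request using only the standard library. The helper names build_search_request and search are made up for this example, and it assumes the server replies with a JSON body; the URL, headers and payload are copied from the curl command above.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this README.
SEARCH_URL = "http://0.0.0.0:65481/api/search"


def build_search_request(query, top_k=10, url=SEARCH_URL):
    """Build the POST request that the curl command sends (hypothetical helper)."""
    payload = {"top_k": top_k, "mode": "search", "data": [f"text:{query}"]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, top_k=10, url=SEARCH_URL):
    """Send the query and return the decoded response.

    Assumes the server is already running and that it answers with JSON.
    """
    with urllib.request.urlopen(build_search_request(query, top_k, url)) as response:
        return json.load(response)
```

Calling search("hello world") only works while the search server is running, either through python app.py search or the Docker image.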

Build a Docker Image

This will create a Docker image with pre-indexed data and an open port for REST queries.

  1. Run all the steps in Setup and Index first. Don't run anything from the Search section!
  2. If you want to push to Jina Hub, be sure to edit the LABELs in Dockerfile to avoid clashing with other images
  3. Run docker build -t <your_image_name> . in the root directory of this repo
  4. Run it with docker run -p 65481:65481 <your_image_name>
  5. Search using the instructions in the Search section above

Image name format

Use the following name format for your Docker image; otherwise it will be rejected if you try to push it to Jina Hub. See also the Versioning Weirdness section below, which explains my versioning workaround.

jinahub/type.kind.jina-image-name:image-jina_version

For example:

jinahub/app.app.jina-wikipedia-sentences-30k:0.2.3-0.9.5
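
To avoid typos in the tag, the name can be assembled and checked with a small helper. This is only a sketch: format_image_name and parse_image_name are hypothetical names, and the character rules in the regular expression are my assumption based on the example above, not a rule published by Jina Hub.

```python
import re

# Pattern for jinahub/type.kind.jina-image-name:image-jina_version.
# The allowed characters and the three-part version numbers are assumptions
# inferred from the example name, not an official specification.
IMAGE_NAME_RE = re.compile(
    r"^jinahub/(?P<type>[a-z]+)\.(?P<kind>[a-z]+)\.(?P<name>[a-z0-9][a-z0-9-]*)"
    r":(?P<image_version>\d+\.\d+\.\d+)-(?P<jina_version>\d+\.\d+\.\d+)$"
)


def format_image_name(type_, kind, name, image_version, jina_version):
    """Assemble an image name in the format shown above."""
    return f"jinahub/{type_}.{kind}.{name}:{image_version}-{jina_version}"


def parse_image_name(image):
    """Split an image name into its parts, or raise ValueError if it does not fit."""
    match = IMAGE_NAME_RE.match(image)
    if match is None:
        raise ValueError(f"image name does not match the expected format: {image!r}")
    return match.groupdict()
```

For example, format_image_name("app", "app", "jina-wikipedia-sentences-30k", "0.2.3", "0.9.5") produces the example name above, and parse_image_name splits it back into its five parts.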

Push to Jina Hub

  1. Ensure hub is installed with pip install jina[hub]
  2. Run jina hub login and paste the code into your browser to authenticate
  3. Run jina hub push <your_image_name>

Notes

Changes from Default

At the time of writing, jina hub new... creates an encode.yml with max_length: 96. I changed this to 196, which gives more accurate results (i.e. the query word actually appears in the text of the results).

Versioning Weirdness

At the time of writing, the version of Jina in requirements.txt doesn't match the jina_version label we use in our docker build ... command.

Why?

  • I built this example with Jina 0.8.2
  • Jina Hub expects you to push with the same version you built with (i.e. it would expect me to use jina[hub]==0.8.2 to push)
  • Pushing with Jina Hub wouldn't work for me until 0.9.5. Luckily jina hub push ... only cares about the Docker image, not my actual code (I ran jina[hub]==0.9.5 in a separate virtualenv)
  • We're working on updating this example code to 0.9.5 to get around this ugly kluge and delete this note!