⚠️ This repository has been archived: the inventaire server itself now takes care of keeping the Elasticsearch entities and wikidata indexes updated.

Entities Search Engine

Scripts and microservice to feed an ElasticSearch instance with Wikidata and Inventaire entities (see entities map) and keep them up to date, in order to answer questions like "give me all the humans with a name starting with xxx" very quickly, typically for the needs of an autocomplete field.

For the Wikidata-only version, see the archived #wikidata-subset-search-engine branch.

Setup

See setup.

Dependencies

See setup to install the dependencies:

  • NodeJs >= v6.4
  • ElasticSearch (this repo was developed targeting ElasticSearch v2.4, but it should work with newer versions with some minimal changes)
  • Nginx
  • Let's Encrypt
  • already installed on any good *nix system: curl, gzip

Start server

See Wikidata and Inventaire per-entity import.

Data imports

from scratch

add

Wikidata entities

3 ways to import Wikidata entities data into your ElasticSearch instance

Inventaire entities

update

To update any entity, simply re-add it, typically by posting its URI (e.g. 'wd:Q180736' for a Wikidata entity, or 'inv:9cf5fbb9affab552cd4fb77712970141' for an Inventaire one) to the server.
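A minimal sketch of such a request. The payload shape (a JSON object mapping an entity type to an array of URIs), the helper names, and the `SERVER_URL` variable are all assumptions made for illustration, not the documented interface; check the per-entity import documentation for the actual one.

```shell
# Hypothetical helper: builds a JSON body of the form {"<type>":["<uri>",...]}.
# ASSUMPTION: this payload shape is illustrative, not the server's confirmed API.
build_payload () {
  entity_type="$1"; shift
  body=""
  for uri in "$@"; do
    # Append each URI as a quoted JSON string, comma-separated
    body="${body:+$body,}\"$uri\""
  done
  printf '{"%s":[%s]}\n' "$entity_type" "$body"
}

# Hypothetical helper: posts the payload to the server.
# $1 is the server address, which depends on your setup.
reindex_entities () {
  server_url="$1"; shift
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(build_payload "$@")" "$server_url"
}

# Example (not run here; SERVER_URL is a placeholder for your server address):
#   reindex_entities "$SERVER_URL" humans wd:Q180736
```

Keeping the payload construction separate from the HTTP call makes it easy to inspect what would be sent before posting anything.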

remove

To un-index entities that were mistakenly added, pass the path of a results JSON file, which is expected to contain an array of ids. The documents matching all those ids will be deleted.

# Un-index mistakenly added Wikidata humans
index=wikidata
type=humans
ids_json_array=./queries/results/mistakenly_added_wikidata_humans_ids.json
npm run delete-from-results "$index" "$type" "$ids_json_array"

# Un-index mistakenly added Inventaire works
index=entities-prod
type=works
ids_json_array=./queries/results/mistakenly_added_inventaire_works_ids.json
npm run delete-from-results "$index" "$type" "$ids_json_array"
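The results file can be written by hand or generated. The following is a minimal sketch of generating one, assuming the ids are plain strings in a flat JSON array; `write_ids_file` is a hypothetical helper, not part of this repository, and the ids in the example are placeholders.

```shell
# Hypothetical helper: write_ids_file FILE ID...
# Writes the given ids to FILE as a flat JSON array of strings.
write_ids_file () {
  file="$1"; shift
  # Create the parent directory if it does not exist yet
  mkdir -p "$(dirname "$file")"
  ids=""
  for id in "$@"; do
    ids="${ids:+$ids, }\"$id\""
  done
  printf '[%s]\n' "$ids" > "$file"
}

# Example with placeholder ids (not run here):
#   write_ids_file ./queries/results/mistakenly_added_wikidata_humans_ids.json Q1 Q2
```

Reviewing the generated file before running the delete script is worthwhile, since the deletion applies to every id it lists.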

importing dumps

You can import dumps from the inventaire.io production Elasticsearch instance:

# Download Wikidata dump
wget -c https://dumps.inventaire.io/wd/elasticsearch/wikidata_data.json.gz
gzip -d wikidata_data.json.gz
# elasticdump should have been installed when running `npm install`
# --limit: increases the batch size
./node_modules/.bin/elasticdump --input=./wikidata_data.json --output=http://localhost:9200/wikidata --limit 2000

# Same for Inventaire
wget -c https://dumps.inventaire.io/inv/elasticsearch/entities_data.json.gz
gzip -d entities_data.json.gz
./node_modules/.bin/elasticdump --input=./entities_data.json --output=http://localhost:9200/entities --limit 2000
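After an import, a quick sanity check is to compare the number of documents in the dump with what Elasticsearch reports. A minimal sketch, assuming the decompressed dump holds one JSON document per line; `dump_doc_count` is a hypothetical helper.

```shell
# Hypothetical helper: counts the documents in a decompressed dump file,
# ASSUMING one JSON document per line.
dump_doc_count () {
  wc -l < "$1" | tr -d ' '
}

# Compare with the count reported by Elasticsearch (not run here):
#   dump_doc_count ./wikidata_data.json
#   curl "http://localhost:9200/wikidata/_count"
```

The two numbers may differ briefly right after the import, until the index has been refreshed.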

Query ElasticSearch

curl "http://localhost:9200/wikidata/humans/_search?q=Victor%20Hugo"
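For the autocomplete use case, the same URI search can take a trailing wildcard. The sketch below builds such URLs; `search_url` and the `ES_HOST` variable are hypothetical, and how the wildcard matches depends on the index mappings and the ElasticSearch version, so treat it as a starting point rather than the query the project itself uses.

```shell
# Hypothetical helper: search_url INDEX TYPE QUERY
# Builds a URI-search URL. Only spaces are percent-encoded here;
# other reserved characters are passed through unchanged.
search_url () {
  es_host="${ES_HOST:-http://localhost:9200}"
  query=$(printf '%s' "$3" | sed 's/ /%20/g')
  printf '%s/%s/%s/_search?q=%s\n' "$es_host" "$1" "$2" "$query"
}

# Full-name query, as in the example above (not run here):
#   curl "$(search_url wikidata humans 'Victor Hugo')"
# Prefix-style query using the query-string wildcard syntax (not run here):
#   curl "$(search_url wikidata humans 'Victor Hu*')"
```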

Donate

We are developing and maintaining tools to work with Wikidata from NodeJS, the browser, or simply the command line, with quality and ease of use at heart. Any donation will be interpreted as a "please keep going, your work is very much needed and awesome. PS: love".

You may also like

Do you know inventaire.io? It's a web app to share books with your friends, built on top of Wikidata! And it's libre software too.

License

AGPL-3.0