pelagios/recogito2

Should the annotations not be stored in postgres?

nicolasfranck opened this issue · 3 comments

Right now the AnnotationServer only stores its data in Elasticsearch.

In my experience, indexes are designed for speed, not for long-term
storage, so they expect the data to be stored elsewhere as well.

Wouldn't it be better to store the annotations in postgres first, and then
index them in elasticsearch? Besides: on the edit page,
annotations are only loaded for a given document_id and filepart_id,
so those annotations could be loaded from postgres to reflect
the current state (there is a time lapse before records become visible in ES).
In that case, ES would only be necessary for more advanced functionality
like statistics, facets, etc.

Of course, this implies a way to "reindex" the data from postgres into ES,
possibly using a command-line script or a task.
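Such a reindex task could be sketched roughly as follows: stream every row out of the primary store and push it into the index in batches. Again a dict stands in for ES, and the `annotations` table schema and all names are assumptions for illustration:

```python
import json
import sqlite3

def reindex(conn, es_index, batch_size=500):
    """Rebuild the search index from the system of record, batch by batch."""
    batch = []
    for (body,) in conn.execute("SELECT body FROM annotations"):
        batch.append(json.loads(body))
        if len(batch) >= batch_size:
            index_batch(es_index, batch)
            batch = []
    if batch:
        index_batch(es_index, batch)

def index_batch(es_index, docs):
    """Stand-in for a single Elasticsearch _bulk request."""
    for doc in docs:
        es_index[doc["id"]] = doc
```

Batching matters here because issuing one index request per row is far slower than Elasticsearch's bulk API for large collections.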

ES brands itself as, basically, a NoSQL store, too. We decided initially that the advantages far outweigh the problems that reindexing and keeping things in sync would cause. (We've also operated the system for 3+ years now and are very happy with ElasticSearch as "permanent storage". Methods for backup and restore are also pretty well advanced, although knock on wood we haven't had the need to use them so far.)

The decision for ElasticSearch was based on three factors:

  • Annotations are inherently deeply nested objects, so working with ElasticSearch is a really good fit (although I'm aware that there's a JSON datatype in Postgres, too).
  • Annotations are the kind of data that are inserted at high speed, and ElasticSearch is really good and convenient for this. But again, not the main deciding factor.
  • Most importantly: there's a whole bunch of places where we need to run fast aggregation queries on annotation statistics (histograms over time, distinct place URIs, etc.). I still find the ElasticSearch query DSL pretty much unmatched in terms of simplicity. In other words: this saved us writing code, no more, no less ;-)
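For readers unfamiliar with the aggregations mentioned in the last point, here is the shape such a request takes in the Elasticsearch query DSL, built as a Python dict: a monthly histogram of annotation activity plus a distinct count of referenced place URIs, in one round trip. The field names (`created_at`, `place_uri`) are made up for illustration and are not Recogito's actual mapping:

```python
# size: 0 means we only want the aggregation results, not the matching documents.
annotation_stats_query = {
    "size": 0,
    "aggs": {
        # bucket annotations by the month they were created
        "activity_over_time": {
            "date_histogram": {"field": "created_at", "calendar_interval": "month"}
        },
        # approximate count of distinct place URIs referenced by annotations
        "distinct_places": {
            "cardinality": {"field": "place_uri"}
        }
    }
}
```

The equivalent in SQL would be a GROUP BY over a date truncation plus a COUNT(DISTINCT ...), which is exactly the hand-written query code the comment says Elasticsearch saved.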

I'd need to check, but I think we commit changes to the index for instant updates when annotations are involved (but not for batch inserts, e.g. when managing the gazetteers). So the time lapse isn't an issue for us in practice.

That being said: I'd definitely do some things in a different way if I got to redesign the system from scratch. Most importantly, separate out the storage backend, and make this a "real" W3C-compliant annotation server. Alas, that's a task for the future ;-)

Just thought I'd mention this, as it seems relevant: our developer at TLCMap (UON), Zongwen Fan, is working on enabling users to add places not found in the gazetteer. The intention is that these would end up in some sort of user-contributed gazetteer which could also be used in searches, and which could be exported and sent back to the people who look after the canonical gazetteer, as a way of crowdsourcing the identification of historical places. Zongwen noted that the places added don't go into the gazetteer but just into Elasticsearch, and we weren't clear on what the approach should be to solve the overall issue of users needing to be able to add places that aren't already there. We'll figure something out, but just thought I'd bring it up as timely and relevant to the discussion/planning on this issue.

Hi @bpascoe,

perhaps we can make a separate issue about this at some point. A possible approach would be to create a separate (empty) gazetteer first. Then you can add places by creating an EntityImporter yourself and using its importRecord method. This way, you don't need to deal with any of the specifics of how the gazetteer is organized internally. The place will just get added, as long as you make sure that your EntityRecord carries all the required information.

If you expose some kind of API route to the UI, you can add places individually that way.
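The flow described above could be sketched like this, with everything hypothetical: an API route accepts a user-contributed place, checks that it carries the required information, and hands it to an importer (standing in for a call to EntityImporter's importRecord). The field names are placeholders, not Recogito's actual EntityRecord schema:

```python
# Illustrative only: not the real EntityRecord fields.
REQUIRED_FIELDS = ("uri", "title", "source_gazetteer")

def import_user_place(record, import_record):
    """Validate a user-contributed place, then delegate to the importer callback."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        raise ValueError("record is missing required fields: {}".format(missing))
    return import_record(record)
```

Validating at the API boundary keeps the importer itself simple: by the time a record reaches it, the record is guaranteed complete.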