[DAPS, metaissue] Create integrated pipeline for data collections to ES
jaklinger opened this issue · 1 comments
jaklinger commented
- #265 Add versioning for config management
- #266 Migrate to base ES config, to remove repetition from config
- #267 Migrate to template ES mappings, to remove repetition from mappings
- [ ] #268 Create Sql2ES subclass to pick up pipeline dependencies
- #269 Add new `general` endpoint (ES7)
- #270 (`general`) Create GtR mapping and validate mapping for ES7
- #271 (`general`) Request validation on GtR data, then run in production mode
Prerequisite for the following:
for each (`gtr`, `arxiv`, `companies`, `nih` #313, `patstat` #315, `cordis`):
- (`general`) Create new mapping for ES7 + validate mapping
- (`general`) Request validation on data, then run in production mode, add collect
- Add all `general` pipelines to weekly schedule
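
As an illustration of the "base config" and "template mappings" idea in #266 and #267, here is a minimal sketch of how a shared template could be combined with per-dataset fields. The names `deep_merge`, `BASE_MAPPING` and `ARXIV_FIELDS`, and the field definitions themselves, are hypothetical and not taken from the codebase.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict with `override` recursively merged on top of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge their contents rather than replace
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical template shared by every dataset
BASE_MAPPING = {
    "settings": {"number_of_shards": 1},
    "mappings": {"dynamic": "strict", "properties": {"id": {"type": "keyword"}}},
}

# Hypothetical fields specific to one dataset
ARXIV_FIELDS = {
    "mappings": {"properties": {"title": {"type": "text"}}},
}

arxiv_mapping = deep_merge(BASE_MAPPING, ARXIV_FIELDS)
```

Each dataset then only declares what differs from the template, and the template itself is never mutated.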
Regarding endpoint migration from ES6 to ES7:
- [ ] Run AWS's migration from `ES6.x` to `ES7.x` on `eurito-dev`, and validate
- [ ] Run AWS's migration from `ES6.x` to `ES7.x` on `health-scanner`, and validate
- Run AWS's migration from `ES6.x` to `ES7.x` on `arxlive`, and validate
Finally:
- Rearrange pipelines into datasets and projects
- Provide training for Luca & Seb
jaklinger commented
Items which in the end will not be addressed in this issue:
- NiH data is not added to the `general` pipeline yet, since the deduplication strategy currently relies on creating two indexes in Elasticsearch: one for all documents (including dupes), to which the deduplication logic is applied, and one for the deduped documents. This doesn't fit within the `general` pipeline paradigm and will require significant rewiring (of course, it is doable, but starts to fall out of scope for now). Created issue #317.
- `eurito-dev` is deprecated in favour of `general`, and so will not be upgraded to ES7.
- `health-scanner` is deprecated in favour of `health-mosaic`, and will be dealt with when I'm given the green light to work on that.
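
To make the two-index NiH arrangement above more concrete, here is a rough sketch using plain Python lists in place of Elasticsearch indexes. The actual deduplication logic is not described in this issue, so the normalised-title key, the `duplicate_of` field and the `split_indexes` helper are all assumptions for illustration only.

```python
def split_indexes(docs, key_fn):
    """Return (all_docs, deduped).

    `all_docs` holds every document, annotated with the id of the first
    document sharing its key (or None if it is the first). `deduped` holds
    only those first occurrences.
    """
    first_seen = {}
    all_docs, deduped = [], []
    for doc in docs:
        # Remember the first id seen for this key; later matches are dupes
        canonical_id = first_seen.setdefault(key_fn(doc), doc["id"])
        duplicate_of = None if canonical_id == doc["id"] else canonical_id
        all_docs.append({**doc, "duplicate_of": duplicate_of})
        if duplicate_of is None:
            deduped.append(doc)
    return all_docs, deduped


# Hypothetical documents; ids are assumed to be unique
docs = [
    {"id": 1, "title": "Gene therapy trial"},
    {"id": 2, "title": "gene therapy trial "},
    {"id": 3, "title": "Vaccine study"},
]
all_docs, deduped = split_indexes(docs, key_fn=lambda d: d["title"].strip().lower())
```

The point of the sketch is that the second collection depends on the first being complete, which is the extra stage that a single-index pipeline has no slot for.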