BlueBrain/Search

Create Elasticsearch database with literature sample

FrancescoCasalegno opened this issue · 1 comments

Context

  • We want to have a first version of the Elasticsearch literature database.
  • It's OK if we don't handle all sources yet (e.g. arXiv takes huge disk space to download, and PMC changes its baseline on an irregular basis). Our goal here is to have an MVP that we and the scientists can interact with.
  • Once we have ingested literature into this database, we can:
    • update our Search widget to interact with the database
    • let scientists use the Search widget and provide feedback on results

Actions

  • Use our bbs_database commands to download + process + ingest all articles from PubMed, bioRxiv, medRxiv.
  • Take topics into account: only articles with relevant topics should be parsed and ingested, and info on topics should be preserved in our database (see #619).
  • If we need disk space, we can clean up the raw downloaded files once we have uploaded all the data to the database.
  • Check logs, and determine how long this process takes.
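
To make the topic filtering and ingestion steps above more concrete, here is a minimal sketch, not the actual bbs_database implementation. It assumes each parsed article is a plain dict with `id`, `title`, `source` and `topics` keys; the topic set `RELEVANT_TOPICS` and the index name `articles` are hypothetical placeholders. It only prepares the NDJSON lines expected by the Elasticsearch bulk endpoint (an action line followed by the document) and times that step; it does not talk to a cluster.

```python
import json
import time
from typing import Iterable, Iterator

# Hypothetical values: the real topic configuration and index name
# are not specified in this issue.
RELEVANT_TOPICS = {"neuroscience", "neurons"}
INDEX_NAME = "articles"


def is_relevant(article: dict) -> bool:
    """Keep an article if at least one of its topics is in the relevant set."""
    topics = {topic.lower() for topic in article.get("topics", [])}
    return bool(topics & RELEVANT_TOPICS)


def to_bulk_lines(articles: Iterable[dict]) -> Iterator[str]:
    """Yield bulk-format NDJSON lines: one action line, then one document line."""
    for article in articles:
        if not is_relevant(article):
            continue
        yield json.dumps({"index": {"_index": INDEX_NAME, "_id": article["id"]}})
        # Topics are stored with the document so that they are preserved.
        yield json.dumps(
            {
                "title": article["title"],
                "source": article["source"],
                "topics": article["topics"],
            }
        )


# Made-up sample records, for illustration only.
sample = [
    {"id": "a1", "title": "Dendritic integration", "source": "pubmed", "topics": ["Neuroscience"]},
    {"id": "a2", "title": "Crop yields", "source": "biorxiv", "topics": ["Agriculture"]},
    {"id": "a3", "title": "Cortical neurons", "source": "medrxiv", "topics": ["Neurons", "Imaging"]},
]

start = time.perf_counter()
lines = list(to_bulk_lines(sample))
elapsed = time.perf_counter() - start

# The bulk request body must end with a newline.
bulk_body = "\n".join(lines) + "\n"
print(f"Prepared {len(lines) // 2} documents in {elapsed:.4f}s")
```

Logging the elapsed time per stage (download, parse, ingest) in the same way would give a first estimate of how long the full process takes.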

Dependencies

Planning 2022-10-04

  • Instead of trying to download all neuroscience papers published to date, we will start with a few dozen papers and see what comes out of this. The full download is particularly complex because topic filtering requires a configuration file (see #640), and it would also mean handling a very large number of papers at once.
  • We discussed this with the scientists, and they are fine with it.