Create Elasticsearch database with literature sample
FrancescoCasalegno opened this issue · 1 comments
FrancescoCasalegno commented
Context
- We want to have a first version of the Elasticsearch literature database.
- It's OK if we don't handle all sources yet (e.g. arXiv takes huge disk space to download, and PMC changes its baseline on an irregular basis). Our goal here is to have an MVP that we and the scientists can interact with.
- Once we have ingested literature into this database, we can:
  - update our Search widget to interact with the database
  - let scientists use the Search widget and provide feedback on results
Actions
- Use our `bbs_database` commands to download + process + ingest all articles from PubMed, bioRxiv, and medRxiv.
- Take topics into account: only articles with relevant topics should be parsed and ingested, and info on topics should be preserved in our database (see #619).
- If we need disk storage, we can clean up the raw downloaded files once we have uploaded all the data to the database.
- Check logs, and determine how long this process takes.
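To make the topic requirement above concrete, here is a minimal sketch of filtering articles by topic while keeping the topic info in the document that gets ingested. It does not use the `bbs_database` commands; the field names (`uid`, `title`, `source`, `topics`), the topic set, and the index name are all assumptions made for illustration. The dictionaries follow the action format accepted by `elasticsearch.helpers.bulk` (`_index`, `_id`, `_source`), so no running cluster is needed to try it.

```python
import time
from typing import Iterable, Iterator, Set

# Hypothetical values: the real topic list and index name are not given in this issue.
RELEVANT_TOPICS = {"neuroscience", "neurons"}
INDEX_NAME = "articles"


def to_bulk_actions(
    articles: Iterable[dict], relevant_topics: Set[str], index: str
) -> Iterator[dict]:
    """Yield one bulk action per article that has at least one relevant topic."""
    for article in articles:
        matched = sorted(set(article.get("topics", [])) & relevant_topics)
        if not matched:
            continue  # skip articles without any relevant topic
        yield {
            "_index": index,
            "_id": article["uid"],
            "_source": {
                "title": article["title"],
                "source": article["source"],
                "topics": article["topics"],  # preserve all topics of the article
                "matched_topics": matched,  # topics that made it relevant
            },
        }


# Made-up sample records, one per source mentioned in the issue.
sample = [
    {"uid": "pubmed-1", "title": "A", "source": "pubmed", "topics": ["neuroscience", "imaging"]},
    {"uid": "biorxiv-1", "title": "B", "source": "biorxiv", "topics": ["botany"]},
    {"uid": "medrxiv-1", "title": "C", "source": "medrxiv", "topics": ["neurons"]},
]

start = time.perf_counter()
actions = list(to_bulk_actions(sample, RELEVANT_TOPICS, INDEX_NAME))
elapsed = time.perf_counter() - start
print(f"kept {len(actions)} of {len(sample)} articles in {elapsed:.4f}s")
```

With a real cluster, the resulting `actions` could be passed to `elasticsearch.helpers.bulk(client, actions)`. Timing each step the same way would give a rough answer to how long the whole process takes, alongside what the logs report.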
Dependencies
FrancescoCasalegno commented
Planning 2022-10-04
- Instead of trying to download the entire neuroscience literature published to date, we will start with a few dozen papers and see what comes out of this. Downloading everything is complex for two reasons: topic filtering requires a configuration file (see #640), and it would mean handling a very large number of papers at once.
- We discussed this with the scientists, and they are fine with it.