Run ETL pipeline and collect stats on downloads
Context
Once the ETL pipeline to download, filter, and parse papers from the various sources is complete (see #562), we need to run it for the first time to check that everything works and to collect statistics about the results.
Actions
- Our ETL pipeline is not designed to download papers from before a certain date (see Search/src/bluesearch/entrypoint/database/download.py, lines 30 to 33 in 5ed9701), so we need to download those old files manually. Hopefully, we have to do this only once.
- Make sure to download all files to the same location on GPFS in a well-structured way. In this sense, this issue now includes the scope of #509.
- Define the filter_config file by talking to scientists (see Search/src/bluesearch/entrypoint/database/topic_filter.py, lines 58 to 63 in e2704e2).
- Test the ETL pipeline by running it on the last month (i.e. --from_date equal to the last month).
- Collect statistics about downloaded data. In particular, for each source (arxiv, biorxiv, pmc, ...) we want to know:
  - total number of papers (any topic / with relevant topic)
  - number of full-text papers (any topic / with relevant topic)
  - number of papers by format type, e.g. pdf, xml, ... (any topic / with relevant topic)
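The statistics listed above could be tallied along the following lines. This is only a minimal sketch: the `Paper` record, its field names, and the `collect_stats` helper are hypothetical and are not part of bluesearch; the sketch assumes that each downloaded paper can be described by its source, its format, whether the full text is available, and whether it passed the topic filter.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical record describing one downloaded paper."""

    source: str      # e.g. "arxiv", "biorxiv", "pmc"
    fmt: str         # e.g. "pdf", "xml"
    full_text: bool  # True if the full text is available
    relevant: bool   # True if the paper passed the topic filter


def collect_stats(papers):
    """Return one Counter per source, for any topic and for relevant topics."""
    stats = {}
    for paper in papers:
        counter = stats.setdefault(paper.source, Counter())
        # Every paper counts under "any"; relevant ones also under "relevant".
        scopes = ["any"] + (["relevant"] if paper.relevant else [])
        for scope in scopes:
            counter[f"total/{scope}"] += 1
            if paper.full_text:
                counter[f"full_text/{scope}"] += 1
            counter[f"format:{paper.fmt}/{scope}"] += 1
    return stats
```

Each source then maps to counters such as `total/any`, `full_text/relevant`, or `format:pdf/any`, which correspond one to one to the three bullet points above.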
Pubmed Analysis
Baseline files
For (half of) the baseline, 562 files:
- 16860000 articles
- 16859975 unique UIDs
- 25 duplicates
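For reference, counts like these could be reproduced roughly as follows. This is a sketch, not the script that was actually used: it assumes that the UID is the PMID found under `PubmedArticle/MedlineCitation/PMID` in each file, and the helper names (`read_gz`, `iter_pmids`, `count_uids`) are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def read_gz(path):
    """Return the decompressed bytes of one .xml.gz file."""
    with gzip.open(path, "rb") as f:
        return f.read()


def iter_pmids(xml_bytes):
    """Yield the PMID of every PubmedArticle in one XML document."""
    root = ET.fromstring(xml_bytes)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid is not None:
            yield pmid.strip()


def count_uids(documents):
    """Return (n_articles, n_unique_uids, n_duplicates) over all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(iter_pmids(doc))
    n_articles = sum(counts.values())
    n_unique = len(counts)
    # Duplicates are the articles left over once each UID is counted once.
    return n_articles, n_unique, n_articles - n_unique
```

With the files on disk, this would be called as `count_uids(read_gz(p) for p in paths)`. Parsing whole files in memory keeps the sketch short; a streaming parser may be preferable for the full baseline.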
Update files downloaded
Global numbers
For updates_files: all files between pubmed22n1115.xml.gz and pubmed22n1204.xml.gz (2021-12-13 to 2022-02-22 = 71 days):
- 90 files
- 1762178 articles
- 1012859 unique IDs (57.4776 % of the articles)
- Out of the 1012859 unique IDs, 298536 are already present in the (half) baseline (29.47 %; this might decrease with time, as the baseline was created recently (?))
- Completely new articles: 714323 (40.54 % of the articles)
- Articles are sometimes present in several files (one of them has 46 copies).
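The percentages above use different denominators (all articles for the unique and the new IDs, unique IDs for the overlap with the baseline). The short check below makes that explicit; the variable names are only illustrative and the input numbers are the ones reported in this comment.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


n_articles = 1_762_178    # articles across the 90 update files
n_unique = 1_012_859      # unique IDs among those articles
n_in_baseline = 298_536   # unique IDs already present in the (half) baseline

# Completely new articles: unique IDs that are not in the baseline.
n_new = n_unique - n_in_baseline

pct_unique = percent(n_unique, n_articles)         # share of all articles
pct_in_baseline = percent(n_in_baseline, n_unique) # share of unique IDs
pct_new = percent(n_new, n_articles)               # share of all articles

print(f"{n_new} new articles")
print(f"{pct_unique:.2f} % unique, {pct_in_baseline:.2f} % in baseline, {pct_new:.2f} % new")
```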
What are the changes?
Analysis between pubmed22n1124.xml (published on 2021-12-19) and pubmed22n1147.xml (published on 2022-01-11):
- 34102 articles
- 33692 unique UIDs (98.79 % of the total)
- 410 duplicates; among those duplicated UIDs:
  - After parsing, 383 show no difference in the title, the author list, or the abstract paragraphs
  - 14 have different titles (but only one remains after lowercasing them, and that one differs only in punctuation)
  - 4 have different author lists (one corrects a typo, one adds an author, one removes a Prof title, one switches the order of name and surname)
  - 22 have different abstracts (for 21 of them the number of paragraphs did not change; for the last one, the sentences are split into several paragraphs)
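The comparison of two versions of the same article could look roughly like this. It is a sketch under assumptions: the `Record` type and the `changed_fields` helper are hypothetical, and the parsed article is assumed to expose a title, a list of authors, and a list of abstract paragraphs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical parsed view of one version of an article."""

    title: str
    authors: tuple   # author names, in order
    abstract: tuple  # abstract paragraphs, in order


def changed_fields(old, new, ignore_case=False):
    """Return the set of field names that differ between two versions."""

    def norm(text):
        # Optionally lowercase titles so that pure case changes are ignored.
        return text.lower() if ignore_case else text

    changed = set()
    if norm(old.title) != norm(new.title):
        changed.add("title")
    if old.authors != new.authors:
        changed.add("authors")
    if old.abstract != new.abstract:
        changed.add("abstract")
    return changed
```

Running it once with `ignore_case=False` and once with `ignore_case=True` separates title changes that are only a matter of capitalization from the remaining ones.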