Run ETL pipeline and collect stats on downloads

Context

Once the ETL pipeline that downloads, filters, and parses papers from the various sources is complete (see #562), we need to run it for the first time to check that everything works and to collect statistics about the results.

Actions

Our ETL pipeline is not designed to download papers from before a certain date:

Search/src/bluesearch/entrypoint/database/download.py

Lines 30 to 33 in 5ed9701

    # Data conventions and formats are different prior to these dates. We
    # download only if the starting date is more recent or equal to the
    # respective threshold.
    MIN_DATE = {

So we need to download those old files manually. Hopefully, we only have to do this once.

Make sure to download all files to the same location on GPFS in a well-structured way. This issue therefore now also covers the scope of #509.
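As a rough illustration of the threshold rule quoted from download.py, the sketch below decides whether a requested start date falls before a source's threshold and therefore needs a manual download. The source names and dates in `EXAMPLE_MIN_DATE` are placeholders invented for this example, and `needs_manual_download` is a hypothetical helper; neither is taken from bluesearch.

```python
from datetime import date

# Placeholder thresholds for illustration only. The real values live in
# MIN_DATE in download.py and are not reproduced here.
EXAMPLE_MIN_DATE = {
    "source_a": date(2020, 1, 1),
    "source_b": date(2021, 6, 1),
}


def needs_manual_download(source, from_date, min_dates=EXAMPLE_MIN_DATE):
    """Return True if from_date is earlier than the threshold of source.

    The pipeline downloads only when the starting date is more recent than
    or equal to the threshold, so anything strictly earlier is manual work.
    """
    return from_date < min_dates[source]
```

For example, `needs_manual_download("source_a", date(2019, 12, 31))` is true, while a start date equal to the threshold is handled by the pipeline.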

Define the filter_config file by talking to scientists.

Search/src/bluesearch/entrypoint/database/topic_filter.py

Lines 58 to 63 in e2704e2

    parser.add_argument(
        "filter_config",
        type=Path,
        help="""
        Path to a .JSONL file that defines all the rules for filtering.
        """,

Test the ETL pipeline by running it on the last month (i.e. with --from_date set to last month).
Collect statistics about the downloaded data. In particular, for each source (arxiv, biorxiv, pmc, ...) we want to know:
- total number of papers (any topic / with relevant topic)
- number of full-text papers (any topic / with relevant topic)
- number of papers by format type, e.g. pdf, xml, ... (any topic / with relevant topic)
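The statistics listed above could be gathered along the following lines. This is a minimal sketch, assuming each downloaded paper has already been summarized into a small record; `PaperRecord` and `collect_stats` are hypothetical names for this example and not part of bluesearch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperRecord:
    """Hypothetical per-paper summary (not a bluesearch type)."""

    source: str  # e.g. "arxiv", "biorxiv", "pmc"
    file_format: str  # e.g. "pdf", "xml"
    is_full_text: bool
    is_relevant: bool  # True if the paper passed the topic filter


def collect_stats(records):
    """Return {source: {metric: [any_topic_count, relevant_topic_count]}}."""
    stats = {}
    for rec in records:
        per_source = stats.setdefault(rec.source, {})
        # Every paper counts towards the total and towards its format.
        metrics = ["total", f"format:{rec.file_format}"]
        if rec.is_full_text:
            metrics.append("full_text")
        for metric in metrics:
            counts = per_source.setdefault(metric, [0, 0])
            counts[0] += 1  # any topic
            if rec.is_relevant:
                counts[1] += 1  # with relevant topic
    return stats
```

Each metric holds a pair of counts, so the "any topic / with relevant topic" split from the list maps directly onto the two entries.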

Pubmed Analysis

Baseline files

For (half of) the baseline, 562 files:

16860000 articles
16859975 unique UIDs
25 duplicates

Update files downloaded

Global numbers

For updates_files: all files between pubmed22n1115.xml.gz and pubmed22n1204.xml.gz (2021-12-13 - 2022-02-22 = 71 days):

90 files
1762178 articles
1012859 unique IDs (57.4776 % of the articles)
Of the 1012859 unique IDs, 298536 are already present in the (half) baseline (29.47 %; this share might decrease over time, as the baseline was created recently (?))
Completely new articles: 714323 (40.54 % of the articles)
Articles sometimes appear in several files (one of them has 46 copies).
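The global numbers above can be reproduced with a simple counting pass. The sketch below assumes the UIDs of each update file have already been extracted into one collection per file; `summarize_uids` is a hypothetical helper, not part of the pipeline.

```python
from collections import Counter


def summarize_uids(uids_per_file, baseline_uids):
    """Summarize UID overlap across update files and against the baseline.

    uids_per_file: iterable with one iterable of UIDs per update file.
    baseline_uids: set of UIDs already present in the baseline.
    """
    copies = Counter()
    for uids in uids_per_file:
        copies.update(uids)  # count how often each UID appears overall

    unique = set(copies)
    in_baseline = unique & baseline_uids
    return {
        "articles": sum(copies.values()),
        "unique": len(unique),
        "in_baseline": len(in_baseline),
        "new": len(unique) - len(in_baseline),
        "max_copies": max(copies.values(), default=0),
    }
```

For the numbers reported above, the same subtraction gives 1012859 - 298536 = 714323 completely new articles.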

What are the changes?

Analysis between pubmed22n1124.xml (published on 2021-12-19) and pubmed22n1147.xml (published on 2022-01-11)

34102 articles
33692 unique UIDs (98.79 % of the total)
410 duplicates. Among the duplicated UIDs:
- After parsing, 383 show no difference in the title, the author list, or the abstract paragraphs
- 14 have different titles (only one still differs after lowercasing; that remaining difference is a change of punctuation)
- 4 have different author lists (one corrects a typo, one adds an author, one removes a "Prof" title, one switches the order of first name and surname)
- 22 have different abstracts (for 21 of them the number of paragraphs did not change; for the last one, the sentences are split into several paragraphs)
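A comparison like the one above can be sketched as a field-by-field diff between two parsed versions of the same UID. `ParsedArticle` and `changed_fields` are hypothetical names for this example, and the real parsed representation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedArticle:
    """Hypothetical parsed record for one version of an article."""

    uid: str
    title: str
    authors: tuple  # author names, in order
    abstract: tuple  # abstract paragraphs, in order


def changed_fields(old, new, ignore_case=False):
    """Return the set of field names that differ between two versions."""

    def norm(text):
        return text.lower() if ignore_case else text

    changed = set()
    if norm(old.title) != norm(new.title):
        changed.add("title")
    if old.authors != new.authors:
        changed.add("authors")
    if old.abstract != new.abstract:
        changed.add("abstract")
    return changed
```

Passing `ignore_case=True` mirrors the lowercasing step applied to the titles, which removes differences that are only a matter of capitalization.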
