BlueBrain/Search

Run ETL pipeline and collect stats on downloads

Context

Once the ETL pipeline that downloads, filters, and parses papers from the various sources is complete (see #562), we need to run it for the first time to check that everything works and to collect statistics about the results.

Actions

  • Our ETL pipeline is not designed to download papers published before a certain date:
    # Data conventions and formats are different prior to these dates. We
    # download only if the starting date is more recent or equal to the
    # respective threshold.
    MIN_DATE = {

    So we need to download those old files manually. Hopefully, we have to do this only once.
  • Make sure to download all files in the same location on GPFS in a well-structured way. In this sense, this issue now includes the scope of #509.
  • Define the filter_config file by talking to scientists.
    parser.add_argument(
    "filter_config",
    type=Path,
    help="""
    Path to a .JSONL file that defines all the rules for filtering.
    """,
  • Test the ETL pipeline by running it on the last month of data (i.e. with --from_date set to last month).
  • Collect statistics about downloaded data. In particular, for each source (arxiv, biorxiv, pmc, ...) we want to know:
    • total number of papers (any topic / with relevant topic)
    • number of full-text papers (any topic / with relevant topic)
    • number of papers by format type, e.g. pdf, xml, ... (any topic / with relevant topic)
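
The statistics listed above could be gathered with a small counting helper. This is only a sketch under assumptions: the `Paper` record, its field names, and the key naming scheme are hypothetical and are not part of the existing pipeline, which may expose this metadata differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; the real pipeline metadata may look different.
    source: str       # e.g. "arxiv", "biorxiv", "pmc"
    fmt: str          # e.g. "pdf", "xml"
    full_text: bool   # True if the full text is available
    relevant: bool    # True if the paper passed the topic filter


def collect_stats(papers):
    """Return {source: Counter} with counts for any topic and relevant topic."""
    stats = {}
    for paper in papers:
        counter = stats.setdefault(paper.source, Counter())
        # Every paper counts towards "any"; relevant ones also count towards "relevant".
        scopes = ["any", "relevant"] if paper.relevant else ["any"]
        for scope in scopes:
            counter[f"total/{scope}"] += 1
            if paper.full_text:
                counter[f"full_text/{scope}"] += 1
            counter[f"format:{paper.fmt}/{scope}"] += 1
    return stats


example = [
    Paper("arxiv", "pdf", True, True),
    Paper("arxiv", "pdf", True, False),
    Paper("pmc", "xml", False, True),
]
stats = collect_stats(example)
for source, counter in sorted(stats.items()):
    print(source, dict(counter))
```

Keeping one `Counter` per source makes it easy to print the "any topic / with relevant topic" pairs side by side once real records are available.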

Pubmed Analysis

Baseline files

For (half of) the baseline, 562 files:

  • 16860000 articles
  • 16859975 unique UIDs
  • 25 duplicates
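
Counts of this kind can be reproduced from a flat list of UIDs. The sketch below assumes that "duplicates" means the number of article records minus the number of unique UIDs, which matches the figures above (16860000 - 16859975 = 25); the function name and the toy UIDs are illustrative only.

```python
from collections import Counter


def uid_summary(uids):
    """Summarise a list of UIDs: records, unique UIDs, duplicates, max copies."""
    counts = Counter(uids)
    n_articles = sum(counts.values())
    n_unique = len(counts)
    return {
        "articles": n_articles,
        "unique": n_unique,
        # Assumption: duplicates = records beyond the first copy of each UID.
        "duplicates": n_articles - n_unique,
        "max_copies": max(counts.values(), default=0),
    }


summary = uid_summary(["1", "2", "2", "3", "3", "3"])
print(summary)
```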

Update files downloaded

Global numbers

For updates_files: all files between pubmed22n1115.xml.gz and pubmed22n1204.xml.gz (2021-12-13 - 2022-02-22 = 71 days):

  • 90 files
  • 1762178 articles
  • 1012859 unique IDs (57.4776 % of the articles)
  • Out of 1012859 unique IDs, 298536 are already present in the (half) baseline (29.47% - might decrease with time as the baseline was created recently (?))
  • Completely new articles: 714323 (40.54 % of the articles)
  • Articles are sometimes present in several files (one of them has 46 copies).
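
As a sanity check, the derived figures in this list can be recomputed from the three raw counts. The variable names are illustrative; only the numbers come from the analysis above.

```python
# Raw counts reported for the update files.
n_articles = 1_762_178      # article records across the 90 update files
n_unique = 1_012_859        # unique IDs among those records
n_in_baseline = 298_536     # unique IDs already present in the (half) baseline

# Derived figures.
n_new = n_unique - n_in_baseline
pct_unique = 100 * n_unique / n_articles
pct_in_baseline = 100 * n_in_baseline / n_unique
pct_new = 100 * n_new / n_articles

print(f"completely new articles: {n_new}")
print(f"unique IDs / articles: {pct_unique:.2f} %")
print(f"already in baseline / unique IDs: {pct_in_baseline:.2f} %")
print(f"completely new / articles: {pct_new:.2f} %")
```

Note that the 29.47 % figure is relative to the unique IDs, while the 40.54 % figure is relative to all article records.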

What are the changes?

Analysis between pubmed22n1124.xml (published on 2021-12-19) and pubmed22n1147.xml (published on 2022-01-11)

  • 34102 articles
  • 33692 unique UIDs (98.79 % of the total)
  • 410 duplicates. Among those duplicated UIDs:
    • After parsing, 383 show no difference in the title, the author list, or the abstract paragraphs
    • 14 have different titles (only one difference remains after lowercasing the titles, and it is a change of punctuation)
    • 4 have different author lists (one corrects a typo, one adds an author, one removes a "Prof" title, and one switches the order of name and surname)
    • 22 have different abstracts (for 21 of them the number of paragraphs did not change; for the last one, the sentences are split into several paragraphs)
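
A comparison like the one above could be sketched as follows. The record layout (a dict with `title`, `authors`, and `abstract` as a list of paragraphs) and the `changed_fields` helper are assumptions made for illustration; they are not the parser's actual output format.

```python
def changed_fields(old, new, lowercase_titles=False):
    """Return the set of fields that differ between two versions of an article."""

    def title(record):
        # Optionally ignore case-only changes in the title.
        return record["title"].lower() if lowercase_titles else record["title"]

    changed = set()
    if title(old) != title(new):
        changed.add("title")
    if old["authors"] != new["authors"]:
        changed.add("authors")
    if old["abstract"] != new["abstract"]:
        changed.add("abstract")
    return changed


# Two hypothetical versions of the same UID.
v1 = {
    "title": "A Study of Neurons",
    "authors": ["A. Smith"],
    "abstract": ["First paragraph."],
}
v2 = {
    "title": "A study of neurons",
    "authors": ["A. Smith", "B. Jones"],
    "abstract": ["First paragraph."],
}

print(changed_fields(v1, v2))
print(changed_fields(v1, v2, lowercase_titles=True))
```

With `lowercase_titles=True`, case-only title changes are ignored, which mirrors the "after lowercasing" step described in the list.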