hugheylab/pmparser

journal.pub_year empty

Closed this issue · 2 comments

Hi Josh,

Do you want PMDB to reflect the exact data from the XML files, or could pmparser supply missing data?
The only case I found where missing data is supplied is in parse_element.R (parseJournal), where a missing pub_day appears to get the value '01' at line 252 (I hardly know any R, so I may be misreading it).

In #49 (comment) you said:

journal table - Indexing on the journal name as primary then dates as secondary could be useful in quickly querying for all articles published in a journal during a certain period of time (e.g.- determining trends or topics of papers during a few years in a single journal, etc.)

journal.pub_year is often NULL (6,735 cases among the 100,000 lowest PMIDs).

A missing journal.pub_year could be filled in from the first four digits of journal.medline_date.
That would make searching by publication year a lot easier.

  • I found no cases among the 10,000,000 lowest PMIDs where journal.medline_date does not start with four digits.
  • PubMed apparently works this way: searching on "2375 5472 32108" (with MedlineDate for 2375: "1975...", 5472: "1976 ..." and 32108: "1978-1979 ...") shows the first year in the facet "Results by year" (1975, 1976 and 1978).
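
For illustration, the fallback described above could look roughly like the sketch below. It is written in Python rather than pmparser's R, and it is not the project's implementation: the helper name `fill_pub_year` and the plain-string inputs are assumptions made for this example only.

```python
import re
from typing import Optional

# A MedlineDate value that starts with a four-digit year,
# e.g. "1975 Jul-Aug" or "1978-1979 Winter" (illustrative values).
_LEADING_YEAR = re.compile(r"^\s*(\d{4})")


def fill_pub_year(pub_year: Optional[str], medline_date: Optional[str]) -> Optional[str]:
    """Hypothetical helper: keep pub_year if present, else take the
    leading four-digit year from medline_date."""
    if pub_year:
        # An explicit year from the XML always wins.
        return pub_year
    if medline_date:
        match = _LEADING_YEAR.match(medline_date)
        if match:
            return match.group(1)
    # Neither field yields a year; leave it missing.
    return None


if __name__ == "__main__":
    print(fill_pub_year(None, "1978-1979 Winter"))
    print(fill_pub_year("1980", "1978-1979 Winter"))
    print(fill_pub_year(None, "Spring 1975"))
```

For a range such as "1978-1979 ..." this keeps only the first year, which matches what PubMed's "Results by year" facet appears to do. A value that does not start with four digits is left missing rather than guessed.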

Or is this an enhancement that you would rather leave up to users of PMDB?

best regards, geert

@JSchoenbachler Can you also, in the same PR, try extracting pub_year from MedlineDate in parseJournal()?

@globbestael @jakejh Functionality has been added in 71c297c, closing.