bldavies/nberwp

Data contain duplicates

bldavies opened this issue · 1 comment

Twenty titles appear twice:

library(dplyr)
library(nberwp)

papers %>%
  count(title) %>%
  count(n)
#> # A tibble: 2 x 2
#>       n    nn
#>   <int> <int>
#> 1     1 26553
#> 2     2    20

Some duplicates might be valid. Worth checking manually.
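
To make the manual check easier, the duplicated titles can be listed next to their paper IDs. This is a minimal sketch on a toy table rather than the real data: `toy_papers` and its `paper` ID column are assumptions for illustration, and only the `title` column is taken from the code above.

```r
library(dplyr)

# Toy stand-in for `papers`; the `paper` ID column is an assumption
toy_papers <- tibble(
  paper = c("w0001", "w0002", "w0003", "w0004"),
  title = c("A", "B", "A", "C")
)

# Keep only rows whose title occurs more than once,
# then sort so that copies of the same title sit together
dupes <- toy_papers %>%
  group_by(title) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(title, paper)

dupes
```

Running the same pipeline on `papers` should give one row per copy of each duplicated title, which is enough to look each pair up by hand.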

Also, two titles reference updated versions:

papers$title[grepl('W[0-9]', papers$title, ignore.case = TRUE)]
#> [1] "State and Local Taxes and the Rate of Return on Nonfinancial Corporate Capital (revised as W0740)"
#> [2] "Wage-Employment Contracts (Replaced by W0675)"

These should probably be removed.
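
Before removing them, it may help to pull out which paper each title points to. A rough sketch using base R on the two titles printed above; lower-casing the result to match IDs like `w0675` is my assumption about how IDs are written.

```r
# The two flagged titles, copied from the output above
titles <- c(
  "State and Local Taxes and the Rate of Return on Nonfinancial Corporate Capital (revised as W0740)",
  "Wage-Employment Contracts (Replaced by W0675)"
)

# Extract the first "W" followed by digits from each title,
# then lower-case it (assumed ID format)
refs <- tolower(regmatches(titles, regexpr("W[0-9]+", titles)))

refs
```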

I did the manual checking suggested in the comment above. The NBER website states that several papers are accidental re-issues of earlier papers:

I will remove the accidental re-issues. I will also remove:

  • w0623, which was "replaced" by w0675 in the same year;
  • w9101, which appears to be a (very) minor revision of w9080 published one month later by the same authors;
  • w9694, which appears to be a re-issue of w9483 published three months later by the same author;
  • w13410, which is identical to w13409.
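
The four removals listed above could be done with a simple filter on paper ID. A minimal sketch, assuming the ID lives in a column called `paper` (hypothetical name) and using a toy table that holds only the IDs mentioned in this comment:

```r
library(dplyr)

# Toy table with only the IDs mentioned above; `paper` is an assumed column name
toy_papers <- tibble(
  paper = c("w0623", "w0675", "w9080", "w9101",
            "w9483", "w9694", "w13409", "w13410")
)

# IDs to drop, as listed in the bullets above
to_remove <- c("w0623", "w9101", "w9694", "w13410")

# Keep every paper whose ID is not in the removal list
cleaned <- toy_papers %>%
  filter(!paper %in% to_remove)

cleaned
```

The accidental re-issues could be appended to `to_remove` in the same way.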

Some papers are revisions of earlier papers with identical titles:

I don't think it is fair to remove these revisions. First, there may be similarly substantial revisions that cannot be found by looking for duplicate titles. Second, revisions indicate continued collaboration on, and thought about, a topic, and I think this continuation is important to acknowledge within the data.

Finally, some papers have identical titles but are, in fact, different papers with different sets of authors: