Data contain duplicates

Question

bldavies opened this issue 5 years ago · 1 comments

Twenty titles appear twice:

library(dplyr)
library(nberwp)

papers %>%
  count(title) %>%
  count(n)
#> # A tibble: 2 x 2
#>       n    nn
#>   <int> <int>
#> 1     1 26553
#> 2     2    20

Some duplicates might be valid. Worth checking manually.
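To make the manual check easier, the duplicated titles can be listed next to their paper IDs. The following is a minimal sketch in base R on a toy data frame. The column names `paper` and `title` are assumptions about the layout of `papers`, and the row `w0001` is a made-up placeholder.

```r
# Toy stand-in for the `papers` data frame. The column names `paper` and
# `title` are assumptions; the row "w0001" is a hypothetical placeholder.
papers_toy <- data.frame(
  paper = c("w4372", "w17968", "w6841", "w0001"),
  title = c("Corruption", "Corruption", "Inequality", "A Unique Title"),
  stringsAsFactors = FALSE
)

# Flag every row whose title occurs more than once. Scanning from both
# ends marks the first occurrence as well as the later ones.
is_dup <- duplicated(papers_toy$title) |
  duplicated(papers_toy$title, fromLast = TRUE)

# Sort by title so that papers sharing a title sit next to each other.
dups <- papers_toy[is_dup, ]
dups <- dups[order(dups$title), ]
print(dups)
```

Replacing `papers_toy` with the real data frame should give the list of candidates to inspect by hand, provided the column names match.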

Also, two titles reference updated versions:

papers$title[grepl('W[0-9]', papers$title, ignore.case = TRUE)]
#> [1] "State and Local Taxes and the Rate of Return on Nonfinancial Corporate Capital (revised as W0740)"
#> [2] "Wage-Employment Contracts (Replaced by W0675)"

These should probably be removed.
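If "removed" means dropping the affected rows, the same pattern can be negated to keep everything else. This is only a sketch on a toy character vector. The first two titles are the ones printed above, and the third is included as an example of a title that does not match.

```r
# Toy vector of titles: two that reference another working paper and one
# that does not.
titles <- c(
  "Wage-Employment Contracts (Replaced by W0675)",
  "State and Local Taxes and the Rate of Return on Nonfinancial Corporate Capital (revised as W0740)",
  "Tax Incidence"
)

# TRUE where the title contains "W" (either case) followed by a digit.
refs_update <- grepl("W[0-9]", titles, ignore.case = TRUE)

# Keep only the titles without such a reference.
kept <- titles[!refs_update]
print(kept)
```

The pattern is loose: it would also match an unrelated title that happens to contain a "w" followed by a digit, so the matches are worth reading before anything is dropped.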

Answer 1 · 2020-02-03T02:10:06.000Z

I did the manual checking suggested in the comment above. The NBER website states that several papers are accidental re-issues of earlier papers:

w2432 (of w2412);
w7044 (of w6965);
w7565 (of w5901);
w8649 (of w8635).

I will remove the accidental re-issues. I will also remove:

w0623, which was "replaced" by w0675 in the same year;
w9101, which appears to be a (very) minor revision of w9080 published one month later by the same authors;
w9694, which appears to be a re-issue of w9483 published three months later by the same author;
w13410, which is identical to w13409.
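The removals listed above could be applied with a simple ID filter. The sketch below uses a toy data frame, and the column name `paper` is an assumption about how the IDs are stored.

```r
# Paper IDs named above: the four accidental re-issues plus the four
# additional removals.
drop_ids <- c(
  "w2432", "w7044", "w7565", "w8649",
  "w0623", "w9101", "w9694", "w13410"
)

# Toy stand-in for the papers table; the column name `paper` is assumed.
papers_toy <- data.frame(
  paper = c("w2412", "w2432", "w13409", "w13410"),
  stringsAsFactors = FALSE
)

# Keep rows whose ID is not in the drop list. `drop = FALSE` keeps the
# result a data frame even though the toy table has a single column.
cleaned <- papers_toy[!papers_toy$paper %in% drop_ids, , drop = FALSE]
print(cleaned)
```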

Some papers are revisions of earlier papers with identical titles:

w0740 (of w0508);
w5609 (of w4600);
w6753 (of w5945);
w8110 (of w7286);
w8822 (of w7406);
w10071 (of w8937);
w10417 (of w10126);
w13804 (of w12814);
w21421 (of w19155).

I don't think it is fair to remove these revisions. First, there may be similarly substantial revisions that cannot be found by looking for duplicate titles. Second, revisions indicate continued collaboration on, and thought about, a topic, and I think this continuation is important to acknowledge within the data.

Finally, some papers have identical titles but are, in fact, different papers with different sets of authors:

w0243 and w8203 (both titled "Taxation and Corporate Financial Policy");
w4372 and w17968 (both titled "Corruption");
w6841 and w11511 (both titled "Inequality");
w1864 and w8829 (both titled "Tax Incidence");
w20796 and w24521 (both titled "Forward Guidance").
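One way to separate re-issues and revisions from genuinely different papers is to compare author sets within each same-title pair. The sketch below is illustrative only: the paper IDs, author labels, and the `paper_authors` table are hypothetical placeholders, not data from the package.

```r
# Hypothetical paper-author table. IDs "p1".."p4" and author labels
# "A".."D" are placeholders, not real papers or people.
paper_authors <- data.frame(
  paper  = c("p1", "p1", "p2", "p2", "p3", "p4"),
  author = c("A",  "B",  "B",  "A",  "C",  "D"),
  stringsAsFactors = FALSE
)

# Pairs of papers assumed to share a title.
same_title_pairs <- list(c("p1", "p2"), c("p3", "p4"))

# Look up the authors recorded for one paper ID.
authors_of <- function(id) {
  paper_authors$author[paper_authors$paper == id]
}

# TRUE when both papers in a pair have the same set of authors,
# regardless of the order in which the authors are listed.
same_authors <- vapply(
  same_title_pairs,
  function(pair) setequal(authors_of(pair[1]), authors_of(pair[2])),
  logical(1)
)
print(same_authors)
```

Pairs with matching author sets are candidates for re-issues or revisions. Pairs with different author sets are more likely to be distinct papers, though a revision that adds or drops a co-author would also land in the second group and would still need a manual look.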