donboyd5/synpuf

Comparing results from R package synthpop to our results from synthimpute

Opened this issue · 4 comments

Running the R package synthpop

I had been wondering why the early results with synthpop on a file subset seemed to do so well in comparison to the early results with random forests from synthimpute.

I have since done a few runs on the full file with synthpop.

The latest such run that I've looked at is synthpop2.csv in the Google Drive. I also created synthpop3.csv, which I think might be slightly better, but I have not looked at it yet.

Here's what I did:

  • used MARS, XTOT, S006, E00100 (AGI), and E09600 (AMT) as X variables, meaning that they are included among the predictors for every variable
  • I dropped E00100 and E09600 from the synthesized file, of course, because they will be calculated
  • MARS, XTOT, and S006 will be exactly the same in the synthesized as in PUF (I know that @feenberg has expressed some concern about XTOT - I suggest we see what our disclosure measures tell us about this, and I can also run variants that synthesize XTOT)
  • used CART on all variables except:
    -- I synthesized pensions and dividends using the ratio approach discussed in #17
    -- I made a minor mistake with E07600 (credit for prior year minimum tax) and accidentally synthesized it by random sampling
  • created a visit sequence based on the magnitudes of the absolute value of the weighted variables, in reverse order (largest variable first)
  • modified that sequence for one really problematic variable, E02000 (Schedule E net income or loss), by putting it first; I don't know why, but that improved it dramatically
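The visit-sequence ordering described above can be sketched as follows. This is an illustration only, not the code used for these runs (those used the R package synthpop, whose `syn()` function takes a `visit.sequence` argument). It assumes "magnitude" means the absolute value of each variable's weighted sum, with S006 as the weight; the function name `visit_sequence` and the toy records are hypothetical.

```python
def visit_sequence(records, variables, weight_col="S006", force_first=()):
    """Order variables by |weighted sum|, largest first, with optional overrides.

    Assumption: "magnitude" is the absolute value of sum(weight * value).
    """
    totals = {
        v: abs(sum(r[weight_col] * r[v] for r in records)) for v in variables
    }
    ordered = sorted(variables, key=lambda v: totals[v], reverse=True)
    # Variables listed in force_first are moved to the front, in the order given.
    forced = [v for v in force_first if v in totals]
    return forced + [v for v in ordered if v not in forced]


# Made-up toy records, for illustration only.
toy = [
    {"S006": 100, "E00200": 50000, "E00300": 200, "E02000": -3000},
    {"S006": 200, "E00200": 30000, "E00300": 0, "E02000": 1000},
]
cols = ["E00200", "E00300", "E02000"]
by_size = visit_sequence(toy, cols)
e02000_first = visit_sequence(toy, cols, force_first=["E02000"])
```

Passing `force_first=["E02000"]` mirrors the manual override for Schedule E net income or loss; everything else keeps its size-based order.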

The results, maybe not surprisingly, look pretty good; I put summary results in the html file eval_taxcalc_2018-12-17.html in the Google Drive folder synpuf_analyses I set up for summary results that can be public.

I ran the results through Tax-Calculator. The graph below compares taxbc (tax before credit) for synpuf6 (latest done with synthimpute and random forests), and the first two versions with the R package synthpop (synthpop1 and synthpop2):

image

Along the way, I compared some subset results to the same subset of synpuf6. There were often enormous differences, and the CART approach usually came out better. It makes me think I should add comparisons by marital status to the evaluation program I have.
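A comparison by marital status could look something like the sketch below. It is not part of the existing evaluation program. It assumes each file is available as a list of records with MARS, S006, and the variable of interest as keys; the helper names and the toy numbers are hypothetical.

```python
from collections import defaultdict


def weighted_sum_by_group(records, value_col, weight_col="S006", group_col="MARS"):
    """Weighted sum of value_col within each group (here, each MARS code)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[group_col]] += r[weight_col] * r[value_col]
    return dict(totals)


def compare_by_group(puf, syn, value_col):
    """Return {group: (puf_total, syn_total)} over every group seen in either file."""
    p = weighted_sum_by_group(puf, value_col)
    s = weighted_sum_by_group(syn, value_col)
    return {g: (p.get(g, 0.0), s.get(g, 0.0)) for g in sorted(set(p) | set(s))}


# Made-up toy records, for illustration only.
puf_toy = [
    {"MARS": 1, "S006": 100, "taxbc": 10.0},
    {"MARS": 2, "S006": 50, "taxbc": 40.0},
]
syn_toy = [
    {"MARS": 1, "S006": 100, "taxbc": 12.0},
    {"MARS": 2, "S006": 50, "taxbc": 38.0},
]
by_mars = compare_by_group(puf_toy, syn_toy, "taxbc")
```

Because MARS and S006 are carried over unchanged from PUF in the synthpop runs, any gap within a MARS group reflects the synthesized variable rather than a shift in group sizes.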

In general, the synthpop/CART approach seems to produce results closer to PUF than the random forests approach; I would like to understand why. I would also like to see how the results do on disclosure measures.

@feenberg would you be able to run synthpop2.csv and synthpop3.csv through your tax-reform comparisons program? I have not yet filled in all of the categorical variables, or a few near-zero continuous variables. The ones that may give you trouble are:

DSI, f6251, MIDR, fded, EIC, n24, f2441, e09800, e09700, p08000

It would be fine to drop them. I'll address them in a later run.

Many thanks. I have put "synscore synpuf6 vs synthpop2-wgt.xlsx" in the synpuf_analyses Google Drive folder. It compares the synpuf6 results (random forests, Python) with the synthpop2 results (CART, R).

Here are the highlights:

  • Here are some medians across all of the comparisons, plus errpc for a few key variables:

                                synpuf6    synthpop2
    median(errpc)                  38.9          3.1
    median(abs(errpc))             38.9          9.7
    AGI errpc                      12.5          1.2
    tax after credit errpc         24.8          4.5
    Taxable IRA dist errpc        168.1         13.3
    SALT errpc                      3.0          0.7
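For readers unfamiliar with the summary measures, here is a minimal sketch of how medians like these could be computed. The spreadsheet does not define errpc in this thread, so the sketch assumes it is the percent error of a synthetic-file total relative to the PUF total; the function names and toy inputs are hypothetical.

```python
from statistics import median


def errpc(syn_total, puf_total):
    """Percent error of a synthetic total relative to the PUF total (assumed definition)."""
    return 100.0 * (syn_total - puf_total) / puf_total


def summarize(errors):
    """Median signed error and median absolute error across comparisons."""
    errors = list(errors)
    return {
        "median(errpc)": median(errors),
        "median(abs(errpc))": median([abs(e) for e in errors]),
    }


# Made-up percent errors, for illustration only.
summary = summarize([-10.0, 5.0, 20.0])
```

The two medians differ whenever errors have mixed signs: offsetting over- and under-estimates pull median(errpc) toward zero, while median(abs(errpc)) still reports their typical size.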

I've copied a scatterplot of abs(errpc) below.

The good news is that the synthpop CART approach appears to yield significant improvement, especially on some important variables. The bad news is (a) it still has some big errors, and (b) we have to investigate the potential disclosure trade-off. Hopefully this provides some useful diagnostic information for @MaxGhenis.

image

  1. Comparison of synpuf8 and synthpop3 (for file descriptions see https://docs.google.com/spreadsheets/d/1qTQJd2DGMm5zXnFxyP2Rw-8rszOkLNFrobg-NIikIsw)
  2. Updated evaluation programs - @andersonfrailey, the updates are pushed to GitHub. The programs are now more modular and substantially improved.

The two files are fairly close to apples to apples. Both include AGI and AMT as predictors. synpuf8 uses random forests and synthpop3 uses CART.

I put the results of this comparison, eval_2018-12-18-12_21.html, in the public folder.

It includes comparisons of unweighted correlations of variables, weighted sums of selected variables, plots of cumulative distribution functions of weighted values, and a summary measure of the CDFs. I did something quick here: for each file and variable, I computed the cumulative percent of the weighted values, summarized each into 1,000 ntiles, calculated the Kolmogorov-Smirnov p-statistic for each relative to the PUF, and ranked them by the p-statistic. @MaxGhenis, it would be good to talk, as I am not sure this is a great measure; it is not very discriminating. Perhaps we can talk about quantile loss.
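The CDF summary can be sketched roughly as below. This is an interpretation, not the evaluation program's code: it assumes "cumulative percent of the weighted values" means the running share of sum(weight * value) over records sorted by value, sampled at evenly spaced ntiles. It then reports the largest gap between two such curves, which is a K-S-style distance and not the p-statistic reported in the html file. It also assumes the weighted total is nonzero. All names are hypothetical.

```python
def cumulative_share(values, weights, n_tiles=1000):
    """Running share of the weighted total, sampled at n_tiles evenly spaced ranks."""
    pairs = sorted(zip(values, weights))
    weighted = [v * w for v, w in pairs]
    total = sum(weighted)
    if total == 0:
        raise ValueError("weighted total is zero; cumulative share is undefined")
    running = 0.0
    cum = []
    for x in weighted:
        running += x
        cum.append(running / total)
    n = len(cum)
    # Index of the last record in the k-th ntile: ceil(k * n / n_tiles) - 1.
    return [cum[-(-k * n // n_tiles) - 1] for k in range(1, n_tiles + 1)]


def max_gap(curve_a, curve_b):
    """Largest absolute difference between two curves sampled at the same ntiles."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))


# Made-up toy data, for illustration only.
puf_curve = cumulative_share([1, 2, 3, 4], [1, 1, 1, 1], n_tiles=4)
syn_curve = cumulative_share([1, 1, 1, 1], [1, 1, 1, 1], n_tiles=4)
gap = max_gap(puf_curve, syn_curve)
```

Variables that can be negative, such as E02000, make the running share non-monotone, so a distance of this kind may be harder to interpret for them.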

Here is an ordering of some of the worst correlations with wages:

image

Here are a few weighted sums:

image

Here are some of the K-S p-statistics:

image

Both files appear to be very good on wages, so here is a CDF of interest income:
image

My own assessment is that CART is still doing better than random forests, but they are much closer now. You can look at the html file and draw your own conclusions. I think we still need a lot more information. I will work to improve this.