Comparing results from R package synthpop to our results from synthimpute
Running the R package synthpop
I had been wondering why the early results with synthpop on a file subset seemed to do so well in comparison to the early results with random forests from synthimpute.
I have since done a few runs on the full file with synthpop.
The latest such run that I've looked at is synthpop2.csv in the Google Drive. I also created synthpop3.csv, which I think might be slightly better, but I have not looked at it yet.
Here's what I did (a rough sketch of the corresponding synthpop call follows the list):
- used MARS, XTOT, S006, E00100 (AGI), and E09600 (AMT) as X variables, meaning that they are included among the predictors for every variable
- I dropped E00100 and E09600 from the synthesized file, of course, because they will be calculated
- MARS, XTOT, and S006 will be exactly the same in the synthesized as in PUF (I know that @feenberg has expressed some concern about XTOT - I suggest we see what our disclosure measures tell us about this, and I can also run variants that synthesize XTOT)
- used CART on all variables except:
-- I synthesized pensions and dividends using the ratio approach discussed in #17
-- I made a minor mistake with E07600 (credit for prior year minimum tax) and accidentally synthesized it by random sampling
- created a visit sequence based on the weighted absolute magnitudes of the variables, in descending order (largest variable first)
- modified that sequence for one really problematic variable, E02000 (Schedule E net income or loss), by putting it first; I don't know why, but that improved it dramatically
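For concreteness, here is a rough sketch of how that synthpop call could be set up. This is not the actual run script: the data frame name (puf), the use of S006 as the weight, and the ordering details are assumptions, and the sketch omits the ratio-based handling of pensions and dividends from #17.

```r
library(synthpop)

keep  <- c("MARS", "XTOT", "S006")       # retained exactly as in the PUF
xvars <- c(keep, "E00100", "E09600")     # available as predictors throughout
rest  <- setdiff(names(puf), xvars)      # everything else (assumed numeric here)

# Visit sequence: the X variables first, then the remaining variables in
# descending order of weighted absolute magnitude, with E02000 (Schedule E)
# forced to the front of that group.
wtd_size <- sapply(rest, function(v) sum(abs(puf[[v]]) * puf$S006))
rest_ord <- names(sort(wtd_size, decreasing = TRUE))
vseq     <- c(xvars, "E02000", setdiff(rest_ord, "E02000"))

# Method vector in data-column order: "" keeps a variable at its PUF values,
# "cart" synthesizes it ("sample", used by mistake for E07600, draws randomly).
methods <- setNames(ifelse(names(puf) %in% keep, "", "cart"), names(puf))

synout  <- syn(puf, method = methods, visit.sequence = vseq, seed = 1234)
synfile <- synout$syn
synfile$E00100 <- NULL   # dropped because Tax-Calculator recomputes AGI
synfile$E09600 <- NULL   # dropped because Tax-Calculator recomputes AMT
```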
The results, maybe not surprisingly, look pretty good. I put summary results in the file eval_taxcalc_2018-12-17.html in the synpuf_analyses Google Drive folder I set up for summary results that can be public.
I ran the results through Tax-Calculator. The graph below compares taxbc (tax before credits) for synpuf6 (the latest file done with synthimpute and random forests) and the first two versions done with the R package synthpop (synthpop1 and synthpop2):
Along the way, I compared some subset results to the same subset of synpuf6; there often were enormous differences, usually favoring the CART approach. It makes me think I should add comparisons by marital status to my evaluation program.
In general, the synthpop/CART approach seems to produce results closer to the PUF than the random forests approach; I would like to understand why, and to see how the results do on disclosure measures.
@feenberg, would you be able to run synthpop2.csv and synthpop3.csv through your tax-reform comparisons program? I have not yet filled all of the categorical variables, or a few near-zero continuous variables, with values. The ones that may give you trouble are:
DSI, f6251, MIDR, fded, EIC, n24, f2441, e09800, e09700, p08000
It would be fine to drop them. I'll address them in a later run.
Many thanks. I have put "synscore synpuf6 vs synthpop2-wgt.xlsx" in the synpuf_analyses Google Drive folder. It compares the synpuf6 results (random forests, Python) with the synthpop2 results (CART, R).
Here are the highlights:
Here are some medians across all of the comparisons:
| Measure | synpuf6 | synthpop2 |
| --- | ---: | ---: |
| median(errpc) | 38.9 | 3.1 |
| median(abs(errpc)) | 38.9 | 9.7 |
| AGI errpc | 12.5 | 1.2 |
| Tax after credit errpc | 24.8 | 4.5 |
| Taxable IRA dist errpc | 168.1 | 13.3 |
| SALT errpc | 3.0 | 0.7 |
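errpc is shorthand for percent error relative to the PUF. A minimal sketch of one version of that calculation (weighted totals; the data frame and weight names are placeholders, and the actual comparisons may be defined at a finer level):

```r
# Hedged sketch: errpc taken as the percent error of a synthesized file's
# weighted total relative to the PUF's for a given variable.
errpc <- function(var, synfile, puf, wtvar = "S006") {
  syn_sum <- sum(synfile[[var]] * synfile[[wtvar]])
  puf_sum <- sum(puf[[var]] * puf[[wtvar]])
  100 * (syn_sum - puf_sum) / puf_sum
}

# e.g., across a hypothetical vector of variable names vars_to_compare:
# median(abs(sapply(vars_to_compare, errpc, synfile = synthpop2, puf = puf)))
```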
I've copied a scatterplot of abs(errpc) below.
The good news is that the synthpop CART approach appears to yield significant improvement, especially on some important variables. The bad news is (a) it still has some big errors, and (b) we have to investigate the potential disclosure trade-off. Hopefully this provides some useful diagnostic information for @MaxGhenis.
- Comparison of synpuf8 and synthpop3 (for file descriptions see https://docs.google.com/spreadsheets/d/1qTQJd2DGMm5zXnFxyP2Rw-8rszOkLNFrobg-NIikIsw)
- Updated evaluation programs - @andersonfrailey, the updates are pushed to GitHub; the code is more modular and substantially improved.
The two files are fairly close to an apples-to-apples comparison. Both include AGI and AMT as predictors; synpuf8 uses random forests and synthpop3 uses CART.
I put the results of this comparison, eval_2018-12-18-12_21.html, in the public folder.
It includes comparisons of unweighted correlations of variables, weighted sums of selected variables, plots of cumulative distribution functions of weighted values, and a summary measure of the CDFs. I did something quick here: for each file and variable, I computed the cumulative percent of the weighted values, summarized it into 1,000 ntiles, then calculated the Kolmogorov-Smirnov p-value for each file relative to the PUF, and ranked the variables by that p-value. @MaxGhenis, it would be good to talk, as I am not sure this is a great measure and it does not seem very discriminating. Perhaps we can talk about quantile loss.
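A rough sketch of one way that summary measure could be computed (not the actual evaluation code; data frame and weight names are placeholders):

```r
# Hedged sketch: summarize each weighted distribution into 1,000 quantiles and
# run a two-sample K-S test on those summaries relative to the PUF.
wtd_ntiles <- function(x, w, n = 1000) {
  ord  <- order(x)
  cump <- cumsum(w[ord]) / sum(w)          # weighted cumulative percent
  approx(cump, x[ord], xout = (1:n) / n, ties = "ordered", rule = 2)$y
}

ks_summary <- function(var, synfile, puf, wtvar = "S006") {
  ks.test(wtd_ntiles(puf[[var]], puf[[wtvar]]),
          wtd_ntiles(synfile[[var]], synfile[[wtvar]]))   # statistic and p-value
}

# e.g. ks_summary("E00300", synthpop3, puf)   # interest income
```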
Here is an ordering of some of the worst correlations with wages:
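For reference, a rough sketch of how such a ranking might be computed (data frame names are placeholders; only numeric columns shared by both files are compared):

```r
# Hedged sketch: rank variables by how far the synthesized file's unweighted
# correlation with wages (E00200) falls from the PUF's.
common   <- setdiff(intersect(names(puf), names(synthpop3)), "E00200")
num_vars <- common[sapply(puf[common], is.numeric)]
cor_puf  <- sapply(num_vars, function(v) cor(puf$E00200, puf[[v]]))
cor_syn  <- sapply(num_vars, function(v) cor(synthpop3$E00200, synthpop3[[v]]))
head(sort(abs(cor_puf - cor_syn), decreasing = TRUE), 10)   # ten largest gaps
```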
Here are a few weighted sums:
Here are some of the K-S p-values:
Both files appear to be very good on wages, so here is a CDF of interest income:
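A rough sketch of how such a weighted CDF overlay might be drawn (data frame and weight names are placeholders):

```r
# Hedged sketch: overlay weighted CDFs of interest income (E00300) for the PUF
# and a synthesized file.
wtd_cdf <- function(x, w) {
  ord <- order(x)
  list(x = x[ord], p = cumsum(w[ord]) / sum(w))
}
p_puf <- wtd_cdf(puf$E00300, puf$S006)
p_syn <- wtd_cdf(synthpop3$E00300, synthpop3$S006)
plot(p_puf$x, p_puf$p, type = "l",
     xlab = "Interest income (E00300)", ylab = "Weighted cumulative share")
lines(p_syn$x, p_syn$p, col = "red")
legend("bottomright", c("puf", "synthpop3"), col = c("black", "red"), lty = 1)
```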
My own assessment is that CART is still doing better than random forests, but they are much closer now. You can look at the html file and draw your own conclusions. I think we still need a lot more information. I will work to improve this.