NickCH-K/did

Group is a string variable

Closed this issue · 33 comments

Attempting att_gt (haven't gotten it to work yet) on an ever-simplifying version of my model. My most recent recurring error is: "Group is a string variable" in mkmat. My gvar is an int var, so I don't believe the error is what it suggests.

A possibly relevant detail is I kept running into an "Inf not found" error with call_return.ado before I added a line in the ado to bypass it (https://github.com/haghish/rcall/pull/17/files/1a97cbb7aa01c1b803a967952149655a6b4e554c#). However, a basic rcall regression is returning results.

Why might this Group string error arise?

Can you post minimal code/data that will produce the errors? It will be very difficult for me to track the bug down otherwise.

My guess is that the regression output includes NAs in the results, and rcall is having trouble sending those results back to Stata. Is this correct? Or are you not getting any results to screen at all?

Also, if your group var has a value label on it, try removing it and see if that works.
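A quick way to rule out both explanations from the Stata side is sketched below. It assumes the group variable is literally named group; substitute your own name.

```stata
* Confirm the storage type (should be byte/int/long, not str#)
describe group
confirm numeric variable group

* Show the value labels defined in memory, then detach any label from group
* (the label definition itself stays in memory)
label list
label values group .
```

If confirm exits without an error, the variable is numeric in Stata, and the "string variable" message is more likely about something created on the way back from R.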

Sample code:
att_gt y time group x_1 x_2 x_3, idname(id) anticipation(2) biters(10) clustervars(id)

id, group, time are integers.
y, x_1, x_2, x_3 are double.
None contains missing values. None has value labels.

I suspect the same as you -- the output is NAs and stata isn't recognizing it. I just don't know why that would be or how to check if that is the case.

I am not getting any results, just the error:
string variables not allowed in varlist;
Group is a string variable
r(109);

Other comments:
I'm hitting rcall clear after each attempt since it tells me to redo didsetup otherwise, and I run into issues when I redo didsetup so I've learned to avoid it or just uninstall/reinstall everything when I can't.

And in my blue R screen it tells me that there are 50 or more warnings before it crashes into the Group is a string variable error.

When I run your code with the following DGP I get no error. That's telling me there's something going on in your data that doesn't fit.

clear

set obs 500

* Outcome and covariates
g y = rnormal()
g x_1 = rnormal()
g x_2 = rnormal()
g x_3 = rnormal()

* ID and time indicators
g id = floor((_n-1)/5)
g time = mod(_n-1, 5)

* First period treated should be the same for all obs in id
g group = floor(runiform()*5)
sort id time
replace group = group[_n-1] if id == id[_n-1]
* Untreated groups
replace group = 0 if id < 25

att_gt y time group x_1 x_2 x_3, idname(id) anticipation(2) biters(10) clustervars(id)

Some notes:

  1. Try rcall: summary(CS_Model). If you see the output, that means the model ran correctly but there was an error in the process of sending it back to Stata
  2. If it DOES work, try rcall: table[['Group']] to see the variable it seems to be having trouble with
  3. rcall has "sticky" errors sometimes where once an error pops up it will continue to pop up for everything you do even if there's not an actual error. rcall clear won't fix it, you have to restart Stata.
  4. Check the gvar to make sure every value is a valid time period or 0.
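Note 4 can be checked directly in Stata. This is only a sketch, assuming the variable names id, time, and group from the call above and that 0 marks never-treated units:

```stata
* gvar must be constant within each unit
bysort id (time): assert group == group[1]

* Collect the observed periods and the nonzero gvar values
levelsof time, local(periods)
levelsof group if group != 0, local(gvals)

* Report any gvar value that never appears as a time period
foreach g of local gvals {
    local found : list g in periods
    if !`found' {
        display as error "group == `g' is not an observed value of time"
    }
}
```

The assert stops with an error if any unit changes group over time, and the loop prints one line per gvar value that is not an observed period.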

This is promising. rcall: summary(CS_Model) returned a table with the Group and Time columns complete. Most groups have ATTs and se's, but the singular-N groups are straight NAs (makes sense). However, none have confidence bands.

Great, so the model is running. You might want to drop those singletons, that might be the only thing gumming it up.
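One way to find the singletons before dropping anything, sketched with the variable names from the call above (unit_tag and n_units are made-up helper names):

```stata
* Flag one row per unit, then count distinct units in each group
egen byte unit_tag = tag(id)
bysort group: egen n_units = total(unit_tag)

* Number of units per group, one row per group
tabulate group if unit_tag

* Optionally drop treated groups that contain a single unit
* (assumes group == 0 marks the never-treated units)
* drop if n_units == 1 & group != 0
```

The drop is left commented out so the group sizes can be inspected first.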

I also just now updated to handle models with lots of group/times a little better. Try reinstalling did and see if that helps.

Still hitting the "Group is a string variable" error, even after the update. The CS_Model now contains confidence bands for those groups that have estimates. There are rows of NAs for those without estimates, sometimes for groups as large as 5 (though some singleton groups are reaping estimates). I'm upping my sample to see if I'm out of the woods with the large sample, but it's taking some time.

Good luck! I don't think I can help any more without being able to reproduce the error myself. If you can share the data (or a subset) that produces the error I can look again.

This last attempt gave me NAs for a group of N_g = 285. I think this is coming down to covariate selection, so I dropped all but one and was able to get results for all groups. Now I'm just hitting the matsize limit, which I think is fine. I'll upload a dummy dataset for you to troubleshoot.

Thanks for all your help, by the way. This code is awesome, and way faster than my attempts at running C&S through rcall.

Here's a sample of N=2241, G=36, T=72. One group (44) is a singleton, but many will give NAs depending on which X's are included. Could be an overlap issue for some groups? If this is unavoidable, then the error in this case ("Group is a string") could be replaced with something along the lines of "choose better X's".

I'm going to pause on this for now.

samp.TXT

It was an issue with the results table itself being too big to return, that's fixed now.

Fundamentally, Stata's just not going to like it whenever there's a results matrix with missing values, as there is in this data. So you'll get the results back, but they won't be in e(b) or e(V). Any follow-up aggte is likely to be a little shaky too.

Some good news (?) is I'm returning to the Inf error again, but I haven't traced the error yet to see what I need to fix (likely in the call_return ado). And I am attempting to migrate the code to my school's hpcc to use MP, but the default rcall settings aren't working and I'm having to alter the rcall ado setting to find R first. Is there a reason why didsetup reinstalls rcall every time instead of checking if it's installed first?

Ohp, my bad! The autoinstall confusion is a function of me using the "go" option. Disregard that last question

I wasn't getting the Inf error when I ran your data with the newest version, so there may be something else going on there. Do you see results with rcall: summary(CS_Model)? If so, that's about as good as you'll get anyway; Stata's not going to let you do a lot of postestimation if your results have improper values. You can get access to the results table by writing it to file with rcall: write.csv(table, 'filename.csv')
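Once the table has been written out, it can be pulled back into Stata for inspection. A rough sketch, assuming the file landed in Stata's current working directory (that depends on where the R session writes, so adjust the path if needed) and keeping the placeholder name filename.csv:

```stata
* Check where Stata is looking for the file
pwd

* Keep the estimation data safe while inspecting the table
preserve
import delimited using "filename.csv", varnames(1) clear

* R writes missing values as NA, which makes a column import as string;
* force converts anything nonnumeric to Stata missing (.)
destring, replace force

describe
list if _n <= 10
restore
```

The preserve/restore pair means the table is gone after restore; add a save before restore if you want to keep it as a .dta file.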

Also, none of the calculations are actually done in Stata, they're all in R. So switching to MP might help you open a bigger data set or something, but it won't make any of the did estimations any faster.

I resolved the Inf issue (it was my fault, a typo). Got everything loaded in the hpcc, and it's nicely spitting out the temp_table_toobig csv, but not storing in e() as you said. I am re-running and attempting to gen dynamic effects w/ aggte. Should the na_rm option take care of the aggte "shaky"ness?

You'll definitely want to add na_rm if you have blank entries in your results. I'm not sure if it will fully handle the shakiness - I don't think in general the did estimator is really made to handle huuuge numbers of group/time combinations. If it looks like it ran but you get nothing back, you can again do rcall: write.csv(table, 'filename.csv')

Is there any way of getting Ns reported with the table?

If the estimation works properly and everything gets returned to Stata, N can be found in e(N). If not, rcall: CS_Model[['n']] will give the number of unique group IDs, but it doesn't store the actual number of observations.

Not sure how I'm going to get aggte working, and I can't calculate on my own without the N. Right now I'm trying to drop when time - group > max_e before I run att_gt. If that doesn't work, I'll try dropping groups where N_g < # of controls + 5. And if that doesn't work, I'll just learn R.
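The two pruning steps could look roughly like this in Stata. The values of max_e and k are made up for illustration, first_row and units_in_group are hypothetical helper names, and group == 0 is assumed to mark never-treated units:

```stata
* Hypothetical cap on event time (periods since first treatment)
local max_e 10
* Hypothetical number of covariates in the att_gt call
local k 3

* Event time is only defined for treated groups
generate event_time = time - group if group != 0
drop if group != 0 & event_time > `max_e'

* Count units per group, then drop treated groups with fewer than k + 5 units
egen byte first_row = tag(id)
bysort group: egen units_in_group = total(first_row)
drop if group != 0 & units_in_group < `k' + 5
```

Dropping late periods for treated units leaves an unbalanced panel; whether att_gt accepts that is not settled in this thread.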

Nothing I've tried has avoided "Missing values in estimation results; results have not been delivered to e() matrices." from att_gt and, consequently (?), [1] "R failed to produce estimates, or rcall failed to return it to Stata." followed by "ATT not found" from aggte. I am, however, getting an att_gt table automatically written to temp_table_toobig.csv.

Yeah I'm not sure. Missing values in the estimation results will screw up aggte, and can't be delivered to the e() matrices. This seems like an issue of the estimation/function not being able to produce estimates for the data/model, not a problem with the results being passed back and forth from Stata to R. If you ran the model in R directly you'd be running into the same problems.

I've pruned the data to avoid NAs in the output, but I'm still hitting the same errors. I hadn't noticed yet that my confidence bands are astronomical despite modest standard errors. Have you encountered this before?

For example:

  Group Time    ATTgt       SE SimultCI95_Bot SimultCI95_Top
1    14    2  -0.0071 0.004897       -5.3E+09       5.29E+09
2    14    3  0.02191 0.005858       -6.3E+09       6.33E+09
3    14    4 -0.00131 0.004725       -5.1E+09        5.1E+09
4    14    5 -0.00604 0.002862       -3.1E+09       3.09E+09
5    14    6 -0.02046 0.003583       -3.9E+09       3.87E+09
6    14    7 0.012281 0.003488       -3.8E+09       3.77E+09
7    14    8  -0.0067 0.004291       -4.6E+09       4.64E+09

Ah okay, so the zero-heavy y is the culprit. Not a ton I can do about that. I suspect the group issue has to do with little variation rather than size since the three I kicked out to avoid NAs had 2,619, 1,654, and 272 individual units. Would running again with the cband_no option help aggte find att_gt?

Oh wow, okay I was thinking these groups were not that small since they are large relative to other groups that do yield estimates for every t. For example, my smallest group contains 17 units and yields ATT_gts for all t's.

The lesser cbands at least appeared reasonable next to the standard errors, but I still can't get aggte to find my ATTs.

It does smell like lack of overlap, but that's why it is puzzling to get estimates for a group of 17 but not for a group of 2619.

An update: Increasing biters solved the confidence interval issue, so simple fix.
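For anyone landing here with the same symptom: the earlier calls in this thread used biters(10), and ten bootstrap draws is presumably too few to pin down the critical value behind the simultaneous bands. A sketch of the same call with more iterations (1000 is an illustrative value, not a documented recommendation for this command):

```stata
* Same call as earlier in the thread, with more bootstrap iterations
att_gt y time group x_1 x_2 x_3, idname(id) anticipation(2) biters(1000) clustervars(id)
```

Expect a longer run time, since every bootstrap iteration is computed in R.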

However, I still can't get aggte to work. Even using the example data and code, I get from att_gt:

             Length Class     Mode   
group          12   -none-    numeric
t              12   -none-    numeric
att            12   -none-    numeric
V_analytical    1   dgCMatrix S4     
se             12   -none-    numeric
c               1   -none-    numeric
inffunc      6000   dgCMatrix S4     
n               1   -none-    numeric
W               1   -none-    numeric
Wpval           1   -none-    numeric
aggte           0   -none-    NULL   
alp             1   -none-    numeric
DIDparams      26   DIDparams list   

Is aggte supposed to be NULL here? Because I then get [1] "R failed to produce estimates, or rcall failed to return it to Stata." ATT not found. Is it possibly an issue with Linux compatibility?

If you're getting errors even with the example code, it's possible that rcall doesn't work properly on Linux - I haven't had a chance to test it myself.