mcaceresb/stata-gtools

gtools does not group strL variables correctly

ractingmatrix opened this issue · 6 comments

Hi, when running gcollapse with a by() variable of type strL, the results appear to be wrong. To reproduce:


clear *
set obs 4
g strL name = strofreal(round(_n,2)) 
g id = round(_n,2)
g value = _n
gcollapse (max) value = value , by(name id)

Actually, this is a limitation of the version of the Stata plugin interface I am using. I wrote the plugin using version 2, which makes no mention of strL variables being an issue (here). However, version 3 notes that the API method for accessing string variables only works for non-strL variables, and it adds special functions to deal with strL variables (here).

First of all, I have to throw an error if the variable type is strL. I don't think version 2.0 of the plugin can deal with it correctly. The error will note:

  • strL not supported by the Stata C API 2.0
  • Try compress varname

That's done (will merge to master soon). I now need to compile a separate version of the plugin for Stata 14 and above, but I don't own Stata 14. I might be able to do it through servers or my library's computers, so it'll take me a bit longer to fix. For now the function just throws an error for strL variables.

Actually, as I'm getting around to this I've noticed that I can't write to a strL variable, even with the 3.0 interface. I think I can implement this for most functions anyway, but, crucially, not for gcollapse (or gcontract, for that matter).

Would it even make sense to collapse a strL? One alternative is to just recast it as a str# and, if that is not possible (the max length is too long), raise an error.

This is basically what happens at the moment. Though I don't recast as str# (I'm not entirely sure I should; though I suppose I could...), I just suggest the user try that.

It's just annoying that I can't entirely support strL variables because of a plugin limitation.

I guess I could add an option, force or something, to treat strL as str# wherever possible. That seems like the better compromise.

This turned out to be way more complicated than I thought because I have to make a special case for when strL contains binary data; in that case all the string functions I use will give the wrong answer. I'd have to create a special array that contains the number of bytes in each entry, and change strcmp to memcmp and sprintf to snprintf for binary data.
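To illustrate why binary data breaks the ordinary string functions, here is a minimal C sketch. It is not the gtools code: the `bin_entry` struct and the two helper names are hypothetical, made up for this example. The idea is only that once each entry carries its own byte count, comparisons can use `memcmp` over the stored length instead of relying on `strcmp`, which stops at the first NUL byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container (not from gtools): the raw bytes of one
 * entry plus an explicit byte count, since binary data may contain
 * NUL bytes and so cannot rely on a terminator. */
typedef struct {
    const char *bytes;
    size_t len;
} bin_entry;

/* Length-aware comparison. Compares the common prefix with memcmp;
 * if the prefixes match, the shorter entry sorts first.
 * Returns -1, 0, or 1. */
static int bin_entry_cmp(const bin_entry *a, const bin_entry *b)
{
    assert(a != NULL && b != NULL);

    size_t n = a->len < b->len ? a->len : b->len;
    int c = (n > 0) ? memcmp(a->bytes, b->bytes, n) : 0;

    if (c != 0)
        return c < 0 ? -1 : 1;
    if (a->len == b->len)
        return 0;
    return a->len < b->len ? -1 : 1;
}

/* Returns 1 if the entry has a NUL byte inside its stated length,
 * i.e. NUL-terminated string functions would truncate it. */
static int has_embedded_nul(const char *bytes, size_t len)
{
    return len > 0 && memchr(bytes, '\0', len) != NULL;
}
```

For example, the 4-byte entries `{'a','b','\0','c'}` and `{'a','b','\0','d'}` look identical to `strcmp`, because it stops at the NUL in the third byte, while `bin_entry_cmp` tells them apart. A check like `has_embedded_nul` could also be one way to decide whether an entry is safe to treat as an ordinary string.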

At any rate, I will implement support for very long strings in version 0.14, but not for cases where strL contains binary data (at least not yet). Those cases will throw an error.