-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collapsing strings #25
Comments
Well, I wrote the entire code base assuming sources for summary stats had to be numeric. I note in the docs that gcollapse can't do this, but let me know if it should be more prominently stated. That is, I wrote the C base and the internals assuming the sources are numeric, and only after noticed they could be string for some commands. I'm not sure I can be persuaded to change it, as it would involve a non-trivial amount of work for what I see as little gain, but feel free to make a case for it. |
Oh, I agree it's not worth it, and now that I see it it's also documented on the help file. The way I see it, gtools is mostly useful with big datasets. And strings are not what you want to have with big datasets, so the usage is low. |
I had a few relevant comments for this discussion.
|
@mdroste While I have come to see the value in adding strings to gcollapse, it seems to me to be way too much work.
I think a nice compromise is perhaps writing extensions to gegen that allow strings? Let me know how that sounds, and the types of functions that would be useful. |
Hey Mauricio - I totally understand about the complete rewriting the codebase of gtools - I do not intend to relitigate the earlier discussion here. My point was basically that one can leverage the built-in Stata functions encode/decode to allow gtools to collapse strings (albeit not blazingly fast, since encode/decode are themselves a little slow) without needing to rewrite any C code at all. Does that make sense? I am not suggesting you actually do this, and I leave the cost-benefit analysis here to you since you're better informed -- I just wanted to point out that a rewrite of gtools is not necessary to allow for collapsing strings via (first), (firstnm), (last), etc. In words, gcollapse would be modified as follows: Check for numeric type variables, as before. If there are string types, make sure they are only present where sensible, i.e. (first), (last), etc. Throw an error if string variables present where type is not supported (i.e. not first or last or whatever), otherwise encode any string variables. Run gcollapse as normal on the encoded variables. Afterwards, decode any encoded variables. This isn't super-fast, but is still way faster than collapse, and importantly allows collapses that would have run before with string variables to continue to run with gcollapse. |
I wouldn't want to do the decoding/encoding internally, though. It would be much easier for me if the user did that externally (and much more transparent about what is happening; not to mention I wouldn't have to deal with a bunch of internal debugging). I'd be willing to provide bulk encode and decode utilities, if that's something you think will help? So sth like
|
Now that Stata has frames, wouldn't it be easy to allow the user to save the string-number pair in a separate frame? EG: * Create dataset
clear all
sysuse auto, clear
decode foreign, gen(str_foreign)
drop foreign
* Create category
gegen byte foreign = group(str_foreign)
* Create frame with number-string pair
frame put foreign str_foreign, into(foreign_keys)
frame foreign_keys: bys foreign: keep if _n == 1
frame foreign_keys: list
* Now we can drop the string (and merge it later if we want to)
drop str_foreign |
@sergiocorreia I won't have Stata 16 any time soon, though, so I wouldn't be able to test that out. However, it does seem like a better solution for users who can implement it. You'd need one frame per variable. I'd suggest this, though, to avoid the sort:
In your example,
|
That's even better than saving the keys (because we know they will be 1..N). One thing though is that frame put only allows |
It doesn't allow foreach var in strvar1 strvar2 ... {
local type_`var': type `var'
frame create `var'_keys
glevelsof `var', nolocal gen(uniq_) group(int_`var')
frame `var'_keys: set obs `=r(J)'
frame `var'_keys: gen `type_`var'' uniq_`var' = _frval(default, uniq_`var', _n)
drop uniq_`var'
}
// gcollapse using int_`var'
foreach var in strvar1 strvar2 ... {
gen `type_`var'' `var' = _frval(`var'_keys, uniq_`keys', int_`var')
drop int_`var'
} So clear all
sysuse auto, clear
local var make
local type_`var': type `var'
frame create `var'_keys
glevelsof `var', nolocal gen(uniq_) group(int_`var')
frame `var'_keys: set obs `=r(J)'
frame `var'_keys: gen `type_`var'' uniq_`var' = _frval(default, uniq_`var', _n)
drop uniq_`var'
gcollapse (min) int_`var', by(rep78)
gen `type_`var'' `var' = _frval(`var'_keys, uniq_`keys', int_`var')
drop int_`var' |
Just a quick question: is there a reason why we can't use strings as targets in collapse?
EG:
Returns:
On principle, some target variables (first, last, firstnm, lastnm) could be strings.
The text was updated successfully, but these errors were encountered: