Minimum number of treated samples
charlottelambert opened this issue · 2 comments
charlottelambert commented
When my data frame has only one treated sample and at least one control sample, I'm able to use psmpy
just fine. However, if I have two treated samples, I get the following error.
<ipython-input-475-f46777f657a4> in make_matches(df, treatment_key, id_key, match_columns)
      1 def make_matches(df, treatment_key='treatment', id_key='id', match_columns=['id', 'created_utc', 'treatment', 'score']):
      2     psm = PsmPy(df[match_columns], treatment=treatment_key, indx=id_key, exclude = [])
----> 3     psm.logistic_ps(balance = True)
      4     psm.knn_matched(matcher='propensity_logit', replacement=True, caliper=None)
      5     return psm.matched_ids, psm

~/.local/lib/python3.6/site-packages/psmpy/psmpy.py in logistic_ps(self, balance)
    124 # sample a macthing 20 from the minor class
    125 minority_sample = minority.sample(
--> 126 n=20, random_state=self.seed)
    127 joint_df = pd.concat(
    128 [majority_leftover_all, minority_sample])

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in sample(self, n, frac, replace, weights, random_state, axis)
   4993 )
   4994
-> 4995 locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
   4996 return self.take(locs, axis=axis)
   4997

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'
I'm not sure why having one treated sample would be okay, while the error above seems to indicate I should have at least 20 treated samples? Thanks!
adriennekline commented
How many do you have in each cohort total?
adriennekline commented
In short, I have it do this so that the logistic regression isn't fit on a very small number of samples (e.g. 5), so I used 20. My guess is that you do not in fact have at least 20 samples in each of your groups.
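For anyone hitting the same error, here is a minimal sketch of what seems to be going on, using only pandas. It reproduces the `ValueError` from the traceback by sampling 20 rows without replacement from a group of 2, then shows a simple check you could run before calling `logistic_ps(balance=True)`. The helper `group_sizes_ok`, the constant `MIN_PER_GROUP`, and the toy data are made up for illustration and are not part of psmpy; the value 20 is taken from the `n=20` in the traceback above.

```python
import pandas as pd

# Assumption: 20 is the per-group sample size seen in the traceback (n=20).
MIN_PER_GROUP = 20


def group_sizes_ok(df, treatment_key="treatment", minimum=MIN_PER_GROUP):
    """Hypothetical helper: report whether every group present in the
    treatment column has at least `minimum` rows, plus the raw counts."""
    counts = df[treatment_key].value_counts()
    return bool((counts >= minimum).all()), counts.to_dict()


# Toy data: 2 treated rows and 5 control rows.
df = pd.DataFrame({
    "id": range(7),
    "treatment": [1, 1, 0, 0, 0, 0, 0],
    "score": [3, 8, 1, 4, 2, 9, 5],
})

# Sampling 20 rows without replacement from a 2-row group fails in pandas,
# which is the same ValueError shown at the bottom of the traceback.
treated = df[df["treatment"] == 1]
try:
    treated.sample(n=20, random_state=0)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Check group sizes up front instead of waiting for the sampling error.
ok, counts = group_sizes_ok(df)
print(ok, counts, error_message)
```

If the check fails, the options are to gather more samples per group or to skip the balanced fit. Whether `balance=False` avoids the sampling step is not confirmed in this thread, so that would need to be checked against the psmpy source.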