adriennekline/psmpy

Minimum number of treated samples

charlottelambert opened this issue · 2 comments

When my data frame has only one treated sample and at least one control sample, I'm able to use psmpy just fine. However, if I have two treated samples, I get the following error.

<ipython-input-475-f46777f657a4> in make_matches(df, treatment_key, id_key, match_columns)
      1 def make_matches(df, treatment_key='treatment', id_key='id', match_columns=['id', 'created_utc', 'treatment', 'score']):
      2     psm = PsmPy(df[match_columns], treatment=treatment_key, indx=id_key, exclude = [])
----> 3     psm.logistic_ps(balance = True)
      4     psm.knn_matched(matcher='propensity_logit', replacement=True, caliper=None)
      5     return psm.matched_ids, psm

~/.local/lib/python3.6/site-packages/psmpy/psmpy.py in logistic_ps(self, balance)
    124                     # sample a macthing 20 from the minor class
    125                     minority_sample = minority.sample(
--> 126                         n=20, random_state=self.seed)
    127                     joint_df = pd.concat(
    128                         [majority_leftover_all, minority_sample])

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in sample(self, n, frac, replace, weights, random_state, axis)
   4993             )
   4994 
-> 4995         locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
   4996         return self.take(locs, axis=axis)
   4997 

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'

I'm not sure why a single treated sample works, while the error above seems to indicate that I need at least 20 treated samples. Thanks!

How many do you have in each cohort total?

In short, I have it do this so that the logistic regression isn't fit on a very small number of samples (e.g. 5), so I used 20. My guess is that you don't actually have 20 samples in each of your groups.
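For anyone hitting the same error, here is a minimal sketch of a pre-check you could run on your data frame before calling psmpy. It uses only pandas. The helper names (`group_sizes`, `has_enough_samples`) and the `MIN_PER_GROUP` constant are hypothetical and not part of psmpy; the value 20 is simply taken from the `n=20` visible in the traceback, and the 0/1 coding of the treatment column is an assumption.

```python
import pandas as pd

# Assumption: mirrors the n=20 seen in the traceback; not a psmpy setting.
MIN_PER_GROUP = 20


def group_sizes(df, treatment_key="treatment"):
    """Return the number of rows in each treatment group as a dict."""
    return df[treatment_key].value_counts().to_dict()


def has_enough_samples(df, treatment_key="treatment", minimum=MIN_PER_GROUP):
    """True if both the treated (1) and control (0) groups have >= `minimum` rows."""
    sizes = group_sizes(df, treatment_key)
    return all(sizes.get(label, 0) >= minimum for label in (0, 1))


# Toy data: 2 treated rows and 30 control rows, similar to the failing case.
df = pd.DataFrame({
    "id": range(32),
    "treatment": [1] * 2 + [0] * 30,
    "score": range(32),
})

print(group_sizes(df))
print(has_enough_samples(df))

# The underlying pandas behaviour: asking for 20 rows from a 2-row frame
# without replacement raises a ValueError, as in the traceback above.
minority = df[df["treatment"] == 1]
try:
    minority.sample(n=20, random_state=0)
except ValueError as err:
    print(err)
```

If `has_enough_samples(df)` returns False, the data set is probably too small for the balanced fit, and it may be worth collecting more samples before matching.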