samplics-org/samplics

Question: Is it possible to draw a one-stage PPS sample? (no stratum)

cuchoi commented

I have a list of schools I want to sample proportionally to the number of students. How would I do this?

This is the code that I am using:

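# Fake data
import pandas as pd

# pps_design: a samplics SampleSelection object set up with a PPS method (construction not shown)
school_pop_df = pd.DataFrame(dict(id=(1, 2, 3), n_students=(10, 20, 100)))
n = 2
school_pop_df['samplics_prob'] = pps_design.inclusion_probs(
    samp_unit=school_pop_df['id'],
    samp_size=n,
    mos=school_pop_df['n_students'],
)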
  • If I leave stratum=None (the default), it throws an error:
def _anycertainty(
    210     samp_size: Union[DictStrInt, int],
    211     stratum: Optional[np.ndarray],
    212     mos: np.ndarray,
    213 ) -> bool:
    215     certainty = False
--> 216     if stratum.shape not in ((), (0,)) and isinstance(samp_size, dict):
    217         for s in np.unique(stratum):
    218             stratum_units = stratum == s

AttributeError: 'NoneType' object has no attribute 'shape'
  • If I use stratum=1, it runs and the result seems accurate.

But then, when I try to select the sample by running:

    pps_design.select(
            samp_unit=school_pop_df['id'],
            samp_size=n,
            stratum=1,
            mos=school_pop_df['n_students'])

I get an error saying that some clusters are certainties:

    770 elif _mos.shape not in ((), (0,)) and self.method in (
    771     SelectMethod.pps_brewer,
    772     SelectMethod.pps_hv,
   (...)
    775     SelectMethod.pps_sys,
    776 ):
    777     if self._anycertainty(samp_size=self.samp_size, stratum=_stratum, mos=_mos):
--> 778         raise CertaintyError("Some clusters are certainties.")
    780 _samp_ids = np.linspace(
    781     start=0, stop=_samp_unit.shape[0] - 1, num=_samp_unit.shape[0], dtype="int"
    782 )
    784 if remove_nan:

CertaintyError: Some clusters are certainties.

Hi @cuchoi

One of your clusters is much larger than the other two. It therefore becomes a certainty cluster, meaning its probability of inclusion is 1. You will have to handle it manually. The certainty unit is in the sample automatically. When you are selecting more than one unit, as you are here, exclude the certainty unit from the frame, treat it as selected, and sample the rest of the units from the remaining frame. In the future I have plans to handle this situation better but for now it is a manual process.
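
The manual process described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not samplics code: the helper names `split_certainties` and `pps_systematic` are hypothetical, and it assumes the usual unscaled PPS inclusion probability n × mos_i / Σ mos, flagging a unit as a certainty when that value reaches 1. For the example frame, school 3 gets 2 × 100 / 130 ≈ 1.54, so it is set aside first, and the one remaining draw is made from the other two schools.

```python
import random


def split_certainties(mos, n):
    """Set aside certainty units from a PPS frame (hypothetical helper).

    mos: dict mapping unit id -> measure of size
    n:   total sample size
    Returns (certainty_ids, remaining_mos, remaining_n).
    """
    remaining = dict(mos)
    certainties = []
    n_left = n
    # Repeat, because removing one certainty can turn another unit into one.
    while n_left > 0 and remaining:
        total = sum(remaining.values())
        flagged = [u for u, m in remaining.items() if n_left * m / total >= 1]
        if not flagged:
            break
        for u in flagged:
            certainties.append(u)
            del remaining[u]
        n_left -= len(flagged)
    return certainties, remaining, n_left


def pps_systematic(mos, n, rng):
    """Systematic PPS draw of n units; assumes no certainty units remain."""
    if n <= 0:
        return []
    step = sum(mos.values()) / n
    start = rng.random() * step  # random start in [0, step)
    targets = iter(start + k * step for k in range(n))
    target = next(targets, None)
    selected, cumulative = [], 0.0
    for unit, size in mos.items():
        cumulative += size
        # A unit is hit when a target falls inside its cumulative-size interval.
        while target is not None and target < cumulative:
            selected.append(unit)
            target = next(targets, None)
    return selected


frame = {1: 10, 2: 20, 3: 100}  # school id -> number of students
certain, rest, n_rest = split_certainties(frame, n=2)
rng = random.Random(42)
sample = certain + pps_systematic(rest, n_rest, rng)
print("certainty units:", certain)
print("full sample:", sample)
```

Under these assumptions, the certainty unit would carry an inclusion probability of 1, while each remaining unit would have probability n_rest × mos_i divided by the total size of the reduced frame.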

I will improve the code to better handle the case where stratum is None.

Best

cuchoi commented

That makes sense; thanks for the answer!

cuchoi commented

"In the future I have plans to handle this situation better but for now it is a manual process"

Do you have any reference implementations or papers? I could try submitting a PR.