Documentation details for Sample Size calculation are limited

Question


Closed this issue 6 months ago · 4 comments

I am trying to work my way through the documentation to figure out how to create a sample that is equally distributed across strata. If the sample size is calculated to be 200 and I have 10 strata, I want 20 units selected from each stratum. I have tried to follow the example in the documentation for calculating sample size, but I am failing to understand how this is applied to a frame (pandas data frame), or whether it needs to be. Or do I wrangle the data frame myself beforehand to provide things such as pop_size to the SampleSize object? I tried reading the code and found references to other methods like 'equal' and 'total', but I am not sure how to use or reference them. Is there something missing from the documentation, or could somebody walk me through this a bit more?

Answer 1 · 2023-12-08T11:31:32.000Z

Hi @mondjef

I am not sure what you need. If the strata have the same specifications, then the sample size calculation will produce the same number for all the strata. However, if you have a given sample size that you will allocate equally, then it is an allocation issue. In this case it is very easy, since it is the same number for every stratum. Using a library for equal allocation seems overkill. Is this what you are trying to do?
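
As a minimal sketch of the equal allocation described above, plain pandas is enough. The frame, the column names `unit_id` and `stratum`, and the sizes are hypothetical and chosen only for illustration; `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical sampling frame: 1000 units spread over 10 strata.
frame = pd.DataFrame(
    {"unit_id": range(1000), "stratum": [i % 10 for i in range(1000)]}
)

total_n = 200                               # overall sample size
n_strata = frame["stratum"].nunique()       # 10 strata in this frame
per_stratum = total_n // n_strata           # equal allocation: 20 per stratum

# Simple random sample without replacement inside each stratum.
sample = frame.groupby("stratum").sample(n=per_stratum, random_state=0)
```

If the total does not divide evenly by the number of strata, the remainder has to be assigned by some rule of your choosing; this sketch does not handle that case.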

Obviously, I did not understand your question. Please clarify what you need maybe with a complete (simplified) use case.

Answer 2 · 2023-12-08T13:18:37.000Z

A bit more context....
We have an existing process that uses SAS to manually calculate the sample size, which is then divided equally among 20 strata. The same process is also used to produce estimates (via proc surveyfreq, which is Taylor linearization based). There has been an organizational shift away from SAS to open-source tools such as Python/R, and it is the estimation features of Samplics that led me to this library...
Currently in Python this is the formula that is used to determine sample size:

math.ceil((z**2)*(p*(1-p))/((0.15**2)+((z**2)*(p*(1-p)))/freq))
where z is the z-score, set to 1.645 (alpha=0.1), p is the expected prevalence, set to 0.1, 0.15 is the margin of error, and freq = # stratum.
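
For readability, the same formula can be wrapped in a small function. This is only a restatement of the expression above, not Samplics code; the function name `wald_sample_size` and the example value `freq=100_000` are assumptions made for illustration.

```python
import math


def wald_sample_size(z: float, p: float, margin: float, freq: float) -> int:
    """Sample size for a proportion, using the formula quoted above."""
    base = (z ** 2) * p * (1 - p)          # z^2 * p * (1 - p)
    return math.ceil(base / (margin ** 2 + base / freq))


# Illustrative inputs: z=1.645 (alpha=0.1), p=0.1, margin of error 0.15.
n = wald_sample_size(z=1.645, p=0.1, margin=0.15, freq=100_000)
print(n)  # 11
```

When `freq` is large, the `base / freq` term becomes negligible and the result approaches `base / margin**2` rounded up.
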
With respect to using and adopting Samplics, there is a lack of supporting documentation, both in general and within the functions themselves, to help understand how the input parameter names used in its functions relate to the terms used in the literature, which sometimes vary. Through a lot of trial and error I was able to replicate the sample size calculation overall and at the stratum level with the following:

from samplics.sampling import SampleSize
from samplics.utils.types import SizeMethod, PopParam

sample_size = SampleSize(param = PopParam.mean, method=SizeMethod.wald, strat=True)
sample_size.calculate(half_ci=0.05, alpha=0.1, target=0.90, sigma=0.1, pop_size=frame_size)

But I am not completely sure that I have mapped the values to the correct Samplics parameters...

For estimation, I was able to reproduce the Taylor-based variance estimation that was previously done using SAS (proc surveyfreq). However, it would be helpful to have more information about what the input data should look like for proportion/mean versus total versus ratio, to better understand under which circumstances each would or could be used. For this particular survey program there is a requirement to produce both estimates per cycle and rolling estimates over a dynamic window of N cycles (e.g. the last 12 cycles or 6 cycles). For this reason, and for storage and performance reasons, I would like to store only the bare minimum needed to produce these estimates (i.e. aggregated data) rather than the results of each surveyed unit. Can one of the Samplics estimation methods be used against data in this format? I was thinking maybe the ratio one would work, but I did not have any success....
Yes, using a library just for an equally allocated stratified sample is a bit overkill, but as I am already going to use it for the Taylor-based estimation, I figure I might as well leverage the sample size calculation and weight adjustment features to simplify the code base overall.
Thanks for this library and thanks for the assistance.

Answer 3 · 2023-12-08T17:54:06.000Z

Certainly, the documentation is not up to the desired standard at the moment. I'm in the process of redesigning certain APIs, and as a result, I've opted to postpone the documentation for the time being. However, progress is not as swift as I'd prefer it to be.

Below, I have provided some examples. I am not confident they address your issues. Do not hesitate to provide more information if I have not answered your concerns.

The input data is just vectors (numpy array or pandas series or lists or tuples). If your cycles are independent then no problem. If your estimates use previous cycles then I will need more information to understand your use case.

I updated the sample size calculation code. It is for a proportion (PopParam.prop), stratified, with a precision (half confidence interval, half_ci) of 0.15, etc. The number of strata is 5. Because I did not provide the information by stratum using Python dictionaries, it will treat the information as the same for all the strata.

from samplics.sampling import SampleSize
from samplics.utils.types import PopParam, SizeMethod

sample_size = SampleSize(param=PopParam.prop, method=SizeMethod.wald, strat=True)
sample_size.calculate(half_ci=0.15, alpha=0.1, target=0.1, pop_size=100000, nb_strata=5)

sample_size.samp_size

We get the following sample sizes for the strata

{
 '_stratum_1': 11,
 '_stratum_2': 11,
 '_stratum_3': 11,
 '_stratum_4': 11,
 '_stratum_5': 11
}
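
One way to apply a dictionary of this shape to a pandas frame is sketched below. It uses only pandas, not Samplics's own selection classes; the frame, its column names, and the stratum labels are hypothetical, and the dictionary is hard-coded to match the output above. The weight is the usual N_h / n_h design weight, which assumes equal-probability selection within each stratum.

```python
import pandas as pd

# Hypothetical frame: 5 strata of 40 units each.
frame = pd.DataFrame(
    {
        "unit_id": range(200),
        "stratum": [f"_stratum_{i % 5 + 1}" for i in range(200)],
    }
)

# Per-stratum sample sizes, in the same shape as the output above.
samp_size = {f"_stratum_{h}": 11 for h in range(1, 6)}

# Draw n_h units without replacement inside each stratum.
parts = [
    grp.sample(n=samp_size[name], random_state=42)
    for name, grp in frame.groupby("stratum")
]
sample = pd.concat(parts)

# Design weight N_h / n_h for each sampled unit.
pop_counts = frame["stratum"].value_counts()
sample["samp_weight"] = sample["stratum"].map(
    lambda h: pop_counts[h] / samp_size[h]
)
```

The resulting `samp_weight` column is the kind of vector that can be passed as `samp_weight` in the estimation examples further down.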

Generate data for the examples below.

import numpy as np

x_sim = 5*np.random.randn(50)
y_sim = x_sim + np.random.rand(50)*10.45
w_sim = np.random.choice([4,5,10,12,15], 50)

In the examples below, I commented out some of the design information. You can use them as needed.
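
The names `str_sim` and `psu_sim` in the commented-out lines are not defined in this thread. A minimal sketch of what such vectors could look like for the 50 simulated units follows; the layout (5 strata, 2 PSUs per stratum, 5 units per PSU) is an assumption chosen only so that every stratum contains more than one PSU.

```python
import numpy as np

n_units = 50
n_strata = 5

# 10 consecutive units per stratum: 1,...,1, 2,...,2, ..., 5,...,5
str_sim = np.repeat(np.arange(1, n_strata + 1), n_units // n_strata)

# Two PSUs of 5 units each inside every stratum.
psu_within = np.tile(np.repeat([1, 2], 5), n_strata)

# Make PSU labels unique across strata: 11, 12, 21, 22, ..., 51, 52
psu_sim = str_sim * 10 + psu_within
```

With these defined, the `stratum=` and `psu=` lines in the examples can be uncommented.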

# Mean 

from samplics.estimation import TaylorEstimator

mean_str = TaylorEstimator(param="mean", alpha=0.1)
mean_str.estimate(
    y=y_sim,
    samp_weight=w_sim,
    # stratum=str_sim,
    # psu=psu_sim,
    remove_nan=True,
)

print(mean_str)

# Total 

from samplics.estimation import TaylorEstimator

total_str = TaylorEstimator(param="total", alpha=0.1)
total_str.estimate(
    y=y_sim,
    samp_weight=w_sim,
    # stratum=str_sim,
    # psu=psu_sim,
    remove_nan=True,
)

print(total_str)

For the ratio estimator, we need both y and x to estimate Y/X

# Ratio 

from samplics.estimation import TaylorEstimator

ratio_str = TaylorEstimator(param="ratio", alpha=0.1)
ratio_str.estimate(
    y=y_sim,
    x=x_sim,
    samp_weight=w_sim,
    # stratum=str_sim,
    # psu=psu_sim,
    remove_nan=True,
)

print(ratio_str)
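
For orientation, the point estimates that the mean, total, and ratio parameters target can be written in plain NumPy using the standard weighted formulas. This sketch is not Samplics's internal code and does not cover variance estimation; the small arrays are made up so the arithmetic is easy to follow.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])   # study variable
x = np.array([2.0, 4.0, 6.0])   # auxiliary variable (ratio denominator)
w = np.array([1.0, 1.0, 2.0])   # sampling weights

total_y = np.sum(w * y)                  # weighted total of y
mean_y = np.sum(w * y) / np.sum(w)       # weighted mean of y
ratio_yx = np.sum(w * y) / np.sum(w * x) # estimated Y / X

print(total_y, mean_y, ratio_yx)  # 9.0 2.25 0.5
```

A proportion is the weighted mean of a 0/1 indicator, which is why the mean and proportion cases take the same kind of input.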

Answer 4 · 2024-01-19T14:17:08.000Z

Thank you @MamadouSDiallo, I was able to get everything to work using the information you provided. I have tested our Python pipeline against what we previously did with SAS, and the results are exactly the same or very close (some rounding differences, I think). I will mark this resolved and will certainly promote the library within my organization, which may lead to assistance with the code base.