/python-rep-resampling

This takes any Pandas or Dask dataframe and returns a resampled Dask dataframe that simulates the sampling distribution of your data in one line of code. It is similar to the rep_sample_n() function from the infer package in R, but built for quickly simulating a large number of replicate samples, even with a large number of observations per replicate. The returned dataframe contains 'rep' replicates of 'n' observations each and is grouped by rep. Aggregate operations such as df['column'].mean().compute() or df['column'].std().compute() run in parallel by default and return a pandas Series with one value per sample replicate (for example, the mean of each replicate). You can do almost anything with it that you can with a Pandas DataFrame grouped by the same column; you just have to append .compute() to the method call, because the computation is deferred and run in parallel via futures. See the excerpts in the examples.
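To make the idea concrete, here is a minimal sketch of replicate resampling in plain pandas. It is not this repository's code: the helper name `rep_sample_n`, its parameters (`n`, `reps`, `replace`, `seed`), and the `rep` column name are assumptions chosen for illustration, and the sketch runs eagerly in pandas rather than lazily in Dask.

```python
import numpy as np
import pandas as pd


def rep_sample_n(df, n, reps, replace=True, seed=None):
    """Hypothetical helper: draw `reps` samples of `n` rows each from `df`.

    Returns one long dataframe with a `rep` column identifying the replicate.
    """
    rng = np.random.default_rng(seed)
    if replace:
        # Sample row positions with replacement for all replicates at once.
        positions = rng.integers(0, len(df), size=reps * n)
    else:
        # Without replacement, each replicate is the head of a fresh permutation.
        positions = np.concatenate(
            [rng.permutation(len(df))[:n] for _ in range(reps)]
        )
    out = df.iloc[positions].reset_index(drop=True)
    out.insert(0, "rep", np.repeat(np.arange(reps), n))
    return out


# Toy data: 100 observations of a single numeric column.
data = pd.DataFrame({"value": np.arange(100, dtype=float)})

# 200 replicates of 30 observations each.
resampled = rep_sample_n(data, n=30, reps=200, seed=0)

# One mean per replicate: an empirical sampling distribution of the mean.
rep_means = resampled.groupby("rep")["value"].mean()
```

To get a Dask version of the same workflow, one option would be to wrap the result with `dask.dataframe.from_pandas(resampled, npartitions=...)` and append `.compute()` to the aggregation, which matches the calling pattern described above.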

Primary Language: Jupyter Notebook
