Add datasets referenced in tutorial notebooks as part of the repo

Question

Closed this issue 3 years ago · 2 comments

I was trying to follow the examples in the tutorials, and noticed that several of the datasets referenced weren't available. This made it tough to replicate the notebook outputs and follow along.

For example, I wasn't able to find any of the following datasets when searching through the repo:

psu_frame.csv
expenditure_on_milk.csv
countycropareas.csv
countycropareas_means.csv
nhanes2f.csv
nmihs_bs.csv
nhanes2brr.csv
nhanes2jknife.csv

Perhaps a nice way of making these available to users would be to create a datasets module, as scikit-learn does with sklearn.datasets:

from samplics.datasets import load_psu_frame

psu_frame = load_psu_frame()
psu_frame.head(25)

Thanks for all your hard work on this project!

Answer 1 · 2021-04-17T15:01:12.000Z

Yes, I agree that it will improve the experience. I will add a datasets module.

Answer 2 · 2021-04-20T12:33:19.000Z

Added a module to allow users to load the tutorial data.

   import samplics

   # Import the loader function.
   from samplics.datasets import load_psu_frame

   # Load the dataset and its metadata into the dictionary psu_frame_dict
   psu_frame_dict = load_psu_frame()

   # Store the dataset itself in the variable psu_frame (optional)
   psu_frame = psu_frame_dict["data"]
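
For readers curious how a loader of this shape might work, here is a minimal sketch. It is not the samplics implementation: the function name `load_psu_frame_sketch`, the metadata keys other than "data", and the toy CSV rows are all assumptions made up for illustration. To stay dependency-free it parses the rows with the standard `csv` module, whereas a real loader would more likely return a pandas DataFrame.

```python
import csv
import io

# Hypothetical stand-in for a CSV file shipped inside a package.
# The column names and values are invented for this example.
_PSU_FRAME_CSV = """cluster,region,number_households
1,North,105
2,North,85
3,South,95
"""


def load_psu_frame_sketch():
    """Return a dict holding the dataset rows plus some metadata."""
    rows = list(csv.DictReader(io.StringIO(_PSU_FRAME_CSV)))
    return {
        # Metadata keys below are assumptions, not the samplics schema.
        "name": "PSU frame (illustrative)",
        "description": "Toy rows standing in for psu_frame.csv",
        "nrows": len(rows),
        "ncols": len(rows[0]) if rows else 0,
        # The dataset itself sits under "data", mirroring the usage above.
        "data": rows,
    }


psu_frame_dict = load_psu_frame_sketch()
psu_frame = psu_frame_dict["data"]
print(psu_frame_dict["nrows"], psu_frame[0]["region"])
```

Returning a dictionary rather than the bare table lets the loader carry a description and basic dimensions alongside the data, which is roughly the pattern the scikit-learn loaders follow.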