samplics-org/samplics

Add datasets referenced in tutorial notebooks as part of the repo

Closed this issue · 2 comments

rchew commented

I was trying to follow the examples in the tutorials, and noticed that several of the datasets referenced weren't available. This made it tough to replicate the notebook outputs and follow along.

For example, I wasn't able to find any of the following datasets when searching through the repo:

  • psu_frame.csv
  • expenditure_on_milk.csv
  • countycropareas.csv
  • countycropareas_means.csv
  • nhanes2f.csv
  • nmihs_bs.csv
  • nhanes2brr.csv
  • nhanes2jknife.csv

Perhaps a nice way to make these available to users would be a datasets module, like the one in scikit-learn:

   from samplics.datasets import load_psu_frame

   psu_frame = load_psu_frame()
   psu_frame.head(25)

Thanks for all your hard work on this project!

Yes, I agree that it will improve the experience. I will add a datasets module.

Added a module to allow users to load the tutorial data.

   import samplics

   # Import the loader function for the dataset you need.
   from samplics.datasets import load_psu_frame

   # Load the dataset and its metadata into the dictionary psu_frame_dict
   psu_frame_dict = load_psu_frame()

   # Store the dataset itself in the variable psu_frame (optional)
   psu_frame = psu_frame_dict["data"]
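
For anyone curious how a loader that returns a dictionary with a "data" key might work, here is a rough sketch. It is not the samplics implementation: `load_dataset`, the metadata keys other than "data", and the sample columns and values are all made up for illustration, and it uses the standard-library csv module instead of pandas so it runs without extra dependencies.

```python
import csv
import tempfile
from pathlib import Path


def load_dataset(csv_path, name, description=""):
    """Hypothetical loader: read a CSV file and return data plus metadata."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # one dict per row, keyed by column name
    return {
        "name": name,
        "description": description,
        "nrows": len(rows),
        "ncols": len(rows[0]) if rows else 0,
        "data": rows,
    }


# Demo with a tiny stand-in file; the columns and values are invented,
# not the contents of the real psu_frame.csv.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "psu_frame.csv"
    path.write_text(
        "cluster,region,households\n1,North,105\n2,South,98\n",
        encoding="utf-8",
    )
    psu_frame_dict = load_dataset(path, name="PSU frame")

# Same access pattern as above: the rows live under the "data" key.
psu_frame = psu_frame_dict["data"]
print(psu_frame_dict["nrows"], psu_frame[0]["region"])
```

Returning a dictionary rather than the bare table leaves room for metadata (name, description, dimensions) next to the data, which is presumably why the "data" key is needed in the snippet above.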