CartoDB/cartoframes

Feature Request: DO downloads should include sample option

Closed · 1 comment

DO Dataset downloads can be pretty big and oftentimes people don't want the whole thing. Here's an example:
(Screenshot: Screen Shot 2020-01-10 at 3 33 19 PM)

To enable this, it'd be great to add filters and sampling options, like this:

# download 10 percent, randomly sampled
ds_sample = dataset.to_dataframe(sample_frac=0.1)
dataset.to_csv(sample_frac=0.1)

# download 1k records
ds_sample = dataset.to_dataframe(n_rows=1000)
dataset.to_csv(n_rows=1000)

Going further, it'd be really wonderful to apply filters just like we do for enrichment so that we could get only the data we need (e.g., by category, numeric range, geographic area, etc.)

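Pandas has a method for sampling: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
and one for filtering by label: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html (note that DataFrame.filter selects rows or columns by label, not by value; value-based row filters use boolean indexing or DataFrame.query)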

This could be extended to selecting only some columns in addition to applying filters to select only some rows.
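
For reference, here is a rough sketch of what those options would mean in plain pandas terms, applied client-side to a dataframe that has already been downloaded. The toy data and the column names (geoid, category, population) are made up for illustration and are not part of any DO dataset; the point of the request is to get the same result without downloading everything first.

```python
import pandas as pd

# Toy stand-in for a downloaded dataset; the column names are hypothetical.
df = pd.DataFrame({
    "geoid": [f"g{i}" for i in range(100)],
    "category": ["a", "b"] * 50,
    "population": range(100),
})

# 10 percent of the rows, randomly sampled (random_state makes it reproducible)
sample_10pct = df.sample(frac=0.1, random_state=0)

# a fixed number of randomly sampled rows
sample_20 = df.sample(n=20, random_state=0)

# filter rows by category and numeric range, and keep only some columns
subset = df.loc[
    (df["category"] == "a") & (df["population"].between(10, 50)),
    ["geoid", "population"],
]
```

Doing this after the download still transfers the whole dataset, so it only illustrates the intended semantics of sample_frac, n_rows, and the filters.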

cc @cmongut

This can already be done using the new sql_query param to filter the dataset.
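
A minimal sketch of how such a query might be put together, for anyone landing here. The helper build_sql_query is hypothetical and not part of cartoframes, the column names are made up, and both the $dataset$ table placeholder and the to_dataframe(sql_query=...) call are assumptions about how the param is used, so check the cartoframes docs for the exact form.

```python
def build_sql_query(columns=None, where=None, limit=None):
    """Build a SELECT string for a DO dataset download (hypothetical helper).

    Assumes the table name is written as the placeholder $dataset$.
    """
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM $dataset$"
    if where:
        query += f" WHERE {where}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# only two columns, only rows above a threshold, at most 1k records
query = build_sql_query(
    columns=["geoid", "population"],
    where="population > 100",
    limit=1000,
)

# Assumed usage, not run here because it needs credentials and a subscription:
# ds_sample = dataset.to_dataframe(sql_query=query)
```

This covers the column selection, row filters, and n_rows cases from the request; random sampling would depend on what the underlying SQL engine supports.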