Sharing data sets between chapters
Opened this issue · 4 comments
From Debra's email: Matt Rocklin suggested using some data sets in common through the book, so feel free to coordinate with others on the project. The Dask chapter will also be written using the data and projects described in some of the other chapters.
@mrocklin do you have an overview of data sets already in use? For the SciPy chapter we'd be happy to reuse something as well.
> @mrocklin do you have an overview of data sets already in use? For the SciPy chapter we'd be happy to reuse something as well.
I personally have no exposure to what people have been doing. I like the idea of coordinating on datasets and examples, but have made no concrete steps in this direction.
Perhaps this issue is such a step? If others are around it might be interesting to list both our constraints for datasets for our sections as well as some datasets that we know about and appreciate.
For example for dask we have the following constraints:
- It is useful if the data is inconveniently large, so that parallelism or out-of-memory approaches can be relevant.
- It is useful if functions used in other examples are serializable (this is usually the case).
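The serializability constraint is easy to check up front by round-tripping a function through pickle. A minimal sketch (the `mean_fare` helper is hypothetical, standing in for a per-partition function on something like the taxi data):

```python
import pickle

def mean_fare(fares):
    # Hypothetical per-partition function, e.g. averaging taxi fares
    return sum(fares) / len(fares)

def is_picklable(obj):
    """Return True if obj survives a pickle round-trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# Module-level functions pickle by reference, so they ship to workers fine;
# lambdas and locally defined closures do not pickle with the stdlib.
print(is_picklable(mean_fare))
print(is_picklable(lambda x: x))
```

Dask itself uses cloudpickle, which also handles lambdas and closures, so the stdlib check above is conservative.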
Datasets that we've frequently used in tutorials and examples include the following:
- The NYC Taxi dataset
- Various meteorology datasets, in particular ECMWF has public downloads
- Airlines
- ...
> Perhaps this issue is such a step?

+1
For SciPy we are pretty flexible in terms of datasets to use. We do need:
- time series data, for IIR/FIR filtering functionality. EDIT: we've now added a data set for this, pressure measurements: pressure.dat
- one dataset that is large enough for using scipy.LowLevelCallable sensibly
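For the IIR/FIR point, the kind of workflow the time-series data set needs to support can be sketched as follows (synthetic signal here; nothing is assumed about the actual contents of pressure.dat, and the sampling rate is made up):

```python
import numpy as np
from scipy import signal

fs = 100.0                               # assumed sampling rate, Hz
t = np.arange(0, 10, 1 / fs)
clean = np.sin(2 * np.pi * 1.0 * t)      # 1 Hz component we want to keep
rng = np.random.default_rng(0)
noisy = clean + 0.5 * rng.standard_normal(t.size)

# 4th-order Butterworth low-pass IIR filter with a 5 Hz cutoff,
# applied forwards and backwards (filtfilt) for zero phase distortion
b, a = signal.butter(4, 5.0, btype="low", fs=fs)
filtered = signal.filtfilt(b, a, noisy)

print(filtered.shape)  # filtfilt preserves the input length
```

Any evenly sampled measurement series works for this; the pressure data is just one convenient choice.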