Sharing data sets between chapters
Opened this issue · 4 comments
From Debra's email: Matt Rocklin suggested using some data sets in common through the book, so feel free to coordinate with others on the project. The Dask chapter will also be written using the data and projects described in some of the other chapters.
@mrocklin do you have an overview of data sets already in use? For the SciPy chapter we'd be happy to reuse something as well.
> @mrocklin do you have an overview of data sets already in use? For the SciPy chapter we'd be happy to reuse something as well.
I personally have no exposure to what people have been doing. I like the idea of coordinating on datasets and examples, but have made no concrete steps in this direction.
Perhaps this issue is such a step? If others are around it might be interesting to list both our constraints for datasets for our sections as well as some datasets that we know about and appreciate.
For example for dask we have the following constraints:
- It is useful if the data is inconveniently large, so that parallelism or out-of-memory approaches can be relevant.
- It is useful if functions used in other examples are serializable (this is usually the case).
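The serializability constraint is easy to check up front by round-tripping a function through pickle. A minimal sketch (the `mean_fare` helper is hypothetical, standing in for a per-partition function on something like the taxi data):

```python
import pickle

def mean_fare(fares):
    # Hypothetical per-partition function, e.g. averaging taxi fares
    return sum(fares) / len(fares)

def is_picklable(obj):
    """Return True if obj survives a pickle round-trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# Module-level functions pickle by reference, so they ship to workers fine;
# lambdas and locally defined closures do not pickle with the stdlib.
print(is_picklable(mean_fare))
print(is_picklable(lambda x: x))
```

Dask itself uses cloudpickle, which also handles lambdas and closures, so the stdlib check above is conservative.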
Datasets that we've frequently used in tutorials and examples include the following:
- The NYC Taxi dataset
- Various meteorology datasets, in particular ECMWF has public downloads
- Airlines
- ...
> Perhaps this issue is such a step?

+1
For SciPy we are pretty flexible in terms of datasets to use. We do need:
- time series data, for IIR/FIR filtering functionality. EDIT: we've now added a data set for this, pressure measurements: pressure.dat
- one dataset that is large enough for using scipy.LowLevelCallable sensibly
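For the IIR/FIR point, the kind of workflow the time-series data set needs to support can be sketched as follows (synthetic signal here; nothing is assumed about the actual contents of pressure.dat, and the sampling rate is made up):

```python
import numpy as np
from scipy import signal

fs = 100.0                               # assumed sampling rate, Hz
t = np.arange(0, 10, 1 / fs)
clean = np.sin(2 * np.pi * 1.0 * t)      # 1 Hz component we want to keep
rng = np.random.default_rng(0)
noisy = clean + 0.5 * rng.standard_normal(t.size)

# 4th-order Butterworth low-pass IIR filter with a 5 Hz cutoff,
# applied forwards and backwards (filtfilt) for zero phase distortion
b, a = signal.butter(4, 5.0, btype="low", fs=fs)
filtered = signal.filtfilt(b, a, noisy)

print(filtered.shape)  # filtfilt preserves the input length
```

Any evenly sampled measurement series works for this; the pressure data is just one convenient choice.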