Add non-geoscience example datasets
dcherian opened this issue · 11 comments
Xarray is a great tool for neuroscience research, since we typically gather data involving multiple dimensions (trials, days, animals, conditions, etc.).
The Allen Institute provides an SDK for reading and processing such data, along with an "observatory" that contains relevant data (https://allensdk.readthedocs.io/en/latest/).
Hello @rsatapat, can we add a subset of the data to xarray-data for future tutorials? Are there any concerns about adding a subset of the data for tutorials?
Relevant content from @jsiegle: https://xarray.dev/blog/xarray-for-neurophysiology
Just keeping a list of some other examples here
Already using Xarray:
- fish surveys! https://osoceanacoustics.github.io/echopype-examples/ms_PacificHake_EK60_cruisetracks.html
- mouse brain image stacks! https://squidpy.readthedocs.io/en/stable/notebooks/tutorials/tutorial_image_container_zstacks.html
Would require modification to use xarray instead of numpy or custom objects:
- 3D fluorescence microscopy image of cells https://scikit-image.org/docs/stable/api/skimage.data.html#skimage.data.cells3d
- Horseshoe nebula https://learn.astropy.org/tutorials/FITS-images.html
Would be interesting to look at modifying some of these examples to see if Xarray would work well in place of straight numpy arrays https://numpy.org/numpy-tutorials/ ... also it's an excellent repository overall
Brainstormed a bit more on this today with @TomNicholas. There are really two separate things to accomplish:
- Just highlight (visually) a few non-geoscience example data structures in the tutorial and Xarray docs to make it clear that Xarray is flexible and relevant to different domains. So from the genomic surveillance example above:
- "a set of genotype calls obtained from sequencing some mosquitoes. These data can be stored as a 3-dimensional array, where one dimension of the array corresponds to positions (variants) within a reference genome, another dimension corresponds to the individual mosquitoes that were sequenced (samples), and a third dimension corresponds to the number of genomes within each individual (ploidy)."
Note: On one hand it's nice to re-use the existing graphic and actual dataset, but we could simplify even further by reducing the size, adding dimension labels to the image on the left, dropping "alleles", and running set_index() on the DataArray on the right so the two match up easily!
- Bespoke formats (text or binary) are pervasive (not HDF, Zarr, netCDF, TIF). It would be great to add an example that coerces such a format into Xarray and does a simple, useful visualization or computation.
- NumPy .npz files + metadata, which can be opened into xarray variables easily. Many people definitely still use .npz, but which example in the wild to use?
- A collection of X-ray images could work (https://numpy.org/numpy-tutorials/content/tutorial-x-ray-image-processing.html), but to be really useful the example should illustrate labeling (and ultimately selection) by physical coordinates, so we would have to invent some (patientID, x_distance (mm)).
- This would segue nicely into the docs on building a custom backend: https://tutorial.xarray.dev/advanced/backends/backends.html
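To make the two ideas above concrete, here is a minimal sketch (not an agreed-upon tutorial) that combines them: synthetic genotype calls with the three dimensions quoted above (variants, samples, ploidy) are written to a .npz file together with coordinate arrays and a JSON metadata string, then loaded into a labelled `xarray.DataArray`. The file layout, the key names (`calls`, `positions`, `samples`, `metadata`), the sample IDs, the positions, and the helper `open_npz_as_dataarray` are all invented for illustration, and the data are random.

```python
import io
import json

import numpy as np
import xarray as xr

# --- Write a hypothetical "bespoke" file: raw arrays plus JSON metadata ---
rng = np.random.default_rng(0)
n_variants, n_samples, ploidy = 6, 4, 2
# Synthetic calls: 0 = reference allele, 1 = alternate allele.
calls = rng.integers(0, 2, size=(n_variants, n_samples, ploidy), dtype=np.int8)

buffer = io.BytesIO()  # stands in for a file on disk
np.savez(
    buffer,
    calls=calls,
    positions=np.array([101, 250, 377, 512, 640, 903]),  # invented positions
    samples=np.array(["mosq_A", "mosq_B", "mosq_C", "mosq_D"]),  # invented IDs
    metadata=np.array(json.dumps({"description": "synthetic genotype calls"})),
)
buffer.seek(0)


def open_npz_as_dataarray(fileobj):
    """Load the hypothetical .npz layout above into a labelled DataArray."""
    with np.load(fileobj) as npz:
        return xr.DataArray(
            npz["calls"],
            dims=("variants", "samples", "ploidy"),
            coords={
                "variants": npz["positions"],
                "samples": npz["samples"],
            },
            name="genotype_calls",
            attrs=json.loads(npz["metadata"].item()),
        )


genotypes = open_npz_as_dataarray(buffer)

# Label-based selection: one mosquito, variants within a position range.
subset = genotypes.sel(samples="mosq_B", variants=slice(200, 700))

# A simple computation: alternate-allele count per variant and sample.
alt_count = genotypes.sum(dim="ploidy")
print(subset.shape, alt_count.dims)
```

The loader function is kept separate on purpose: a function that maps a file to labelled xarray objects is roughly the piece a custom backend would wrap, which is where the backend docs linked above would pick up.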
https://docs.google.com/forms/d/1x9bOIelnUsDMyI1tF4bN7TWK0v4nBDiwhpxh9mi6PaI/edit#responses
One of the user survey responses specifically calls this out:
Examples with Astropy to read FITS files, using Astropy Tables
Some renewed activity in this repository that seems relevant! ratt-ru/xarray-fits#26
@tomwhite mentioned that the sgkit file openers / converters are actually about to be deprecated in favour of a new package called bio2zarr. Basically their motivation is that the text-based VCF format etc. is so awfully designed that efficient access via a kerchunk-like approach is basically impossible, so they end up having to convert it to Zarr anyway.
The VCF conversion code in sgkit and the new bio2zarr project both output the same Zarr format (specified here). The reason for bio2zarr is that users were struggling to get the Dask-based sgkit VCF conversion working reliably, so the code was rewritten as a command-line application that runs on multi-core local machines or HPC schedulers, and bio2zarr is the result.
There are a couple of example sgkit tutorials that may be of interest here: https://sgkit-dev.github.io/sgkit/latest/examples/index.html