Utils for loading HuBMAP data formats

🚄 vitessce-data

Utils to pre-process data for Vitessce.

Sample datasets come from:

JSON is our target format right now because it is easily read by JavaScript, and it is not so inefficient as to cause problems with storage or processing. For example, the mRNA HDF5 file is 30M, but as JSON it is still only 37M.
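As a rough illustration of that trade-off, the sketch below serializes a small table of counts to compact JSON and compares its size to a packed binary encoding. The gene names and values are hypothetical, invented for this example, and the code does not reproduce the 30M and 37M figures above; it only shows one way to make such a comparison with the standard library.

```python
import json
from array import array

# Hypothetical data (not from the real datasets): gene name -> per-cell counts.
counts = {
    "gene_a": [0, 3, 1, 0, 12],
    "gene_b": [5, 0, 0, 2, 1],
}

# Compact separators drop the whitespace that json.dumps adds by default.
compact = json.dumps(counts, separators=(",", ":"))
json_size = len(compact.encode("utf-8"))

# Size of the same values packed as native C ints, ignoring the gene names.
binary_size = sum(len(array("i", values).tobytes()) for values in counts.values())

# JSON round-trips without loss for integers and string keys.
restored = json.loads(compact)

print(f"JSON: {json_size} bytes, packed ints: {binary_size} bytes")
```

Small integers take only a few characters each in JSON, which is one plausible reason the overhead stays modest for count data.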

Install

vitessce-data requires Python 3. First, set up a clean environment. If you are using conda:

conda create python=3.6 -n vitessce-data
# Confirm install, then:
source activate vitessce-data

Then install dependencies with pip:

pip install -r requirements.txt
pip install -r requirements-dev.txt

Develop and run

  • test.sh exercises all the scripts, using the fixtures in fake-files/, and fails if the output differs from what is expected.
  • process.sh downloads the full data from the internet, caches these input files in big-files/input, processes them, caches the output in big-files/output, and pushes the results to S3.

process.sh performs only the work that is necessary. To regenerate just a portion of the data, delete the files in big-files/output that need to be replaced.
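The caching behavior described above can be sketched in Python as a "skip if the output already exists" check. This is a minimal illustration, not the actual logic of process.sh: the function name process_if_missing, the produce callback, and the example file name are all hypothetical, and a temporary directory stands in for big-files/output.

```python
import json
import tempfile
from pathlib import Path


def process_if_missing(output_path, produce):
    """Write produce() as JSON to output_path unless the file already exists.

    Returns True if the work was done, False if the cached output was reused.
    (Hypothetical helper, for illustration only.)
    """
    output_path = Path(output_path)
    if output_path.exists():
        return False  # cached output is present, so skip the work
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w") as f:
        json.dump(produce(), f)
    return True


# A temporary directory stands in for big-files/output.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output" / "example.json"
    first = process_if_missing(target, lambda: {"cells": 3})   # does the work
    second = process_if_missing(target, lambda: {"cells": 3})  # reuses the cache
    target.unlink()  # deleting the output forces regeneration
    third = process_if_missing(target, lambda: {"cells": 3})   # does the work again
```

Checking only for existence keeps the logic simple, at the cost of not noticing when an input or a script has changed; deleting the stale output by hand covers that case.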