Put data somewhere else
Closed this issue · 4 comments
From black_holes_backend created by sauln : codeforgoodconf/black_holes_backend#7
Currently, all the data is stored directly in the git repo. As we get more data to train on, this will become too big. The data should be stored somewhere it can be retrieved from with a wget command.
- Consolidate raw data in a single zip
- Store the data somewhere (George Mason servers? AWS free-tier?)
- Develop a setup script to pull the data from the server and run basic preprocessing (rough sketch below).
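As a strawman for that last bullet, here is a minimal sketch of what the setup script could look like. The URL, paths, and `preprocess()` hook are all placeholders until we decide where the consolidated zip actually lives and what the basic preprocessing is.

```python
# setup_data.py -- rough sketch only; URL, paths, and preprocess() are placeholders.
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen

DATA_URL = "https://example.org/black_holes_data.zip"  # placeholder location
RAW_DIR = Path("data/raw")


def fetch_and_extract(url: str = DATA_URL, dest: Path = RAW_DIR) -> None:
    """Download the consolidated zip and unpack it into dest."""
    dest.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as resp:
        zipfile.ZipFile(io.BytesIO(resp.read())).extractall(dest)


def preprocess(raw_dir: Path = RAW_DIR) -> None:
    """Stand-in for whatever basic preprocessing we agree on."""
    pass  # TBD


if __name__ == "__main__":
    fetch_and_extract()
    preprocess()
```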
sauln:
How expensive are the various preprocessing steps? Would it be better to store a processed version of the data instead of the original data?
frankamp:
There are also conflicts within the data we have now (fits that appear in neg/pos and/or unknown-w-HE2). My rec: start by deleting it all from both repos and requesting all new fits in the three categories.
The preprocessing we do for ML isn't appropriate for visualization, so it would be a third, intermediate format. I don't think there is any point in transforming the data just for storage savings.
There is also the argument that an ML engine could find another supporting set of wavelengths, or a mitigating quality factor in the other flags that appear in the fits format.
Finally, GM (or whoever is doing similar work) should already have all these fits on disk; if not, we should find a group to share with as an unmodifiable source set.
Barring that, Amazon offers free open-data publishing on S3; we would just need to convince them of the big-data potential.
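To make the conflicts point concrete, something like the sketch below would flag any fits that show up in more than one category. The `data/<category>` layout and directory names are guesses; adjust to wherever the repos actually keep them.

```python
# Flag fits files that appear in more than one category directory.
# The data/<category> layout and directory names are assumptions.
from collections import defaultdict
from pathlib import Path

CATEGORIES = ["neg", "pos", "unknown-w-HE2"]


def find_conflicts(root=Path("data")):
    seen = defaultdict(list)
    for cat in CATEGORIES:
        for fits in (root / cat).glob("*.fits"):
            seen[fits.name].append(cat)
    # Keep only filenames that appear in two or more categories.
    return {name: cats for name, cats in seen.items() if len(cats) > 1}


if __name__ == "__main__":
    for name, cats in find_conflicts().items():
        print(f"{name} appears in: {', '.join(cats)}")
```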
sauln:
So all the data we have now is junk?
Sounds like getting S3 space would require a written proposal? Sean already has a few pending, so maybe that will be made available to us in the future.
In Slack, Matthew (@simian201) mentioned Drive or Dropbox as an immediate solution for getting the data out of the repo.
Is there a sense of how much storage is needed, both in the short and long term?
Probably a couple of gigs for algorithm development. Production data could easily be 100x that, though.