stuchalk/scidata

Large data file support for scidata?

Closed this issue · 3 comments

This project is awesome!

We're looking at using scidata with a project at ORNL to capture data for diffraction, spectroscopy, and simulations. However, our data files are on the order of 1 GB for a small file. We would like to capture the metadata in one file and the bulk data in a fast binary format. The idea is that we would have the metadata file point to the large binary file. Is there a "pointer" element in scidata or should we just make our own?

JJ, thanks for the kind comment (I have not had an "awesome!" yet). Big data files are of course a major concern in a number of different disciplines. I think that your idea of putting metadata in one file and the raw data in a fast binary format is the wisest choice given current technology. However, I would suggest that you also think about extracting the important main features from each spectral file (using some automated processing) and adding those to the SciData file; then there may not be a need to access the raw data. As for a pointer to the raw data file, I have used the "related" keyword, but it may well be a good idea to add a specific link for the raw data that the JSON-LD is 'about'.
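To make the idea concrete, here is a minimal Python sketch of a small metadata file that points at a large binary file and carries a few extracted features. Only the "related" keyword comes from the discussion above; the other key names ("title", "features", "path", "bytes", "sha256") and both helper functions are hypothetical and are not taken from the SciData specification.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a multi-GB file is never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(raw_path, title, features):
    """Return a JSON-serializable dict that points at the raw binary file.

    Hypothetical layout for illustration only; not the SciData schema.
    """
    raw_path = Path(raw_path)
    return {
        "title": title,
        # Summary values extracted from the raw data by some automated step.
        "features": features,
        # "related" is used here as the pointer to the bulk data.
        "related": [
            {
                "path": raw_path.name,
                "bytes": raw_path.stat().st_size,
                "sha256": sha256_of(raw_path),
            }
        ],
    }


# Demo with a tiny stand-in for a large binary file.
with tempfile.TemporaryDirectory() as workdir:
    raw_file = Path(workdir) / "run_0001.bin"
    raw_file.write_bytes(bytes(range(256)))
    metadata = build_metadata(raw_file, "Example run", {"peak_count": 3})
    (Path(workdir) / "run_0001.json").write_text(json.dumps(metadata, indent=2))
```

Storing the size and checksum next to the path lets a reader confirm that the metadata and the binary file still belong together without opening the bulk data.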

On a separate note, I got an NSF grant last year (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1835643) where I am going to use SciData to aggregate a variety of different types of data and then put it all in a graph database. As part of the project I am developing a Python library to create SciData and am always looking for use cases. Also, if you have data that is open I can convert it and put it in the system as well, and as a consequence we would have to work together to make sure the data is correctly migrated. Any interest? If so, please let's take the discussion to email: schalk@unf.edu.

@stuchalk I'm in @jayjaybillings's group and beginning to try out SciData on some of our current data (sorry, closed data initially). That said, we could very easily produce open data as well and will probably need to do so for testing and future features. We would gladly share that data once we have it.

I did want to ask if the Python library / package is an open-source project? Would gladly contribute and help the development process to benefit both projects. Thanks!

Thanks @stuchalk!

I'll reach out about the topic of your second paragraph. I think we would be very interested in discussing further.