[FeatureRequest] - data by URI (parquet format)

Question

Opened this issue 2 months ago · 2 comments

E.g. hosting the data on S3 in Parquet format,

either as an entire collection (x, y, z, n, n1, ..., nz), with the browser used to select the column/attribute,
or as
a spatial index (xyz) (URI) plus a data column (URI), if the contents have already been separated into multiple Parquet files.

Load the Parquet file, then view the data attribute.

Answer 1 · 2024-09-30T09:42:05.000Z

Hi,
That comes in two parts.

~ ~ ~
Loading data by URI should have worked (subject, of course, to CORS), either by pasting the URI string onto the running system, e.g. paste the string
https://files.rcsb.org/download/2ayo.pdb
or by using startdata, e.g. run
https://sjpt.github.io/xyz/xyz.html?&startdata=https://files.rcsb.org/download/2ayo.pdb
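As a rough illustration of the startdata form, here is a minimal Python sketch that builds such a link for an arbitrary data URL. The helper name `startdata_link` is hypothetical, and the data URL is passed through unencoded simply because that is the form shown in the example; whether xyzviewer also accepts a percent-encoded value is not stated here.

```python
# Viewer page taken from the example link in this reply.
VIEWER_URL = "https://sjpt.github.io/xyz/xyz.html"


def startdata_link(data_url: str, viewer_url: str = VIEWER_URL) -> str:
    """Return a viewer link that asks xyzviewer to load data_url on startup.

    Hypothetical helper: it only mirrors the URL shape shown in the example,
    i.e. xyz.html?&startdata=<data url>, with the data URL left unencoded.
    """
    return f"{viewer_url}?&startdata={data_url}"


link = startdata_link("https://files.rcsb.org/download/2ayo.pdb")
print(link)
```

The host serving the data still has to allow cross-origin requests (CORS) for the viewer to fetch it.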

I introduced a bug when I set up some local proxy experiments so that wasn't working; I have now fixed that.

~ ~ ~
Parquet format: I wasn't aware of that format; I've been doing other things more recently. From a very quick, superficial look it appears to do something similar to our tdata format: it aims for efficient storage and fast read speed, using column-based storage with separated metadata. Parquet has the big benefit of being a standard, and of offering row filtering as well as column selection.

I'll ask my son about it; he is actively working on visualization of big biological data sets (many times bigger than tdata would sensibly handle, though Parquet with row filtering probably would). I'm not sure of the exact status of that work. It may be that parts of it are open source and would be a better starting point than our now somewhat out-of-date (but still very fast) xyzviewer. I'll check.

I'll have a bit more of a dig. As long as there is a suitable library it shouldn't be difficult to add parquet to xyzviewer, but I probably won't get time to do it.

Answer 2 · 2024-09-30T10:08:22.000Z

Look at https://github.com/Taylor-CCB-Group/MDV to see if it might suit.
Gemini says it supports Parquet, but that is not correct, though they are considering it.
(By chance, the group was discussing Parquet when I messaged my son about it.)
They are using HDF5 and Zarr.

We had xyzviewer embedded in MDV some years ago, but I don't think that version became public, partly because of the use of our non-standard tdata format for larger datasets. We could have used other formats, but at that time all the other available formats had absurdly long load times for the larger datasets.