matloff/partools

Idea: Store metadata with files

Opened this issue · 5 comments

It might be nice if filesave stored metadata along with each data file, describing column types, number of rows, presence of NAs, etc. This information could then be used to make reads more efficient.
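To make the "more efficient reads" point concrete, here is a minimal sketch in base R. It assumes the metadata has already been loaded into a plain list; the list and its field names (colNames, colClasses, nrows, delimiter) are illustrative, not something filesave currently produces.

```r
# Hypothetical metadata for one chunk file; field names are illustrative only.
meta <- list(
  colNames   = c("height", "age"),
  colClasses = c("numeric", "integer"),
  nrows      = 3L,
  delimiter  = "|"
)

# Write a small delimited chunk to a temporary file so the sketch is self-contained.
chunk_file <- tempfile(fileext = ".txt")
writeLines(c("1.5|10", "1.7|12", "1.6|11"), chunk_file)

# With colClasses supplied, read.table does not have to guess column types,
# and a known nrows helps its memory usage.
chunk <- read.table(
  chunk_file,
  header     = FALSE,
  sep        = meta$delimiter,
  col.names  = meta$colNames,
  colClasses = meta$colClasses,
  nrows      = meta$nrows
)

str(chunk)
```

The same idea should carry over to other readers that accept column types and row counts up front.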

Where would the metadata be stored?

I was thinking more of a separate file named after the data file. Something like this, which could be called x.json:

{
    "nchunks": 17,
    "random_order": true,
    "nrows": 175,
    "colNames": ["height", "age"],
    "colClasses": ["numeric", "integer"],
    "NA_exist": false,
    "format": "text",
    "delimiter": "|",
    "chunks": {
        "1": {
            "nrows": 10,
            "filename": "x1.txt"
        },
        "2": {
            "nrows": 10,
            "filename": "x2.txt"
        },
        ....
        "17": {
            "nrows": 5,
            "filename": "x17.txt"
        }
    }
}

This leaves you free to cat the files back together.
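As a rough sketch of how the metadata could drive that concatenation, the base R snippet below checks the per-chunk row counts against the total and then joins the chunk files in the listed order. The meta list is a hand-written stand-in for a parsed x.json, and the two tiny chunk files are made up for the example.

```r
# Create two small chunk files in a temporary directory (stand-ins for x1.txt, x2.txt).
chunk_dir <- tempfile("chunks")
dir.create(chunk_dir)
writeLines(c("1.5|10", "1.7|12"), file.path(chunk_dir, "x1.txt"))
writeLines("1.6|11", file.path(chunk_dir, "x2.txt"))

# Hypothetical in-memory form of the metadata (as if already parsed from x.json).
meta <- list(
  nrows  = 3L,
  chunks = list(
    list(nrows = 2L, filename = "x1.txt"),
    list(nrows = 1L, filename = "x2.txt")
  )
)

# Cheap consistency check that needs no access to the data files.
chunk_rows <- vapply(meta$chunks, function(ch) ch$nrows, integer(1))
stopifnot(sum(chunk_rows) == meta$nrows)

# Concatenate the chunks in the order the metadata lists them.
all_lines <- unlist(lapply(
  meta$chunks,
  function(ch) readLines(file.path(chunk_dir, ch$filename))
))
combined <- file.path(chunk_dir, "x.txt")
writeLines(all_lines, combined)

# The combined file should have as many lines as the metadata promises.
stopifnot(length(readLines(combined)) == meta$nrows)
```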

Using save() also sounds like a good idea, since it avoids having to parse the text twice. One would then have several chunks in .Rdata files. Going beyond that, you could even let the user choose the serialization format for the chunks, e.g. feather.
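A minimal sketch of the save() variant, using only base R. The toy data frame, the round-robin split, and the file names are assumptions made for illustration; this is not how partools currently writes chunks.

```r
# Toy data frame standing in for the full data set.
x <- data.frame(height = c(1.5, 1.7, 1.6, 1.8), age = c(10L, 12L, 11L, 13L))

# Assign rows to two chunks round-robin; the chunk count is arbitrary here.
nchunks <- 2L
chunk_id <- rep(seq_len(nchunks), length.out = nrow(x))
chunks <- split(x, chunk_id)

out_dir <- tempfile("rdata_chunks")
dir.create(out_dir)

# save() stores the object under its variable name, so each file holds `chunk`.
files <- character(nchunks)
for (i in seq_len(nchunks)) {
  chunk <- chunks[[i]]
  files[i] <- file.path(out_dir, paste0("x", i, ".Rdata"))
  save(chunk, file = files[i])
}

# load() restores `chunk` with its column types intact; no text parsing is needed.
env <- new.env()
load(files[1], envir = env)
str(env$chunk)
```

Loading into a separate environment keeps the restored object from overwriting anything in the workspace.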

With any of these I think that having metadata stored separately as above would be useful, since we can very cheaply read the metadata and get some notion of how to efficiently perform the computation.

Long term I'm thinking about making computations lazy, so that one can analyze the R code together with the data sizes and come up with a potentially more efficient execution. This is along the lines of other systems like Spark and dask, which don't do anything until one calls compute().

But this is more ambitious, and could even be a different project that uses partools as a dependency.
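To illustrate the lazy idea on a very small scale, here is a toy sketch in base R. The names lazy_chunk, lazy_apply and compute are hypothetical and exist only in this example; they are not part of partools, Spark or dask. Operations are queued, the row count is answered from metadata, and the data is read only when compute() is called.

```r
# Hypothetical lazy wrapper: holds a reader function, known metadata, and queued ops.
lazy_chunk <- function(read_fun, nrows) {
  structure(list(read_fun = read_fun, nrows = nrows, ops = list()),
            class = "lazy_chunk")
}

# Queue an operation instead of running it.
lazy_apply <- function(lz, f) {
  lz$ops <- c(lz$ops, list(f))
  lz
}

# Only compute() touches the data: it reads the chunk, then replays the queued ops.
compute <- function(lz) {
  value <- lz$read_fun()
  for (f in lz$ops) value <- f(value)
  value
}

reads <- 0L
reader <- function() {
  reads <<- reads + 1L  # count how often the data is actually read
  data.frame(height = c(1.5, 1.7, 1.6), age = c(10L, 12L, 11L))
}

lz <- lazy_chunk(reader, nrows = 3L)
lz <- lazy_apply(lz, function(d) d[d$age > 10L, ])
lz <- lazy_apply(lz, function(d) mean(d$height))

rows_known <- lz$nrows        # answered from metadata, without reading
reads_before <- reads         # no read has happened yet
result <- compute(lz)         # the single read happens here
```

A real system would inspect the queued operations together with the data sizes before choosing an execution plan; this sketch only shows the deferral itself.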

We could have just one file for that, by the way, non-distributed.

Yes, that's what I had in mind also.

Travis is reporting an error because R CMD check doesn't pass. I'll look at it now.