datacommonsorg/data

question: contributing (larger) data sets / MCF file size?

Opened this issue · 3 comments

I wonder how to contribute some larger-scale datasets to Data Commons.
https://github.com/datacommonsorg/data/blob/master/docs/life_of_a_dataset.md states that, either directly or indirectly, an MCF file needs to be produced (and, I guess, materialized somewhere).

Compared to, e.g., a (partitioned) Parquet file, the MCF format does not seem to offer predicate pushdown or efficient compression.
How does Data Commons handle larger datasets so far? Is there a way to keep a Parquet file plus some mapping (logic) and never (not even eventually) materialize the MCF when ingesting the new dataset into the graph?

For example, here:


Node: l:MNHOWN_E1_R2
typeOf: dcs:StatVarObservation
variableMeasured: dcid:Count_HousingUnit_OwnerOccupied_AsFractionOf_Count_HousingUnit_OccupiedHousingUnit
observationAbout: l:MNHOWN_E0_R2
observationDate: 1985-01-01
value: 70.0

From one line in the original CSV, six rows are created.

Assuming all the variables in a (large) Parquet file were named according to their DCIDs and followed the Data Commons standards: would it be sufficient to have a TMCF mapping file? But this would still mean that somewhere the MCF needs to be (implicitly) created and stored?
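For concreteness, my understanding from the docs is that the TMCF is a template that gets expanded once per CSV row, roughly like the sketch below; the table and column names (HomeownershipTable, ObservedPlace, Date, Value) are hypothetical, so please correct me if I have the format wrong:

Node: E:HomeownershipTable->E1
typeOf: dcs:StatVarObservation
variableMeasured: dcid:Count_HousingUnit_OwnerOccupied_AsFractionOf_Count_HousingUnit_OccupiedHousingUnit
observationAbout: C:HomeownershipTable->ObservedPlace
observationDate: C:HomeownershipTable->Date
value: C:HomeownershipTable->Value

If that is right, the template itself stays tiny no matter how many CSV rows there are, which is why I am asking whether the expanded per-row MCF ever has to exist as a stored artifact.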

If I understand the example here correctly, the aforementioned issue does not seem to be relevant for the TMCF.

But for the MCF, e.g. https://github.com/datacommonsorg/data/blob/master/scripts/fbi/hate_crime/testdata/aggregations_expected/aggregation.mcf#L23, where unique values are rendered as nodes in the graph, I still feel stuck.
Would it be sufficient to only provide the TMCF, i.e. the mapping at the level of the metadata (not the rows / MCF)?

pradh commented

Hello there! Sorry for the delay in responding.

You're right that the MCF format, in its text form, is not very efficient. But it is very human-readable and editable, and that is helpful for the "schema" part of the graph.

So the inputs provided by the contributor are:

  1. TMCF + CSV. This is for the data part (stats or otherwise).
    • For certain large datasets though (e.g., temperature projections) CSV is also not efficient, and in the future we would like to support more compact formats than just CSV (for example, NetCDF).
  2. MCF. This is for the StatisticalVariable/schema part (see the sketch after this list).
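For illustration, the hand-written MCF in (2) is typically a small set of StatisticalVariable and schema nodes. A simplified sketch could look like the following; the property set here is illustrative rather than the canonical definition of any particular variable:

Node: dcid:Count_HousingUnit
typeOf: dcs:StatisticalVariable
populationType: dcs:HousingUnit
measuredProperty: dcs:count
statType: dcs:measuredValue

This part of the graph is small and benefits from being easy to read and edit by hand, which is why plain-text MCF is a good fit for it.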

As you note, the downstream system still needs to render "graph nodes". For that, we have a protobuf-based MCF representation for statistics, which essentially "packs" a time series of stats into a single proto message. But that is somewhat of an implementation detail.

Does that help answer your question?

Do you have documentation/code to learn more about the protobuf rendering/packing of the graph nodes?

Assuming end users want to query (larger) datasets, perhaps via the Python/pandas API: how is this single proto-packed message going to be efficient? In particular, how does it compare to a Parquet file, where the column headers follow the schema graph outlined in a TMCF but actual queries could be performed in regular MPP-style databases?