datonic/datadex

Implement glob patterns on IPFS

davidgasquez opened this issue · 4 comments

For large datasets stored as multiple Parquet/CSV files, it would be much better to have a glob pattern than to write multiple UNION ALL statements.
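For illustration, this is roughly the difference in DuckDB; the gateway URL and file names below are placeholders, not the real dataset layout:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Today: every file behind an IPFS HTTP gateway has to be unioned by hand.
con.sql("""
    SELECT * FROM read_parquet('https://ipfs.io/ipfs/<CID>/part-0.parquet')
    UNION ALL
    SELECT * FROM read_parquet('https://ipfs.io/ipfs/<CID>/part-1.parquet')
""")

# Desired: one glob over everything under the CID. This can't work over a
# plain HTTP gateway, since there's no way to list the directory contents.
# con.sql("SELECT * FROM read_parquet('https://ipfs.io/ipfs/<CID>/*.parquet')")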

A glob could perhaps be supported through an S3-compatible interface to IPFS.
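A rough sketch of how that could look with DuckDB's httpfs extension; the endpoint here is a made-up S3-compatible IPFS gateway, and the bucket-per-CID naming is an assumption, not a documented setup:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Hypothetical S3-compatible gateway in front of IPFS; the endpoint and
# URL style are assumptions.
con.execute("SET s3_endpoint='ipfs-s3-gateway.example.com'")
con.execute("SET s3_url_style='path'")

# With a listable S3 interface, a single glob replaces all the UNION ALLs.
con.sql("SELECT * FROM read_parquet('s3://<CID>/*.parquet')").show()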

Another alternative is to mount IPFS as a local filesystem directory and glob over that. Kubo can do this with its FUSE mounts, and other projects might help as well.
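Here is a minimal sketch of the mount approach, assuming a running Kubo daemon with FUSE mounts enabled (ipfs daemon --mount), which exposes a read-only /ipfs tree; the recursive glob assumes the CSVs sit somewhere under the root CID:

import duckdb

con = duckdb.connect()

# Kubo's FUSE mount exposes every CID as a read-only local directory.
root = "/ipfs/bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe"

# DuckDB can then glob the mount like any local path; ** recurses into
# subdirectories (the exact layout under the CID is assumed here).
con.sql(f"SELECT count(*) FROM read_csv_auto('{root}/**/*.csv')").show()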

In theory, it should be possible to use the fsspec IPFS implementation (ipfsspec) to initialize a PyArrow dataset. In practice, it fails. 😅

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler
import ipfsspec
import duckdb

# Wrap the fsspec IPFS filesystem so PyArrow can use it natively.
fs = ipfsspec.IPFSFileSystem()
pa_fs = PyFileSystem(FSSpecHandler(fs))

con = duckdb.connect()

# Partition fields are encoded in the file names (flavor="filename" below).
sc = pa.schema([("year", pa.int16()), ("month", pa.int16()), ("day", pa.int16())])

data_schema = pa.schema(
    [
        ("height", pa.int64()),
        ("miner_id", pa.string()),
        ("sector_id", pa.string()),
        ("state_root", pa.string()),
        ("event", pa.string()),
        ("year", pa.int16()),
        ("month", pa.int16()),
        ("day", pa.int16()),
    ]
)

part = ds.partitioning(schema=sc, flavor="filename")

# This is the call that fails in practice.
dataset = ds.dataset(
    "bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe",
    filesystem=pa_fs,
    format="csv",
    schema=data_schema,
    partitioning=part,
)
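
If that ds.dataset call ever succeeds, the DuckDB connection created above should be able to query the Arrow dataset directly, since DuckDB's replacement scans pick up Arrow objects by their Python variable name:

# Sketch of the follow-up query once `dataset` materializes; DuckDB scans
# the Arrow dataset by name and pushes column selection down to the files.
con.sql("SELECT year, month, count(*) AS n FROM dataset GROUP BY year, month").show()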