Implement glob patterns on IPFS
davidgasquez opened this issue · 4 comments
davidgasquez commented
For large datasets stored as multiple Parquet/CSV files, it would be much better to have a glob pattern than to write multiple `UNION ALL` clauses.
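For context, a minimal sketch of the current workaround in DuckDB's Python API, assuming the files are reachable over a public HTTP gateway; the gateway choice and the `part-*.parquet` names are hypothetical:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Hypothetical layout: three Parquet shards under one CID, fetched over a
# public gateway. The part names are placeholders.
base = "https://ipfs.io/ipfs/bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe"
parts = [f"{base}/part-{i}.parquet" for i in range(3)]

# Today every shard has to be stitched together by hand:
query = " UNION ALL ".join(f"SELECT * FROM read_parquet('{url}')" for url in parts)
con.execute(query)

# What native glob support would collapse this into:
#   SELECT * FROM read_parquet('ipfs://<cid>/*.parquet')
```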
davidgasquez commented
Glob reads could perhaps work through an S3 interface to IPFS, since DuckDB already supports glob patterns over S3.
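A sketch of what that might look like with DuckDB's existing S3 settings, assuming some service exposed IPFS content behind an S3-compatible endpoint; the endpoint hostname and bucket path below are placeholders, not a real service:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Placeholder endpoint: assumes an S3-compatible front end over IPFS exists.
con.execute("SET s3_endpoint='ipfs-s3-gateway.example.com';")
con.execute("SET s3_url_style='path';")

# DuckDB's existing S3 glob support would then apply unchanged:
# con.execute("SELECT * FROM read_parquet('s3://<bucket-for-cid>/*.parquet');")
```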
davidgasquez commented
Another alternative is to mount IPFS as a local FS directory and use that. Kubo can do that, and a few other projects might help as well (sketch below).
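With Kubo's FUSE mount (`ipfs mount`, with the daemon running), the DAG shows up under `/ipfs/<cid>` and plain filesystem globbing applies. A minimal sketch, assuming the mount is in place; the `*.parquet` layout under the CID is an assumption:

```python
import duckdb

con = duckdb.connect()

# Assumes `ipfs mount` has exposed /ipfs via FUSE, so the CID is just a
# local directory and DuckDB's ordinary glob handling works on it.
con.execute(
    "SELECT * FROM read_parquet("
    "'/ipfs/bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe/*.parquet')"
)
```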
davidgasquez commented
In theory, it should be possible to use the fsspec IPFS implementation (`ipfsspec`) to initialize a PyArrow dataset. In practice, it fails. 😅
```python
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler
import ipfsspec
import duckdb

# Wrap the fsspec IPFS filesystem so PyArrow can use it.
fs = ipfsspec.IPFSFileSystem()
pa_fs = PyFileSystem(FSSpecHandler(fs))

con = duckdb.connect()

# Partition columns encoded in the file names ("filename" flavor).
sc = pa.schema([("year", pa.int16()), ("month", pa.int16()), ("day", pa.int16())])
data_schema = pa.schema(
    [
        ("height", pa.int64()),
        ("miner_id", pa.string()),
        ("sector_id", pa.string()),
        ("state_root", pa.string()),
        ("event", pa.string()),
        ("year", pa.int16()),
        ("month", pa.int16()),
        ("day", pa.int16()),
    ]
)

part = ds.partitioning(schema=sc, flavor="filename")

# In practice this call fails when going through ipfsspec.
dataset = ds.dataset(
    "bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe",
    filesystem=pa_fs,
    format="csv",
    partitioning=part,
)
```
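If the dataset construction did succeed, the otherwise-unused `con` could query it through DuckDB's Arrow integration; a sketch, with `blocks` as an arbitrary view name:

```python
# Register the Arrow dataset as a DuckDB view and query it lazily.
con.register("blocks", dataset)
con.execute("SELECT event, count(*) FROM blocks GROUP BY event").fetchall()
```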
davidgasquez commented
It works when using https://github.com/AlgoveraAI/ipfspy!