Implement glob patterns on IPFS
davidgasquez opened this issue · 4 comments
davidgasquez commented
For large datasets stored as multiple Parquet/CSV files, it would be much better to have a glob pattern than to write multiple `UNION ALL` clauses.
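For context, a minimal sketch of the current workaround in DuckDB's Python API, assuming the files are reachable over a public HTTP gateway; the gateway choice and the `part-*.parquet` names are hypothetical:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Hypothetical layout: three Parquet shards under one CID, fetched over a
# public gateway. The part names are placeholders.
base = "https://ipfs.io/ipfs/bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe"
parts = [f"{base}/part-{i}.parquet" for i in range(3)]

# Today every shard has to be stitched together by hand:
query = " UNION ALL ".join(f"SELECT * FROM read_parquet('{url}')" for url in parts)
con.execute(query)

# What native glob support would collapse this into:
#   SELECT * FROM read_parquet('ipfs://<cid>/*.parquet')
```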
davidgasquez commented
Glob reads could perhaps work through an S3 interface to IPFS, since DuckDB already supports glob patterns over S3.
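A sketch of what that might look like with DuckDB's existing S3 settings, assuming some service exposed IPFS content behind an S3-compatible endpoint; the endpoint hostname and bucket path below are placeholders, not a real service:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Placeholder endpoint: assumes an S3-compatible front end over IPFS exists.
con.execute("SET s3_endpoint='ipfs-s3-gateway.example.com';")
con.execute("SET s3_url_style='path';")

# DuckDB's existing S3 glob support would then apply unchanged:
# con.execute("SELECT * FROM read_parquet('s3://<bucket-for-cid>/*.parquet');")
```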
davidgasquez commented
Another alternative is to mount IPFS as a local FS directory and use that. Kubo can do that, and a few other projects might help as well (sketch below).
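With Kubo's FUSE mount (`ipfs mount`, with the daemon running), the DAG shows up under `/ipfs/<cid>` and plain filesystem globbing applies. A minimal sketch, assuming the mount is in place; the `*.parquet` layout under the CID is an assumption:

```python
import duckdb

con = duckdb.connect()

# Assumes `ipfs mount` has exposed /ipfs via FUSE, so the CID is just a
# local directory and DuckDB's ordinary glob handling works on it.
con.execute(
    "SELECT * FROM read_parquet("
    "'/ipfs/bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe/*.parquet')"
)
```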
davidgasquez commented
In theory, it should be possible to use the fsspec IPFS implementation (`ipfsspec`) to initialize a PyArrow dataset. In practice, it fails. 😅
```python
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler
import ipfsspec
import duckdb

# Wrap the fsspec IPFS filesystem so PyArrow can use it.
fs = ipfsspec.IPFSFileSystem()
pa_fs = PyFileSystem(FSSpecHandler(fs))

con = duckdb.connect()

# Partition columns encoded in the file names ("filename" flavor).
sc = pa.schema([("year", pa.int16()), ("month", pa.int16()), ("day", pa.int16())])
data_schema = pa.schema(
    [
        ("height", pa.int64()),
        ("miner_id", pa.string()),
        ("sector_id", pa.string()),
        ("state_root", pa.string()),
        ("event", pa.string()),
        ("year", pa.int16()),
        ("month", pa.int16()),
        ("day", pa.int16()),
    ]
)

part = ds.partitioning(schema=sc, flavor="filename")

# In practice this call fails when going through ipfsspec.
dataset = ds.dataset(
    "bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe",
    filesystem=pa_fs,
    format="csv",
    partitioning=part,
)
```
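If the dataset construction did succeed, the otherwise-unused `con` could query it through DuckDB's Arrow integration; a sketch, with `blocks` as an arbitrary view name:

```python
# Register the Arrow dataset as a DuckDB view and query it lazily.
con.register("blocks", dataset)
con.execute("SELECT event, count(*) FROM blocks GROUP BY event").fetchall()
```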
davidgasquez commented
It works when using https://github.com/AlgoveraAI/ipfspy!