📑 Load any newline-separated file as a generator.
Currently supporting files located in local filesystem, HTTP(s) endpoint, S3 URI and plaintext, CSV and JSONL formats. Alternatively, you can setup your own virtual URLs.
Install and update using pip:
pip install unilist
from unilist import Unilist
lines = list(Unilist('./file.txt')))
print(lines)
csv = list(Unilist('https://example.com/file.csv'))
print(csv)
# requires Unilist.setup({ ... })
# or /usr/local/bin/aws
records = list(Unilist('s3://example/file.jsonl.gz'))
print(records)
If you don't mind extra dependency (boto3), install with
pip install unilist[boto3]
Example setup
Unilist.setup({
's3': {
'aws_access_key': '___your_access_key___',
'aws_secret_access_key': '___your_secret_key___',
},
})
Alternatively, you can provide a path to aws
binary.
Unilist.setup({
's3': {
'aws_bin': '/usr/local/bin/aws'
}
})
import pandas as pd
df = pd.DataFrame(Unilist('vfs://path/to/file.jsonl'))
Unilist.setup({
's3': {
'cache_dir': '/tmp',
'aws_access_key': '___your_access_key___',
'aws_secret_access_key': '___your_secret_key___',
},
'virtual': {
'vfs': '/custom/root/path',
'c4': './local/c4',
},
'http': {
'headers': {
'accept': 'text/plain',
},
'encoding': 'utf-8',
},
'jsonl': {},
})
git clone git@github.com:petlack/unilist.git
pip install -e .
pip install -e .[boto3]
pipenv run pytest