The file is not in csv format
zhenpingfeng opened this issue · 3 comments
zhenpingfeng commented
The file is not in csv format, but is stored with a csv suffix, and some data is omitted.
aws s3 sync --request-payer requester s3://carbonbot/monthly/parsed .
head -n2 binance.inverse_future.l2_event.BTC.USD.BTCUSD_211231.2021-07.csv
timestamp snapshot asks bids seq_id prev_seq_id
1625097600016 false [[36224.3,0.690144461,25000.0,250.0],[36228.1,0.104890955,3800.0,38.0],[36234.1,0.0,0.0,0.0]] [[36205.3,0.005524053,200.0,2.0],[36211.2,0.0,0.0,0.0],[36211.3,0.0,0.0,0.0],[36212.6,0.0,0.0,0.0]]
The seq_id and prev_seq_id values are missing from this row.
What is the best practice for reading these 'csv' files?
zhenpingfeng commented
okay, it seems the format is actually tsv. close
soulmachine commented
Some messages don't have seq_id and prev_seq_id, so there will be empty tab-separated fields at the end of the line.
You can easily parse each line like the following:
# Fields are separated by tabs; the two trailing tabs leave seq_id and prev_seq_id empty.
line = '1625097600016\tfalse\t[[36224.3,0.690144461,25000.0,250.0],[36228.1,0.104890955,3800.0,38.0],[36234.1,0.0,0.0,0.0]]\t[[36205.3,0.005524053,200.0,2.0],[36211.2,0.0,0.0,0.0],[36211.3,0.0,0.0,0.0],[36212.6,0.0,0.0,0.0]]\t\t'
arr = line.split('\t')
assert len(arr) == 6
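Building on the split above, here is a minimal sketch that turns one line into typed values. The `L2Event` type and the `parse_line` helper are hypothetical names, not part of the dataset tooling. The sketch assumes that the asks and bids columns are valid JSON arrays, that snapshot is written as lowercase true/false, and that seq_id and prev_seq_id are integers when present.

```python
import json
from typing import NamedTuple, Optional


class L2Event(NamedTuple):
    """Hypothetical container for one row; field names follow the file header."""
    timestamp: int
    snapshot: bool
    asks: list
    bids: list
    seq_id: Optional[int]
    prev_seq_id: Optional[int]


def parse_line(line: str) -> L2Event:
    # Strip only the line terminator; trailing tabs must survive because
    # they mark the empty seq_id / prev_seq_id fields.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    ts, snapshot, asks, bids, seq_id, prev_seq_id = fields
    return L2Event(
        timestamp=int(ts),
        snapshot=(snapshot == "true"),  # assumption: lowercase true/false
        asks=json.loads(asks),          # assumption: column is valid JSON
        bids=json.loads(bids),
        seq_id=int(seq_id) if seq_id else None,  # empty field -> None
        prev_seq_id=int(prev_seq_id) if prev_seq_id else None,
    )
```

Calling `parse_line(line)` on the example row should give an event whose `seq_id` and `prev_seq_id` are `None`, since those fields are empty.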
soulmachine commented
okay, it seems the format is actually tsv. close
Right, tsv would be more precise, but csv is widely used and it can use tab as the delimiter, so I adopted .csv as the file suffix.
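Since the files are tab-delimited, Python's standard csv module can read them once it is told the delimiter. The following is a sketch rather than the project's own loader: `read_events` is a hypothetical helper, disabling quote handling is an assumption that the data contains no quoted fields, and raising the field size limit is a precaution in case the asks or bids column of a large row exceeds the module's default limit.

```python
import csv
import io

# Precaution (assumption): a large asks/bids column may exceed the default limit.
csv.field_size_limit(2**31 - 1)


def read_events(f):
    """Yield one dict per row, keyed by the header line of the file."""
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Small in-memory demonstration with the same layout as the real files.
sample = (
    "timestamp\tsnapshot\tasks\tbids\tseq_id\tprev_seq_id\n"
    "1625097600016\tfalse\t[[36224.3,0.690144461,25000.0,250.0]]"
    "\t[[36205.3,0.005524053,200.0,2.0]]\t\t\n"
)
rows = list(read_events(io.StringIO(sample)))
```

For a real file, open it with `newline=""` as the csv documentation recommends, for example `open("binance.inverse_future.l2_event.BTC.USD.BTCUSD_211231.2021-07.csv", newline="")`, and pass the file object to `read_events`. Missing seq_id and prev_seq_id values come back as empty strings.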