The file is not in csv format

Question

zhenpingfeng opened this issue 2 years ago · 3 comments

The file is not in csv format, but is stored with a csv suffix, and some data is omitted.

aws s3 sync --request-payer requester s3://carbonbot/monthly/parsed .
head -n2 binance.inverse_future.l2_event.BTC.USD.BTCUSD_211231.2021-07.csv

timestamp       snapshot        asks    bids    seq_id  prev_seq_id
1625097600016   false   [[36224.3,0.690144461,25000.0,250.0],[36228.1,0.104890955,3800.0,38.0],[36234.1,0.0,0.0,0.0]]  [[36205.3,0.005524053,200.0,2.0],[36211.2,0.0,0.0,0.0],[36211.3,0.0,0.0,0.0],[36212.6,0.0,0.0,0.0]]

It is missing the seq_id and prev_seq_id data.

What is the best practice for reading these 'csv' files?

Answer 1 · 2023-01-04T07:24:49.000Z

okay, it seems the format is actually tsv. close

Answer 2 · 2023-01-04T07:25:31.000Z

Some messages don't have seq_id and prev_seq_id, so there will be empty tabs at the end of the line.

You can easily parse each line like the following:

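# Tabs are written as \t so the two trailing empty fields (seq_id, prev_seq_id) stay visible.
line = '1625097600016\tfalse\t[[36224.3,0.690144461,25000.0,250.0],[36228.1,0.104890955,3800.0,38.0],[36234.1,0.0,0.0,0.0]]\t[[36205.3,0.005524053,200.0,2.0],[36211.2,0.0,0.0,0.0],[36211.3,0.0,0.0,0.0],[36212.6,0.0,0.0,0.0]]\t\t'
arr = line.split('\t')
# Six fields: timestamp, snapshot, asks, bids, seq_id, prev_seq_id
assert len(arr) == 6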
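
Going one step further, the split fields can be turned into typed values. This is only a rough sketch: `parse_l2_line` and `COLUMNS` are hypothetical names, and it assumes that the snapshot column holds the literal text true or false, that asks and bids are valid JSON arrays, and that seq_id and prev_seq_id are integers when present.

```python
import json

# Column order taken from the header line shown in the question.
COLUMNS = ["timestamp", "snapshot", "asks", "bids", "seq_id", "prev_seq_id"]

def parse_l2_line(line):
    """Parse one tab-separated l2_event row into a dict (hypothetical helper)."""
    # Strip only the line ending; the trailing tabs carry the empty fields.
    fields = line.rstrip("\r\n").split("\t")
    # Pad short rows in case an editor or tool trimmed the trailing tabs.
    fields += [""] * (len(COLUMNS) - len(fields))
    row = dict(zip(COLUMNS, fields))
    return {
        "timestamp": int(row["timestamp"]),
        "snapshot": row["snapshot"] == "true",  # assumption: literal true/false
        "asks": json.loads(row["asks"]),        # assumption: valid JSON arrays
        "bids": json.loads(row["bids"]),
        # Empty fields become None instead of raising on int("").
        "seq_id": int(row["seq_id"]) if row["seq_id"] else None,
        "prev_seq_id": int(row["prev_seq_id"]) if row["prev_seq_id"] else None,
    }

sample = "1625097600016\tfalse\t[[36224.3,0.690144461,25000.0,250.0]]\t[[36205.3,0.005524053,200.0,2.0]]\t\t"
event = parse_l2_line(sample)
```

Mapping the empty fields to None keeps "no sequence number" distinct from a real value of 0.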

Answer 3 · 2023-01-04T07:27:31.000Z

okay, it seems the format is actually tsv. close

Right, TSV would be more precise, but CSV is widely used and can take a tab as its delimiter, so I adopted .csv as the file suffix.
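
As a small illustration of that point, Python's standard csv module reads such a file once it is told the delimiter is a tab. This is a sketch under a few assumptions: the header row is tab-separated like the data rows, the fields contain no quote characters that need special handling, and the in-memory `sample` stands in for a real file opened with `open(path, newline="")`.

```python
import csv
import io

# Stand-in for an opened file, shortened from the rows shown in the question.
sample = io.StringIO(
    "timestamp\tsnapshot\tasks\tbids\tseq_id\tprev_seq_id\n"
    "1625097600016\tfalse\t[[36224.3,0.690144461,25000.0,250.0]]"
    "\t[[36205.3,0.005524053,200.0,2.0]]\t\t\n"
)

# delimiter="\t" switches the reader to tab-separated input;
# QUOTE_NONE leaves any quote characters in the data untouched.
reader = csv.DictReader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    # Missing seq_id / prev_seq_id come back as empty strings.
    print(row["timestamp"], repr(row["seq_id"]), repr(row["prev_seq_id"]))
```

Every value comes back as a string, so asks and bids still need to be decoded separately. Very large snapshot rows may also exceed the csv module's default field size limit, which `csv.field_size_limit` can raise.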