Rebase TSV parser on CSV parser
cschloer opened this issue · 6 comments
Overview
test.py
from dataflows import Flow, load
file_path = "/path/to/test.tsv"
flows = [load(file_path, name="res", format="tsv", skip_rows=["#"])]
print(Flow(*flows).results())
with the file test.tsv:
# This is a comment
#
Lat Lon
33.6062 -117.9312
33.6062 -117.9312
33.6062 -117.9312
I get the error:
Traceback (most recent call last):
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tabulator/stream.py", line 757, in __extract_sample
row_number, headers, row = next(self.__parser.extended_rows)
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tabulator/parsers/tsv.py", line 65, in __iter_extended_rows
for row_number, item in enumerate(items, start=1):
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tsv.py", line 51, in un
if check_line_consistency(columns, values, i, error_bad_lines):
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tsv.py", line 84, in check_line_consistency
raise ValueError(message)
ValueError: Expected 1 fields in line 3, saw 2
It seems like the TSV parser strictly sets the allowed number of fields when it is initialized (https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/parsers/tsv.py#L63). Since the first item in this file is a comment with no tabs, the parser locks the field count at 1 and errors as soon as a line shows up with more fields.
I would fall back to just using the csv module with \t as the delimiter (https://stackoverflow.com/questions/42358259/how-to-parse-tsv-file-with-python), but I keep getting the error "delimiter" must be a 1-character string, and I'm not sure whether that's a result of custom code or not.
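For reference, here is a minimal sketch of that fallback using only the standard library (the comment-skipping is hand-rolled here, not tabulator's skip_rows, and the path is a placeholder):

import csv

with open("/path/to/test.tsv", newline="") as f:
    # Drop comment lines before handing the file to the csv reader
    rows = (line for line in f if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")  # a real one-character tab
    headers = next(reader)
    for row in reader:
        print(headers, row)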
Please preserve this line to notify @roll (lead of this repository)
Hi @cschloer,
I can't reproduce it:
from tabulator import Stream
with Stream('tmp/issue338.tsv', headers=1, format='csv', skip_rows=['#'], delimiter='\t') as stream:
print(stream.headers)
print(stream.read())
# ['Lat', 'Lon']
# [['33.6062', '-117.9312'], ['33.6062', '-117.9312'], ['33.6062', '-117.9312']]
So the original issue (not the workaround) can be reproduced as follows (using format='tsv'):
from tabulator import Stream
with Stream('tmp/issue338.tsv', headers=1, format='tsv', skip_rows=['#']) as stream:
print(stream.headers)
print(stream.read())
I'm unable to reproduce my own issue with the "\t" delimiter when using dataflows and the standard load processor, but I think this bug still exists (with the tsv processor).
The 1-character string issue might be related to my upgrade to Python 3.8 or something...
Looking at the docs, it actually does specify that it should be a 1-character string:
https://docs.python.org/3/library/csv.html#csv.Dialect.delimiter
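A quick check confirms the constraint; csv validates the dialect as soon as the reader is created:

import csv
import io

data = io.StringIO("Lat\tLon\n33.6062\t-117.9312\n")
csv.reader(data, delimiter="\t")       # fine: "\t" is a single tab character
try:
    csv.reader(data, delimiter="\\t")  # two characters: a backslash and a "t"
except TypeError as err:
    print(err)  # "delimiter" must be a 1-character string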
@cschloer I see. The underlying TSV library is not actively developed, so I think we need to switch TSV parsing over to Python's CSV parser. For now, I would recommend using the csv format.
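Presumably the same workaround applies to the original dataflows snippet, since load passes extra options through to tabulator; an untested sketch:

from dataflows import Flow, load

file_path = "/path/to/test.tsv"
flows = [load(file_path, name="res", format="csv", skip_rows=["#"], delimiter="\t")]
print(Flow(*flows).results())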
MERGED into frictionlessdata/frictionless-py#398
Just to follow up on this, I realized that some front-end library I was using was changing "\t" to "\\t" before making the request to the server. Just a note that \t is now working, but it is still not possible to delimit on a multi-character string.
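For anyone hitting the same thing, the difference is easy to check: "\t" is one character (a tab), while "\\t" is two (a backslash and a "t"), so only the former satisfies csv's delimiter constraint:

print(len("\t"))   # 1: a single tab character
print(len("\\t"))  # 2: a literal backslash followed by "t"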