Rebase TSV parser on CSV parser
cschloer opened this issue · 6 comments
Overview
test.py
from dataflows import Flow, load
file_path = "/path/to/test.tsv"
flows = [load(file_path, name="res", format="tsv", skip_rows=["#"])]
print(Flow(*flows).results())
with the file test.tsv:
# This is a comment
#
Lat Lon
33.6062 -117.9312
33.6062 -117.9312
33.6062 -117.9312
I get the error:
Traceback (most recent call last):
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tabulator/stream.py", line 757, in __extract_sample
row_number, headers, row = next(self.__parser.extended_rows)
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tabulator/parsers/tsv.py", line 65, in __iter_extended_rows
for row_number, item in enumerate(items, start=1):
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tsv.py", line 51, in un
if check_line_consistency(columns, values, i, error_bad_lines):
File "/home/conrad/.virtualenvs/laminar/lib/python3.8/site-packages/tsv.py", line 84, in check_line_consistency
raise ValueError(message)
ValueError: Expected 1 fields in line 3, saw 2
It seems like the TSV parser strictly sets the allowed number of fields when it is initialized (https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/parsers/tsv.py#L63). Since the first item in this file is a comment with no tabs, the parser locks the field count at 1 and errors as soon as a line shows up with more fields.
I would fall back to just using the csv module with \t as the delimiter (https://stackoverflow.com/questions/42358259/how-to-parse-tsv-file-with-python), but I keep getting the error "delimiter" must be a 1-character string, and I'm not sure whether that's a result of custom code or not.
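For reference, here is a minimal sketch of that fallback using only the standard library (the comment-skipping is hand-rolled here, not tabulator's skip_rows, and the path is a placeholder):

import csv

with open("/path/to/test.tsv", newline="") as f:
    # Drop comment lines before handing the file to the csv reader
    rows = (line for line in f if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")  # a real one-character tab
    headers = next(reader)
    for row in reader:
        print(headers, row)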
Please preserve this line to notify @roll (lead of this repository)
Hi @cschloer,
I can't reproduce it:
from tabulator import Stream
with Stream('tmp/issue338.tsv', headers=1, format='csv', skip_rows=['#'], delimiter='\t') as stream:
print(stream.headers)
print(stream.read())
# ['Lat', 'Lon']
# [['33.6062', '-117.9312'], ['33.6062', '-117.9312'], ['33.6062', '-117.9312']]
So the original issue (not the workaround) can be reproduced as follows (using format='tsv'):
from tabulator import Stream
with Stream('tmp/issue338.tsv', headers=1, format='tsv', skip_rows=['#']) as stream:
print(stream.headers)
print(stream.read())
I'm unable to reproduce my own issue with the "\t" delimiter when using dataflows and the standard load processor, but I think this bug still exists (with the tsv processor).
The 1-character string issue might be related to my upgrade to Python 3.8 or something...
Looking at the docs, it actually does specify that it should be a 1-character string:
https://docs.python.org/3/library/csv.html#csv.Dialect.delimiter
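A quick check confirms the constraint; csv validates the dialect as soon as the reader is created:

import csv
import io

data = io.StringIO("Lat\tLon\n33.6062\t-117.9312\n")
csv.reader(data, delimiter="\t")       # fine: "\t" is a single tab character
try:
    csv.reader(data, delimiter="\\t")  # two characters: a backslash and a "t"
except TypeError as err:
    print(err)  # "delimiter" must be a 1-character string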
@cschloer I see. The underlying TSV library is not actively developed, so I think we need to switch TSV parsing over to Python's CSV parser. For now, I would recommend using the csv format.
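Presumably the same workaround applies to the original dataflows snippet, since load passes extra options through to tabulator; an untested sketch:

from dataflows import Flow, load

file_path = "/path/to/test.tsv"
flows = [load(file_path, name="res", format="csv", skip_rows=["#"], delimiter="\t")]
print(Flow(*flows).results())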
MERGED into frictionlessdata/frictionless-py#398
Just to follow up on this, I realized that some front-end library I was using was changing "\t" to "\\t" before making the request to the server. Just a note that \t is now working, but it is still not possible to delimit on a multi-character string.
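For anyone hitting the same thing, the difference is easy to check: "\t" is one character (a tab), while "\\t" is two (a backslash and a "t"), so only the former satisfies csv's delimiter constraint:

print(len("\t"))   # 1: a single tab character
print(len("\\t"))  # 2: a literal backslash followed by "t"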