eBay/tsv-utils

tsv-summarize: doesn't support windows line-ending

Halmaethor opened this issue · 6 comments

When aggregating the last column, the prebuilt binary for linux (v1.1.15) returns the following error:

' when converting from type const(char)[] to type doubleUnexpected '
File: dummy.tsv Line: 2

I've used tsv-append, tsv-uniq and tsv-select, and each works just fine. After updating line-endings to unix, tsv-summarize works.

Thanks for the report and for trying the tools.

The only tool explicitly supporting Windows line-endings on Linux is csv2tsv. You are correct, tsv-append works fine, and a couple other tools as well, but it's more accidental than by design.

What's going on is that the tools are using D standard library functions for reading lines, these functions assume unix line endings on unix platforms. If the file has Windows line ending (a \r\n pair), the line is left with an extraneous \r character at the end of the last field. If this extraneous \r interfers with processing, the tool doesn't work.

In the case of tsv-summarize and tsv-filter, if they try to interpret that the last field as numeric value, the conversion will fail. However, even if they don't perform a conversion, the tools are not necessarily working correctly. tsv-select for example, isn't really preserving Windows line endings.

As an example:

$ # This outputs the Windows line endings
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | grep $'YY\r'
AA	XX	YY
BB	XX	YY
$ # Select the last field.
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 3
YY
YY
$ # Grep shows the Windows line ending
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 3 | grep $'YY\r'
YY
YY
$ # Select the second field
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 2
XX
XX
$ # Grep shows its not a Windows line ending.
$ echo $'AA\tXX\tYY\r\nBB\tXX\tYY\r' | tsv-select -f 2 | grep $'XX\r'
$

I'm not inclined to add support for Windows line endings on Unix platforms. The dos2unix tool is a good tool for this and fits the pipeline approach being used by the tools.

However, something is going wrong with the error message formatting, and the error message should identify a Windows line ending as a likely problem. The documentation for the tools should also discuss line endings. I'll have to look into both of these.

Regarding csv2tsv - This tool explicitly supports Windows line-endings because they are commonly used in many programs that generate CSV files.

The badly formatted error message is due to the \r character being included in the error message. I'll have to fix it.

Thanks for your explanation and the great tools.

As you said, I used dos2unix to convert the line endings, so this issue wasn't deterrent to my work. Updating the documentation and especially the error message is more than enough to solve this issue.

Current plan: On Unix builds, check for Windows/DOS line endings when processing the first line of a file. That should handle most cases prior to ever hitting the error message. Regarding the poor error message format: There's an open D bug for it: https://issues.dlang.org/show_bug.cgi?id=17708

Addressed by PR #103, merged to master. Will be included in the next release.

Included in release v.1.1.16.