eBay/tsv-utils

Issue with installing on Windows 10 using D / build failure

Opened this issue · 28 comments

The build fails at the "tsv-split" step when installing tsv-utils. It installs the programs before this one, fails on this one, and quits.
Thank you

(screenshot: D for win error)

Thanks for the report, and for trying to build tsv-utils on Windows. I don't have the resources to test on Windows, so it's helpful when people try it and file reports. This is why Unix and macOS are the primary platforms; I fix Windows issues when I can do so easily. In this case, tsv-split uses some Unix-specific features, so it won't build on Windows. I have no plans to fix this at present. I may be able to remove it from the Windows build, and I'll look into that.

My recommendations for what you can do now:

  • Try using Windows Subsystem for Linux (WSL). This is Microsoft's compatibility layer for running Linux command-line tools. I'm not aware of anyone trying to run tsv-utils on it, but it should just work, including with the pre-built Linux binaries. The tools would then be running in an environment that matches what they've been tested against.
  • Try removing tsv-split from the build in your own copy of the code. The place to do this is in dub_build.d, line 59.
  • Try using an older version of the library, prior to tsv-split being added. tsv-split was added in release v1.6.0, so you could try release v1.5.0.

If you take any of these steps please report on how it works out.

Hi Anya,

Thanks for the report. Could you repost the images in your last message? They didn't get attached correctly, and I'd like to see the details.

--Jon

Hi Jon, Please let me know if you are able to open these Thank you

No. Are you replying via email? I don't know if that will attach images; they show up as text blocks like [image.png]. It may be necessary to go to the GitHub web interface and upload them there.

@AnnaBanana90210 - Thanks. This is very helpful. There are clearly a couple of different build issues.

JC3 commented

I just built this on Windows 10 with minimal issues, using 2.1.1 from dub.

The tsv-split issue was relatively easy to solve:

  1. Go into tsv-utils\tsv-split\src\tsv_utils\tsv-split.d.
  2. Go down to line 669, the rlimitCurrOpenFilesLimit() function.
  3. Blow away the entire function body and replace it with just return uint.max:
uint rlimitCurrOpenFilesLimit()
{
	return uint.max;
}

It's not technically the correct change (I don't know any D, I just installed the compiler to build these tools, so I'm not sure how to get the correct file limit with D on Windows), but it gets the job done well enough. 512 is probably a safer value on Windows.

Other than that I had no issues with a windows build via dub and the tools all seem to be working just fine.

Have you had any luck with using tsv-summarize? I get the following errors when trying to run a command (sorry for all the black-out)
(screenshot: tsv-summarize)

JC3 commented

It's an issue with line endings. Debugging now, confidence is low due to lack of D skills, will post back.

I also had issues with using this line:
tsv-summarize --header --group-by 4 --count < test.tsv | keep-header -- sort -r -t ' ' -k2 -n | number-lines --header --s "rank" | tsv-pretty -u | keep-header -- grep "United States"
I don't really see the point of using these tools on windows... time to install ubuntu.

@JC3 - Hey, thanks for looking into this.

The tsv-split issue was relatively easy to solve:

  1. Go into tsv-utils\tsv-split\src\tsv_utils\tsv-split.d.
  2. Go down to line 669, the rlimitCurrOpenFilesLimit() function.
  3. Blow away the entire function body and replace it with just return uint.max:
    ...
    It's not technically the correct change (I don't know any D, I just installed the compiler to build these tools, so I'm not sure how to get the correct file limit with D on Windows), but it gets the job done well enough. 512 is probably a safer value on Windows.

I took a look at this also, and agreed, it's not that hard to fix. Essentially what you did, except using a D version statement to do platform-specific compilation. The simple form would be something like:

uint rlimitCurrOpenFilesLimit()
{
    version(Posix)
    {
        ... the current code ...
    }
    else
    {
        return 512;
    }
}

A bit better would be to generalize the API so it isn't named rlimit, which is a Posix concept, but it would still use platform-specific compilation. The 512 value you listed seems to be the correct value on Windows, because underneath it's the C APIs that are used. Better still would be _getmaxstdio. That doesn't have an exported API in the D libraries at the moment, but the docs say 512 is the default.
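To see the dispatch pattern in isolation, here's the same idea sketched in Python (used only because it's easy to run anywhere; the actual fix is the D version block above). The function name and the 512 fallback mirror this discussion, not real tsv-utils code:

```python
def current_open_files_limit(default_limit=512):
    """Soft limit on open files: rlimit on POSIX, a fixed default elsewhere."""
    try:
        import resource  # POSIX-only module; ImportError on Windows
    except ImportError:
        # Windows: the MSVC runtime's default stdio limit is 512
        # (queryable via _getmaxstdio, raisable via _setmaxstdio).
        return default_limit
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft

print(current_open_files_limit())
```

Either branch returns a usable limit, so callers such as tsv-split's file-handle pool never need to know which platform they're on.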

@AnnaBanana90210

Have you had any luck with using tsv-summarize? I get the following errors when trying to run a command (sorry for all the black-out)

Hard to follow with all the black-out, but @JC3 is most likely right. Handling of Windows CRLF line endings is untested and may well have issues in different tools. On Posix platforms there is a test ensuring Unix-style LF line endings.

What I recommend is running dos2unix or the equivalent on your files to convert the line endings to Unix style, and seeing if that fixes the issue you are having. Of course, you'll have to decide for yourself whether converting files to Unix-style line endings is in your interest.

By the way, what is likely happening is that the lines are ending with Windows CRLF, but line feeds are being found as LF, with the CR being left attached to the last field in the line. That's why the conversion to double is failing.
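A minimal illustration of that failure mode (sketched in Python; the field values are made up). Note that Python's float() happens to tolerate a stray CR, but D's to!double rejects it, which is what produces the conversion errors:

```python
# A Windows CRLF line split only on LF leaves the CR glued to the last field.
line = "Alice\t3.14\r\n"

# Naive: strip only the LF terminator, as a Unix-only reader would.
fields_bad = line.rstrip("\n").split("\t")
print(repr(fields_bad[-1]))   # the CR is still attached: '3.14\r'

# Robust: strip CR and LF before splitting.
fields_ok = line.rstrip("\r\n").split("\t")
print(repr(fields_ok[-1]))    # clean: '3.14'
print(float(fields_ok[-1]))   # numeric conversion now succeeds
```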

JC3 commented

@jondegenhardt

By the way, what is likely happening is that the lines are ending with Windows CRLF, but line feeds are being found as LF, with the CR being left attached to the last field in the line. That's why the conversion to double is failing.

This is exactly what appears to be happening. If I figure out a way to fix it I'll post here. Sort of flailing around in the dark trying to speed read the D API docs.

Output is fine, line endings are translated to CRLF. Input, the CR is staying attached to the last field.

JC3 commented

@AnnaBanana90210 In the meantime the dos2unix solution is working, with the general strategy being to run the output of every tsv-summarize command in a pipe through dos2unix, e.g.:

tsv-summarize ... | dos2unix | tsv-summarize ... | dos2unix

It probably applies to some (or all) of the other utils too, although tsv-select works fine as-is since I guess it just passes lines through in their entirety. If you're using GnuWin32, it's in the CygUtils package.

I know that was vague. But it's late.

In the meantime the dos2unix solution is working, with the general strategy being to run the output of every tsv-summarize command in a pipe through dos2unix, e.g.:

tsv-summarize ... | dos2unix | tsv-summarize ... | dos2unix

Hm... Nice that that works, but it seems unacceptably inconvenient. It points out the necessity of having a comprehensive Windows test suite.

JC3 commented

Yeah; the plot twist here is AnnaBanana90210, purely by coincidence, happens to be in the same class that the person I am currently helping is in. And I think their assignment is due tomorrow. 😂

Well, if it's for a class assignment, I'd suggest trying either the Windows Subsystem for Linux (WSL) or Docker for Windows and simply running in a Unix environment. If these are options, it would greatly simplify things.

JC3 commented

@jondegenhardt

By the way, what is likely happening is that the lines are ending with Windows CRLF, but line feeds are being found as LF, with the CR being left attached to the last field in the line. That's why the conversion to double is failing.

Afaict, the root of the issue is at utils.d line 941:

                    /* Read the next block. */
                    _dataEnd +=
                        _file.rawRead(_buffer[_dataEnd .. _dataEnd + readSize])
                        .length;

The custom buffered line reader thing uses File.rawRead which, according to the docs:

rawRead always reads in binary mode on Windows.

Thus bypassing translation of CRLF to LF on input (and, at utils.d:909, multi-character line endings aren't supported). How this is accomplished, I have no idea, as fread performs the translations on text-mode streams (and may return less data than requested as a consequence). Perhaps the D API sets stdin to binary mode (it opens in text mode by default on Windows but is changeable)? Or something.

As an aside, and of equal concern, I also noticed that named files are opened in binary mode ("rb") instead of text mode as well (line 2018 for this buffered thing, and also line 1654 for the InputSource utility).

I'm not exactly sure what the best solution is, because while I could hack CRLF handling into the buffered reader thing, what really should be happening is the input streams, stdin included, being opened in text mode (can happen on all platforms btw, the "b" and "t" mode specified is ignored by fopen on Linux and Mac).
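As an illustration of the text-mode vs. binary-mode distinction being discussed (a Python sketch; Python's universal-newline text mode translates CRLF to LF on every platform, much like text-mode fopen does on Windows):

```python
import os
import tempfile

# Write raw CRLF-terminated bytes to a scratch file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x\t1\r\ny\t2\r\n")

with open(path, "rb") as f:   # binary mode: bytes pass through untouched
    raw = f.read()

with open(path, "r") as f:    # text mode: CRLF is translated to LF on read
    text = f.read()

os.remove(path)
print(b"\r\n" in raw, "\r" in text)   # CR survives in binary, not in text
```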

The other reason I'm not sure what the best solution is: I can't wrap my head around what's happening under the hood. rawRead is documented as being equivalent to fread, but that's an inconsistency in the documentation, as fread does not have the ability to "always read in binary mode under Windows"; the file mode is file state, not per-read state.

There's also the possibility to just implement a workaround on Windows and make the buffered reading thing do some line ending translation on its own, I guess.

I tried throwing a _file.reopen(null, "rt") in there right after _file is assigned in the constructor, but it fails with "invalid file descriptor" and I didn't look into it any further.

@JC3

Afaict, the root of the issue is at utils.d line 941:

This is nice detective work, thanks. The issue though isn't in the rawRead call, it's a few lines earlier in the file, in the front routine. In particular, line 886:

   immutable end = (_buffer[_lineEnd - 1] == terminator) ? _lineEnd - 1 : _lineEnd;

This line strips the newline, if the caller asked to have newlines stripped. This could do a Windows platform check and look for a CRLF rather than just an LF. The rawRead invocation isn't the right place to do it, as it's reading a block, so multiple lines, rather than reading line-by-line.
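The platform-aware strip could look something like this (sketched in Python rather than D; chomp is a hypothetical name, not the actual tsv-utils routine):

```python
def chomp(line: str, terminator: str = "\n") -> str:
    """Strip one trailing line terminator, accepting both LF and CRLF."""
    if line.endswith("\r\n"):
        return line[:-2]
    if line.endswith(terminator):
        return line[:-1]
    return line            # final line of a file may have no terminator

print(repr(chomp("a\t1\n")))     # LF-terminated
print(repr(chomp("a\t1\r\n")))   # CRLF-terminated
print(repr(chomp("a\t1")))       # unterminated
```

All three calls yield the same stripped line, which is the behavior line 886 would need on Windows input.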

That might handle a fair number of cases. I expect there will be others though, as the different tools use multiple I/O mechanisms. It's also not obvious to me whether the tools should be writing CRLFs, as tsv-summarize does, even on a Windows platform. The issue is that if data files are being shared with others, it may well be better to write them with Unix newlines.

JC3 commented

@jondegenhardt

PS

The issue is that if data files are being shared with others, it may well be better to write them with Unix newlines.

My preferred approach here, and the approach I see most often in the wild, is to have tools consistently use the current platform's line endings on output and, if possible, recognize any line-ending style on input. There's no real reason to enforce one style of line ending on input; at most you could warn if a given source has internally inconsistent line endings as a sanity check, but other than that... 🤷‍♂️. This avoids surprises and interoperability issues between programs on the same platform, and also often covers shared data implicitly. I.e., it usually just works, on all platforms.

Then users who do share data can just use appropriate conversion tools on shared data if the need arises, which they're probably already accustomed to anyways, instead of having to second guess how their tools are treating things.

Note also, FWIW, that the IANA flavor of TSV data (or at least the text/tab-separated-values media type) does not explicitly constrain which line-ending style must be used. So regardless, for the program to conform to that recommendation, it would need to recognize all styles independently of the current platform. For example, the current logic of explicitly rejecting Windows-style line endings on Unix systems means Unix builds can reject TSV data that is technically valid according to the MIME recommendations. This might actually be the most compelling argument for making the input stream handling more robust, even setting Windows aside.
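For what it's worth, "recognize any line-ending style on input" is exactly the behavior of, e.g., Python's str.splitlines, shown here purely as an illustration of what's being proposed:

```python
# str.splitlines recognizes LF, CRLF, and lone CR line endings,
# so mixed-style input still splits into clean records.
data = "a\t1\nb\t2\r\nc\t3\r"
records = data.splitlines()
print(records)   # no CR or LF residue on any field
```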

Just my one cent.

There are likely quite a few different working-environment scenarios wrt Unix and Windows newline conventions. E.g., if you are working on a team with a mix of Unix and Windows environments, you're likely to want to standardize on Unix line endings; in a Windows shop, perhaps the opposite. Unix line endings are certainly preferable when data is published on the web, regardless of the platform used to create it. There may also be times when preserving the line-ending style of the input is desired. All these scenarios can be supported, but there's a cost to doing so.

I am going to go the route of using Docker & a Linux environment. I need docker for another class anyway - thanks for all the help!!

Made a pair of changes in PR #309 to help with this:

  • Changed the README documentation to recommend WSL and Docker to Windows users.
  • Made the change to rlimitCurrOpenFilesLimit discussed above to have it return 512 on Windows.

The latter does not fully update tsv-split for Windows, which would require changing help, error messages, etc. But it should enable it to build and run. Feedback on experience folks have with this is appreciated.

The change to rlimitCurrOpenFilesLimit will be available to dub builds in the next tagged release. (dub uses the most recent tagged release, not master.) I will close this issue at the next tagged release.

(Note that dub will run against master if run from a repository directory. So, prior to the tagged release, use: git clone ...; cd tsv-utils; dub run)

I've opened a separate issue, #310, for newline handling on Windows.

Now have a Windows build as part of CI. No tests: the test suite won't pass, due both to missing test suite tooling on Windows and to Windows newline consistency issues. But compile and link occur successfully and will happen as part of pull request testing. See PR #313.

Windows build status is now tracked in issue #317.

The standard dub build should work now that the fix is part of a tagged release (v2.1.2).