Issue with installing on Windows 10 using D / build failure
Thanks for the report, and for trying to build tsv-utils on Windows. I don't have the resources to test on Windows, so it's helpful when people try it and file reports. This is why Unix and Mac OS are the primary platforms. I fix Windows issues when I can do so easily. In this case, tsv-split uses some Unix features, so it won't build on Windows. I have no plans to fix this at present. I may be able to remove it from the build on Windows; I'll look into this.
My recommendations for what you can do now:
- Try using Windows Subsystem for Linux (WSL). This is Microsoft's compatibility layer for running Linux command-line tools. I'm not aware of anyone trying to run tsv-utils on it, but it should just work, including with the pre-built Linux binaries. The tools will be running in an environment that matches what they've been tested against.
- Try removing tsv-split from the build in your own copy of the code. The place to do this is in dub_build.d, line 59.
- Try using an older version of the library, prior to tsv-split being added. tsv-split was added in release v1.6.0, so you could try release v1.5.0.

If you take any of these steps, please report how it works out.
Hi Anya,
Thanks for the report. Could you repost the images in your last message? They didn't get attached correctly, and I'd like to see the details.
--Jon
Hi Jon, please let me know if you are able to open these. Thank you
No. Are you replying via email? I don't know if that will attach images. They show up as text blocks like [image.png]. It may be necessary to go to the GitHub web interface and upload them there.
@AnnaBanana90210 - Thanks. This is very helpful. There are clearly a couple of different build issues.
I just built this on Windows 10 with minimal issues, using 2.1.1 from dub.
The tsv-split issue was relatively easy to solve:
- Go into tsv-utils\tsv-split\src\tsv_utils\tsv-split.d.
- Go down to line 669, the rlimitCurrOpenFilesLimit() function.
- Blow away the entire function body and replace it with just return uint.max:

uint rlimitCurrOpenFilesLimit()
{
    return uint.max;
}
It's not technically the correct change (I don't know any D, I just installed the compiler to build these tools, so I'm not sure how to get the correct file limit with D on Windows), but it gets the job done well enough. 512 is probably a safer value on Windows.
Other than that I had no issues with a Windows build via dub, and the tools all seem to be working just fine.
It's an issue with line endings. Debugging now, confidence is low due to lack of D skills, will post back.
I also had issues with using this line:
tsv-summarize --header --group-by 4 --count < test.tsv | keep-header -- sort -r -t ' ' -k2 -n | number-lines --header --s "rank" | tsv-pretty -u | keep-header -- grep "United States"
I don't really see the point of using these tools on Windows... time to install Ubuntu.
@JC3 - Hey, thanks for looking into this.
The tsv-split issue was relatively easy to solve:
- Go into tsv-utils\tsv-split\src\tsv_utils\tsv-split.d.
- Go down to line 669, the rlimitCurrOpenFilesLimit() function.
- Blow away the entire function body and replace it with just return uint.max:
...
It's not technically the correct change (I don't know any D, I just installed the compiler to build these tools, so I'm not sure how to get the correct file limit with D on Windows), but it gets the job done well enough. 512 is probably a safer value on Windows.
I took a look at this also, and agreed, it's not that hard to fix. It's essentially what you did, except using a D version statement to do platform-specific compilation. The simple form would be something like:
uint rlimitCurrOpenFilesLimit()
{
    version(Posix)
    {
        ... the current code ...
    }
    else
    {
        return 512;
    }
}
A bit better would be to generalize the API so it isn't named rlimit, which is a Posix concept. But it would still use platform-specific compilation. The 512 value you listed seems to be the correct value on Windows, because underneath it's the C APIs that are used. Better still would be _getmaxstdio. That doesn't have an exported API in the D libraries at the moment, but the docs say 512 is the default.
Have you had any luck with using tsv-summarize? I get the following errors when trying to run a command (sorry for all the black-out)
Hard to follow with all the black-out, but @JC3 is most likely right. Handling of Windows CRLF line endings is untested and may well have issues in different tools. On Posix platforms there is a test to ensure Unix-style LF line endings.
What I recommend is running dos2unix or the equivalent on your files to convert the line endings to Unix style, and seeing if that fixes the issue you are having. Of course, you'll have to decide for yourself whether it's in your interest to convert files to Unix-style line endings.
By the way, what is likely happening is that the lines are ending with Windows CRLF, but line feeds are being found as LF, with the CR being left attached to the last field in the line. That's why the conversion to double is failing.
By the way, what is likely happening is that the lines are ending with Windows CRLF, but line feeds are being found as LF, with the CR being left attached to the last field in the line. That's why the conversion to double is failing.
This is exactly what appears to be happening. If I figure out a way to fix it I'll post here. Sort of flailing around in the dark trying to speed read the D API docs.
Output is fine; line endings are translated to CRLF. On input, the CR stays attached to the last field.
@AnnaBanana90210 In the meantime the dos2unix solution is working, the general strategy being to pipe the output of every tsv-summarize command through dos2unix, e.g.:
tsv-summarize ... | dos2unix | tsv-summarize ... | dos2unix
It probably applies to some (or all) of the other utils too (although tsv-select works fine as-is, since I guess it just passes lines through in their entirety). If you're using GnuWin32, it's in the CygUtils package.
I know that was vague. But it's late.
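For anyone wondering why the dos2unix step helps: the stray-CR problem is easy to reproduce with generic tools outside tsv-utils (illustrative commands only, not tsv-utils itself):

```shell
# A CRLF-terminated TSV file, split on tabs, leaves \r stuck to the last field.
printf 'name\tcount\r\nfoo\t12\r\n' > crlf.tsv

raw=$(tail -n 1 crlf.tsv | cut -f2)                  # "12" plus a trailing CR
clean=$(tail -n 1 crlf.tsv | tr -d '\r' | cut -f2)   # "12" after a dos2unix-style strip

[ "$raw" = "12" ] || echo "raw field still has the CR attached"
[ "$clean" = "12" ] && echo "stripped field parses cleanly"
```

That trailing CR is exactly what makes numeric conversion of the last field fail.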
In the meantime the dos2unix solution is working; with the general strategy being run the output of every tsv-summarize command in a pipe through dos2unix, e.g.: tsv-summarize ... | dos2unix | tsv-summarize ... | dos2unix
Hm... Nice that that works, but seems unacceptably inconvenient. It points out the necessity of having a comprehensive Windows test suite.
Yeah; the plot twist here is AnnaBanana90210, purely by coincidence, happens to be in the same class that the person I am currently helping is in. And I think their assignment is due tomorrow. 😂
Well, if it's for a class assignment, I'd suggest trying either the Windows Subsystem for Linux (WSL) or Docker for Windows and simply running in a Unix environment. If these are options, it would greatly simplify things.
By the way, what is likely happening is that the lines are ending with Windows CRLF, but line feeds are being found as LF, with the CR being left attached to the last field in the line. That's why the conversion to double is failing.
Afaict, the root of the issue is at utils.d line 941:
/* Read the next block. */
_dataEnd += _file.rawRead(_buffer[_dataEnd .. _dataEnd + readSize]).length;
The custom buffered line reader uses File.rawRead which, according to the docs, "always reads in binary mode on Windows", thus bypassing translation of CRLF to LF on input (and, at utils.d:909, multi-character line endings aren't supported). How this is accomplished, I have no idea, as fread will perform the translations on text-mode streams (and may return less data than requested as a consequence). Perhaps the D API sets stdin to binary mode (it opens in text mode by default on Windows but is changeable)? Or something.
As an aside, and of equal concern, I also noticed that named files are opened in binary mode ("rb") instead of text mode as well (line 2018 for this buffered reader, and line 1654 for the InputSource utility).
I'm not exactly sure what the best solution is, because while I could hack CRLF handling into the buffered reader, what really should be happening is the input streams, stdin included, being opened in text mode (which can happen on all platforms, btw; the "b" and "t" mode specifiers are ignored by fopen on Linux and Mac).
The other reason I'm not sure what the best solution is, is that I can't wrap my head around what's happening under the hood: rawRead is documented as being equivalent to fread, but this is an inconsistency in the documentation, as fread does not have the ability to "always read in binary mode under Windows"; the file mode is file state, not per-read state.
There's also the possibility to just implement a workaround on Windows and make the buffered reading thing do some line ending translation on its own, I guess.
I tried throwing a _file.reopen(null, "rt") in there right after _file is assigned in the constructor, but it fails with "invalid file descriptor" and I didn't look into it any further.
Afaict, the root of the issue is at utils.d line 941:
This is nice detective work, thanks. The issue though isn't in the rawRead call; it's a few lines earlier in the file, in the front routine. In particular, line 886:
immutable end = (_buffer[_lineEnd - 1] == terminator) ? _lineEnd - 1 : _lineEnd;
This line strips the newline, if the caller asked to have newlines stripped. It could do a Windows platform check and look for a CRLF rather than just an LF. The rawRead invocation isn't the right place to do it, as it reads a block (so multiple lines) rather than reading line-by-line.
That might handle a fair number of cases. I expect there will be others though, as there are multiple I/O mechanisms used by the different tools. It's also not obvious to me whether the tools should be writing CRLFs, as tsv-summarize does, even on a Windows platform. The issue is that if data files are being shared with others, it may well be better to write them with Unix newlines.
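A minimal sketch of that newline-strip idea (illustrative names only; the real utils.d code differs). Rather than a Windows-only check, this version drops a CR preceding the LF terminator unconditionally, which also handles CRLF data encountered on Unix:

```d
// Sketch only: a CRLF-tolerant variant of the newline strip discussed above.
// `buffer` holds the line's bytes; `lineEnd` is one past the last byte.
size_t lineContentEnd(const(char)[] buffer, size_t lineEnd, char terminator = '\n')
{
    // Original logic: drop the terminator if present.
    size_t end = (lineEnd > 0 && buffer[lineEnd - 1] == terminator)
        ? lineEnd - 1 : lineEnd;

    // Additional step: if the terminator was dropped, also drop a
    // preceding CR, so CRLF lines don't leave a stray CR on the last field.
    if (end < lineEnd && end > 0 && buffer[end - 1] == '\r')
        end--;

    return end;
}
```

With buffer "abc\r\n" and lineEnd 5 this returns 3, so the CR no longer ends up attached to the last field; "abc\n" still returns 3, and a final line with no newline is left untouched.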
PS
The issue is that if data files are being shared with others, it may well be better to write them with Unix newlines.
My preferred approach here, and the approach I see most often in the wild, is to have tools consistently use the current platform's line endings on output, and, if possible, recognize any line-ending style on input. There's no real reason to enforce one style of line endings on input; at most you could warn if a given source has internally inconsistent line endings as a sanity check, but other than that... 🤷. This avoids surprises and interoperability issues between programs on the same platform, and also often covers shared data implicitly. I.e., it usually just... works; on all platforms.
Then users who do share data can just use appropriate conversion tools on shared data if the need arises, which they're probably already accustomed to anyways, instead of having to second guess how their tools are treating things.
Note also, FWIW, that the IANA flavor of TSV data (or at least of text/tab-separated-values encodings) does not explicitly place constraints on which line-ending style must be used. So regardless, for the program to conform to that recommendation, it would need to recognize all styles independently of the current platform. For example, the current logic of explicitly rejecting Windows-style line endings on Unix systems means Unix builds can reject TSV data that is technically valid according to the MIME recommendations. This might actually be the most compelling argument for making the input stream handling more robust here, if for no other reason (even excluding talk of Windows).
Just my one cent.
There are likely quite a few different working-environment scenarios wrt Unix and Windows newline conventions. E.g. if you are working on a team of people with a mix of Unix and Windows environments, you're likely to want to standardize on Unix line endings. In a Windows shop, perhaps the opposite. Unix line endings are certainly preferable when data is published on the web, regardless of the platform used to create it. There may be times when preserving the line-ending style of the input is desired. All these scenarios can be supported, but there's a cost to doing so.
I am going to go the route of using Docker & a Linux environment. I need docker for another class anyway - thanks for all the help!!
Made a pair of changes in PR #309 to help with this:
- Changed the README documentation to recommend WSL and Docker to Windows users.
- Made the change to rlimitCurrOpenFilesLimit discussed above to have it return 512 on Windows.

The latter does not fully update tsv-split for Windows, which would require changing help text, error messages, etc. But it should enable it to build and run. Feedback on folks' experience with this is appreciated.
The change to rlimitCurrOpenFilesLimit will be available to dub builds in the next tagged release. (dub uses the most recent tagged release, not master.) I will close this issue at the next tagged release.
(Note that dub will run against master if run from a repository directory. So, prior to the tagged release, use: git clone ...; cd tsv-utils; dub run)
I've opened a separate issue, #310, for newline handling on Windows.
Now have a Windows build as part of CI. No tests. Test suite won't pass due to both missing test suite tooling on Windows and Window newline consistency issues. But compile and link occurs successfully and will happen as part of pull request testing. See PR #313.
Windows build status is now tracked in issue #317.
The standard dub build should work now that the fix is part of a tagged release (v2.1.2).