microsoft/WSL

`stack ghc` painfully slow

ezrosent opened this issue · 46 comments

  • A brief description
    Managing Haskell projects with the stack tool is unusable due to how slow it is.

  • Expected results
    (from a laptop running Ubuntu 16.04)

time stack ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 8.0.1

real    0m0.124s
user    0m0.092s
sys     0m0.036s
  • Actual results (with terminal output if applicable)
    On a desktop running WSL
time stack ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 8.0.1

real    0m50.520s
user    0m0.172s
sys     1m40.547s
  • Your Windows build number
    15025
  • Steps / All commands required to reproduce the error from a brand new installation
    After installation, stack needs to pull in a version of GHC. The following should do the trick.
stack setup
stack upgrade --install-ghc
time stack ghc -- --version
  • Strace of the failing command
    The strace output (attached) includes a few long (multi-second) waits on FUTEX_WAIT, as well as one on mmap.
  • Required packages and commands to install
    Install stack with the standard instructions

stack_ghc_strace.txt

This is on our backlog but is unlikely to make the Creators Update. I know we're planning on looking at this soon though.

For some context, I've looked at what causes this slowdown. For some reason stack has mapped a mind-bogglingly huge region of memory (I'm talking dozens of terabytes). When we fork, we walk the entire address range to set up the new process's state. We have a design that should vastly speed this up, but we're approaching the "pencils down" date for Creators Update.
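For anyone who wants to see this effect in isolation, here is a rough C sketch (illustrative only, not test code from either project) that reserves a terabyte of uncommitted address space the way the GHC runtime does and then times a fork(). On native Linux the fork stays fast; a kernel that walks the whole reserved range at fork time will show the slowdown.

/* fork_after_reserve.c -- time fork() in a process holding a huge reservation */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    /* Mimic the GHC runtime: reserve ~1 TB of uncommitted address space. */
    void *p = mmap(NULL, 1ULL << 40, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t child = fork();               /* the kernel copies address-space state here */
    if (child < 0) { perror("fork"); return 1; }
    if (child == 0) _exit(0);
    waitpid(child, NULL, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("fork+wait took %.3f ms with 1 TB reserved\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    return 0;
}

Compile with cc fork_after_reserve.c -o fork_after_reserve and compare the reported time on WSL against a Linux VM.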

Gotcha, thanks for the context!

Terabytes. That's awesome. Can't wait to see it in Resource Monitor.

I assume they're doing it to manage their own heap. It's a big "MAP_NORESERVE" region which Linux seems to intelligently handle since "allocate all the things" seems to be a common paradigm.

This seems to be the related discussion over at ghc ticket 9706 here, for what it is worth. Quoth:

BTW, I found that I could mmap 100 TB with PROT_NONE (or even PROT_READ) and MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED with no measurable delay, and I can start 10000 such processes at once, so there doesn't seem to be any significant cost to setting up the page table mappings (at least on my system). Not sure why that is, exactly. The VSZ column in ps looks quite funny of course :)

So by my math that's 100 × 10^12 × 10^4 = 10^18 ≅ 2^60, which gets you in just under the wire. Or something.
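The cheapness of such reservations is easy to check. Here is a minimal C sketch (assuming ordinary Linux mmap semantics, nothing specific to GHC) that times a single 1 TB PROT_NONE reservation:

/* reserve_demo.c -- time one huge PROT_NONE reservation (illustrative only) */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void) {
    size_t len = 1ULL << 40;            /* 1 TB of address space, nothing committed */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("reserved %zu bytes at %p in %.1f microseconds\n", len, p,
           (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3);
    munmap(p, len);
    return 0;
}

On Linux this typically reports a handful of microseconds, since no pages are committed; only the bookkeeping for the range is created.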

Adding @stehufntdev because he's been looking into this as well.

I have encountered a similar bug with plain ghc and pandoc: really slow just to call and print out version info. I can confirm this for the slow-ring Insider Build and the Windows Preview v.15.15014 (using the free VM).

I found this slowdown by installing ghc v.8.0.2 (anything v.7.10 and below was fast) or by installing pandoc v.1.18 & above. See directions to install ghc or install pandoc for testing. If needed, I can provide a simple set of commands to reproduce.

They both run similarly slow/delayed for me on both systems, but I have not seen reports of similar slowdowns from other *nix users, so I am guessing this is WSL-related.

This does not require stack to reproduce; the GHC compiler alone is enough. I'm experiencing the same dreadfully slow compilation. Besides, programs compiled with 8.0.x are slow too.

  1. wget http://downloads.haskell.org/~ghc/8.0.2/ghc-8.0.2-x86_64-deb8-linux.tar.xz
  2. tar -xJf ghc-8.0.2-x86_64-deb8-linux.tar.xz
  3. cd ghc-8.0.2
  4. ./configure --prefix=/tmp/ghc
  5. make install
  6. time /tmp/ghc/bin/ghc -e 'putStrLn ""'

@benhillis: I believe you've bumped into GHC 8.0's new block-structured heap for 64-bit platforms. From the GHC 8.0.1 release notes:

We have a shiny new two-step memory allocator for 64-bit platforms (see Trac #9706). In addition to simplifying the runtime system's implementation this may significantly improve garbage collector performance. Note, however, that Haskell processes will have an apparent virtual memory footprint of a terabyte or so. Don't worry though, most of this amount is merely mapped but uncommitted address space which is not backed by physical memory.
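For context, the "two-step" scheme the release notes describe boils down to reserve-then-commit: grab one huge PROT_NONE region up front, then make small pieces of it usable on demand. A minimal sketch of that pattern (my own illustration, not GHC's actual runtime code):

/* two_step_demo.c -- sketch of reserve-then-commit allocation (not GHC's code) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define RESERVE_SIZE (1ULL << 40)   /* reserve 1 TB of address space up front */
#define BLOCK_SIZE   (1ULL << 20)   /* commit 1 MB blocks as they are needed */

int main(void) {
    /* Step 1: reserve. No physical memory is committed yet. */
    char *heap = mmap(NULL, RESERVE_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (heap == MAP_FAILED) { perror("mmap"); return 1; }

    /* Step 2: commit. Flip protections on a block when the allocator needs it. */
    if (mprotect(heap, BLOCK_SIZE, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }
    memset(heap, 0, BLOCK_SIZE);    /* backed by real pages once touched */
    printf("committed %llu bytes out of a %llu-byte reservation\n",
           BLOCK_SIZE, RESERVE_SIZE);

    munmap(heap, RESERVE_SIZE);
    return 0;
}

The up-front reservation is what shows up as the terabyte-scale VIRT figure; only the blocks that have been committed and touched are ever backed by physical memory.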

@RyanGlScott - I suspect you are right. We need to modify the way our memory manager keeps track of uncommitted pages.

I'd be very curious to see some performance measurements on how much better their allocator performs versus raw mmap / munmap calls.

I was going to express that same curiosity, but stuck with "awesome" instead. So you benchmark 8.0 and find out it is some percent faster than 7.0. Or just as fast, but simpler. But you end up demonstrating not much in the exercise. The Haskell guys seem okay with a hello-world app asking for a terabyte of virtual memory. The Chakra guys seem okay with asking for 32GB to print hello, and if you are going to do that, [expletive], why not ask for a TB. I am still academically interested in how they arrived at 32GB. Why not 64GB or 128GB? Certainly not because "that would be crazy".

It's working code. Smart people thought it was a good idea. Shrug. What you gonna do except sigh and re-work the memory manager.

FWIW, Golang also does something similar by reserving a contiguous chunk of 512 GB of memory (see this comment).

I'm certainly not qualified enough to say how they came up with that number, other than that it's a power of two andβ€”to use their wordsβ€”"512 GB (MaxMem) should be big enough for now".

I have a workaround in the meantime, based on the discussion in https://ghc.haskell.org/trac/ghc/ticket/13304. It involves compiling your own GHC without the large address space allocation, and then supplying that GHC as the common GHC for your projects and stack builds. In my example, I recompile GHC 8.0.2 using whatever ghc you already have on your system. I also make sure that Cabal is installed using this GHC -- otherwise, installing other packages will run into the same slowness. I suggest cleaning your ~/.stack and other stack directories to make sure you don't have any GHC lying around with the large-allocation functionality.

To fix, in the bash environment, I ran

# install necessary prereqs if not there
sudo apt-get install ghc happy alex
cd
git clone -b ghc-8.0.2-release --recursive git://git.haskell.org/ghc.git ghc-8.0.2
cd ghc-8.0.2
./boot
./configure --disable-large-address-space #can set --prefix=... here
make -j8 #-j(number-of-threads)
sudo make install
sudo ln -s /usr/local/bin/ghc ~/.local/bin/ghc-8.0.2 #or wherever your prefix put the binaries
# link the rest of the binaries, like runghc, ghci, etc
# this is to make sure the "system-ghc" is properly called
echo "system-ghc: true" >> ~/.stack/config.yaml
cd
# optional Cabal and cabal-install reinstallation to conform to new ghc
stack install Cabal
stack install cabal-install

Now you can do your stack install and stack build in your projects, using the specially compiled GHC.

You can monitor the VIRT usage with something like top or htop. Try stack exec ghci and monitor VIRT before and after.

Don't you also have to recompile stack for this?

@sgraf812 In my use cases, I have not had to recompile stack. If I understand correctly, stack itself never builds anything, just calls the appropriate ghc to do so, through the project-level or system-level ghc (ghci, ghc-through-cabal, etc). This issue only appears during builds, so as long as the ghc that stack uses is fine, stack itself should be fine. Monitoring the path of the ghc binary using htop during a build step might help diagnose what ghc is being used if you still see the 1TB VIRT allocs.

@pechersky This issue affects not only GHC 8, but also anything compiled with it (stack, pandoc, etc). The official binaries provided by stack developers happen to run fine because the latest release version is built with lts-6.25 and uses ghc-7.10.3.

@TerrorJack Thank you for clarifying that for me. My workaround fixes the "stack ghc is slow" issue, as well as the @sukhmel MWE. I did rebuild Cabal in my workflow. Regarding pandoc, I would defer to the example in their docs at http://pandoc.org/installing.html#quick-stack-method. AFAIK stack just delegates builds (of pandoc, etc.) to ghc, so as long as those are installed via stack install after supplying the fixed GHC, I think you should be fine. You could also rebuild stack from source.

@pechersky Also, the stack install Cabal step is not necessary. I'm working with GHC HEAD, and directly installed ghc to ~/.stack/programs/.... (using the --prefix= flag); compiling Haskell projects using stack then works out of the box. I guess regular GHC releases will work the same.

@TerrorJack The stack install Cabal outside of a project was in case someone wanted to use stack solver, which falls back to Cabal to inspect the .cabal file, calculate the build plan, and so on.

@pechersky stack solver uses cabal-install (by invoking it and parsing the output). So in fact we need stack install cabal-install (or to install cabal-install by some other means).

I have updated the code above to include your suggestion, @TerrorJack. According to https://docs.haskellstack.org/en/stable/faq/#what-is-the-relationship-between-stack-and-cabal, both the lib (... Cabal) and the executable (... cabal-install) are used. To be on the safe side, one could (re)install both.

@pechersky, please correct me if I'm wrong, but I had to apt install ghc happy alex before following your instructions. That's because compiling ghc-8.0.2 requires a ghc compiler to already be installed, which I hadn't fully appreciated. Should WSL users looking for a workaround then also install these Ubuntu packages?

@majorgreys ghc before 8.0.1 is not affected by this issue.

... and the GHC versions in the official apt repository are quite 'old' (< 7.10), IIRC.

@majorgreys Yes, you have to install those, as you would need ghc anyway. I assumed people had them installed since they would have been necessary. I'll edit the code to include them.

@stehufntdev has checked in some major improvements, see this comment for more details.

For the stack ghc -- --version command, the runtime improved from 116 seconds to 13 seconds on my test vm. Similar to nvm, we understand where the remaining time is being spent and are tracking additional work to bring this down closer to native Linux speeds.

Thanks, @stehufntdev, @benhillis, for tracking this issue. It's the only blocker for me doing Haskell development on WSL instead of a VM.

I'm with @AaronFriel, I'm doing my Haskell development in a Vagrant VM all the time. With this I'd switch completely to WSL and would be very happy 😄

@stehufntdev, @benhillis: I understand recent changes have significantly improved the runtime of simple programs compiled with GHC or Go, and perhaps some of those changes are already in flight. Is the fixinbound label meant to suggest that we should expect performance parity with a native Linux kernel?

@AaronFriel the fix inbound label is for the improvement mentioned above (i.e. 116 to 13 seconds). Should be much less painful :), and we are tracking the work to get this closer to native Linux.

Thank you! I look forward to seeing those improvements land.

Does anyone have a pre-compiled --disable-large-address-space GHC they are willing to upload/host? :) Building on my machine has been taking a while (almost a day now), and I was wondering if anyone would be willing to upload a version they were already able to compile! :)

@mstksg I don't know if that is an option for you, but I switched to Docker to compile my Haskell projects for now. It's not ideal (it involves a lot of copying between the host and the VM) but at least the compile time is more reasonable.

Performance is much better in 16215 for what it's worth. But probably still not where you want to be if you compile Haskell all day long.

@pierrebeaucamp - Or non-desktop Ubuntu on VirtualBox with vboxsf, if you find all the copying a pain. vboxsf is the spiritual equivalent of DrvFS (/mnt/c) on WSL. Or automate the copying with rsync, which has been standard operating procedure forever.

@mstksg - I'd put up a PPA for you but I don't 'do' Haskell and would be the wrong point person. That's probably a good interim solution here though. If no one steps up in a couple of weeks, ping the thread and maybe I'll do it. But first ask over in the Haskell community. It isn't strictly speaking a WSL Thing. The PPA is "GHC without large address space" (which would run on Real Linux too), not "GHC for WSL". I bet a runtime option is even possible with some refactoring and deep thought, though only academically, because no one would bother with the effort. WSL will catch up eventually.

pnf commented

At this point, it probably makes sense to do all aspects of Haskell development within a linux VM.

kuon commented

I think this is the same issue, but I tried running elm-format (which is written in Haskell) and it hangs on mmap(NULL, 8392704, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f884ba20000 for a minute before continuing.

On the latest stable non-insider build as of 31/07/2017, I noticed untars and configure scripts being extremely slow. Looking at the Task Manager, the Antimalware Service Executable seems to be the culprit, as though file writes are being inspected through that service. Both untars and configures do lots of small file writes.

Antimalware only takes ~5% of the CPU, so maybe the slow part is the transfer of contents from WSL to that process. But I don't really know what I'm talking about 😅.

So I tried the insider preview 16251 and immediately noticed a huge difference 😍. Don't have any numbers from before to compare to, unfortunately, but it feels twice as fast. Though still a lot slower than virtualized Linux.

Here are times from WSL compared to Alpine in VirtualBox on the same machine, within a tmpfs for reads and writes:

./configure [WSL]

real    2m32.522s
user    0m12.547s
sys     1m58.063s

./configure [WSL w/ rootfs excluded in Defender]

real    2m30.279s
user    0m13.813s
sys     1m56.328s

./configure [VirtualBox]

real    0m 23.22s
user    0m 8.22s
sys     0m 1.76s

tar xf [WSL]

real    0m18.166s
user    0m0.531s
sys     0m4.047s

tar xf [VirtualBox]

real    0m 0.84s
user    0m 0.60s
sys     0m 0.80s

Sorry I don't have more to provide. I hope you will keep on improving WSL; that's excellent work on a fantastic concept.

@Roman2K - Thanks for the information. I'm glad that it's much more usable for you, but we still do have a long way to go. We're looking into ways to improve base NTFS speed to help bring Windows filesystem performance more in line with Linux.

I have anecdotally found the same problem with tar, which is (I think) separate from the huge memory allocation slowness. When I untar a large tarball (let's say 10GB) in a Linux VM, it returns almost immediately, because I have 20GB of RAM assigned to the VM and it all ends up in cache at near memcpy() speed. With WSL it seems to rate-limit on writes to disk. I did not report it because I don't untar large files that often, and limiting on writes to disk is hard to prove these days without low-level instrumentation (ugh, effort). But from the blinkenlights it looks like that is what's happening. It doesn't seem to be a CPU-limited thing caused by, say, inefficient stat() calls per the git slowness complaints.

[edit] Another data point: sync never seems to do anything in WSL. With the same 10GB untar in a VM, sync takes countable time to flush the cache.

We have improved mmap performance further in insider build 17063. I believe this makes stack ghc bearable to use now :).

Thank you! I can attest to the significant improvement.

Another anecdote: I opted into 17074 hoping to get acceptable working conditions, but stack setup took exactly an hour to complete (even with Windows Defender temporarily disabled). For comparison, stack setup under cmd.exe starting from scratch took ~4 minutes. I'll give working with the result a shot, but it doesn't look promising.

Sorry for the negative report; keep doing great work, you'll get there. ❤️

hvr commented

In the hopes this may be useful to somebody here: I've set up a GHC PPA optimised for WSL (i.e. built with --disable-large-address-space) over at

https://launchpad.net/~hvr/+archive/ubuntu/ghc-wsl

It should merely be a matter of

sudo add-apt-repository ppa:hvr/ghc-wsl
sudo apt-get update
sudo apt-get install ghc-8.2.2-prof cabal-install-head

and then simply prepending /opt/ghc/bin/ to your $PATH env-var.

I would like to use VSCode on Windows + WSL + stack ghci, but due to this problem it is really slow.

I will check whether recompiling a custom ghc without the large address space allocation is any better. Thanks @pechersky !

nb: GHC is "still slow" (as it were) but this was deemed fixedininsiderbuilds back in July 2017, and finally made its way into the April Update.