haskell/zlib

Segfault or corruption when combined with tar

ndmitchell opened this issue · 15 comments

Given the Hoogle tarball from http://hackage.haskell.org/packages/hoogle.tar.gz saved locally, and the attached code, compiling with -threaded -with-rtsopts=-N4 on GHC 8.0.1 gives random data corruption. Usually it gives corrupt hashes followed by an invalid tar format error; sometimes it gives segfaults. I have been unable to easily reproduce with either tar or zlib in isolation - my best guess is that zlib is producing buffers which aren't properly pinned and get collected early, and then tar chokes on them.

I observe this on other tarballs as well, but not index.tar.gz from Hackage. Hoogle just happens to be a publicly available tarball that shows the issue.

Test.zip

In fact, I now suspect this is more down to tar, since if I add LBS.copy to the output of zlib the segfault remains, but if I apply rnf to the tarball entry bodies the segfault disappears.

Given an alternative parallel implementation, no amount of copying or deepseq'ing can remove the bug. However, that code uses more low-level (but seemingly safe) bits.

Bug.zip

Even more experimentation reveals that:

tarballReadFiles :: FilePath -> IO [(FilePath, LBS.ByteString)]
tarballReadFiles file = do
    x <- LBS.readFile file
    x <- return $ LBS.copy $ GZip.decompress x
    -- evaluate $ rnf x
    return $ f $ Tar.read x
    where
        f (Next e rest) | NormalFile body _ <- entryContent e =
            let v = (entryPath e, LBS.copy body)
            in rnf v `seq` v : f rest
        f (Next r rest) = f rest
        f Done = []
        f (Fail e) = error $ "tarballReadFiles on " ++ file ++ ", " ++ show e

This segfaults. If I uncomment the rnf of the output of decompress it works. That suggests that decompress deals poorly with being called from multiple threads. Is the real issue here that you can't call decompress on multiple threads because there is thread-local state? Given that the LBS.copy is in place I can't see how there are two simultaneous calls into zlib, but I may well be wrong.
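For what it's worth, here is a minimal sketch of that workaround (the decompressForced helper name is mine, not from the attached code), relying on the NFData instance for lazy ByteString: decompress and force every chunk on the current thread before any other thread can demand one.

import Control.DeepSeq (rnf)
import Control.Exception (evaluate)
import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy as LBS

-- Hypothetical helper: force all decompressed chunks up front, so no lazy
-- zlib thunks are left to be evaluated concurrently by consumer threads.
decompressForced :: LBS.ByteString -> IO LBS.ByteString
decompressForced compressed = do
    let decompressed = GZip.decompress compressed
    evaluate $ rnf decompressed   -- same effect as the commented-out 'evaluate $ rnf x'
    return decompressed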

I should say, all failures are on Windows.

Below is a simpler test case which only depends on zlib. Using 64bit GHC 8.0.1 on Windows, with the command line ghc --make parallel.hs -rtsopts -with-rtsopts=-N4 -threaded && parallel and the 00-index.tar.gz file from Hackage I variously get segfaults (90% of the time), or errors about invalid bit length repeats etc. All point at data corruption.

import Data.IORef
import Control.Exception
import Control.Monad
import Control.Concurrent
import qualified Data.ByteString.Lazy as LBS
import Codec.Compression.GZip as GZip

main = do
    putStrLn "Starting..."
    xs <- LBS.toChunks . GZip.decompress <$> LBS.readFile "index.tar.gz"
    sequenceP 2 [do evaluate x; threadDelay 1 | x <- xs]


-- Run the actions across n worker threads, each repeatedly pulling the next
-- action from a shared IORef; the main thread just waits 10 seconds.
sequenceP :: Int -> [IO ()] -> IO ()
sequenceP n xs = do
    s <- newIORef xs
    threads <- replicateM n $ forkIO $ forever $
        join $ atomicModifyIORef s $ \todo ->
            case todo of
                [] -> ([], return ())
                t:odo -> (odo, t)
    threadDelay $ 10 * 1000000

@ndmitchell Thanks for the analysis.

So here is what I suspect is happening. I think it is indeed that we have parallel threads evaluating the output chunks at the same time, and hence running the state-mutating zlib code concurrently. So why would this happen, or what would we normally expect to stop this? We would normally expect unsafePerformIO or unsafeInterleaveIO to use the noDuplicate combinator to add mutual exclusion on evaluating the thunks in parallel. The current implementation, however, does not use unsafePerformIO or unsafeInterleaveIO. It puts the zlib primitives in the strict ST monad and provides a non-lazy incremental interface to run zlib streams. The lazy ByteString wrapper is implemented using the lazy ST monad, using the strict ST operations via strictToLazyST.
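As a rough illustration of that structure (this is not zlib's actual code, just the shape of the pattern), here is a lazy-ST producer built from strict-ST steps via strictToLazyST, with nothing inserting a noDuplicate barrier around the suspensions:

import qualified Control.Monad.ST as Strict
import Control.Monad.ST.Lazy (ST, strictToLazyST)

-- Each cons cell of the result is a suspension that, when forced, runs one
-- strict-ST (state-mutating) step.  Nothing here adds mutual exclusion, so
-- two threads forcing the same suspension can both run that step at once.
unfoldLazyST :: Strict.ST s (Maybe a) -> ST s [a]
unfoldLazyST step = do
    mx <- strictToLazyST step
    case mx of
        Nothing -> return []
        Just x  -> (x :) <$> unfoldLazyST step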

So where did we cheat and do something unsafe and miss out the noDuplicate? Are all lazy ST computations really safe to evaluate concurrently? Is it only in lifting IO to ST that we need to use noDuplicate?
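For comparison, the analogous lazy producer written with unsafeInterleaveIO (again, not zlib's code) gets the barrier for free, since unsafeInterleaveIO uses noDuplicate as described above:

import System.IO.Unsafe (unsafeInterleaveIO)

-- unsafeInterleaveIO wraps the suspended action with noDuplicate, so two
-- threads forcing the same suspension get mutual exclusion rather than both
-- running the underlying IO step.
lazyChunksIO :: IO (Maybe a) -> IO [a]
lazyChunksIO next = go
  where
    go = unsafeInterleaveIO $ do
        mx <- next
        case mx of
            Nothing -> return []
            Just x  -> (x :) <$> go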

@ndmitchell since you've got everything set up to test, try this and see if it helps: in Codec/Compression/Zlib/Internal.hs, adjust these definitions by adding noDuplicate:

runStreamST strm zstate = strictToLazyST (noDuplicateST >> Stream.runStream strm zstate)
runStreamIO strm zstate = stToIO (noDuplicate >> Stream.runStream strm zstate)

noDuplicateST :: ST s ()
noDuplicateST = unsafeIOToST noDuplicate

Bingo! With the fix described it works. Given imports, scoping, and strict-vs-lazy / IO-vs-ST type clashes etc., I actually went with:

runStreamST strm zstate = strictToLazyST (Control.Monad.ST.Unsafe.unsafeIOToST noDuplicate >> Stream.runStream strm zstate)
runStreamIO strm zstate = noDuplicate >> stToIO (Stream.runStream strm zstate)

I have no real idea if the noDuplicate must be inside the stToIO or not, or even which branch I'm relying on, but as above it works for my particular test case.

See https://ghc.haskell.org/trac/ghc/ticket/11760#comment:6 - it's a known issue in GHC - but still worth fixing in zlib. Any chance of a release containing a fix in the near future?

hvr commented

I wonder if my lzma package is affected as well... :-/

I was planning to try the lzma package next anyway; it sounds like all lazy runST things might be vulnerable.

@hvr it follows the same pattern iirc, so yes. Note in fact that we don't need the noDuplicate in the IO version, just the ST version.

This bug seems to have been reported in GHC's bug tracker previously, see https://ghc.haskell.org/trac/ghc/ticket/9494. After fixing this issue, the test should be run as well, to check it really does cover both.

Now we just need a release.

Any plans for an imminent release? If not, I'll take alternative actions on my side.