Segfault or corruption when combined with tar
ndmitchell opened this issue · 15 comments
Given the Hoogle tarball from http://hackage.haskell.org/packages/hoogle.tar.gz saved locally, and the attached code, compiling with -threaded -with-rtsopts=-N4 on GHC 8.0.1 gives random data corruption. Usually it gives corrupt hashes followed by an invalid tar format error; sometimes it gives segfaults. I have been unable to reproduce easily with either tar or zlib in isolation. My best guess is that zlib is producing buffers which aren't sufficiently pinned, which are being collected early, and then tar chokes on them.
I observe this on other tarballs as well, but not index.tar.gz from Hackage. Hoogle just happens to be a publicly available tarball that shows the issue.
In fact, I'm suspecting this is more down to tar now, since if I add LBS.copy to the output of zlib the segfault remains, but if I add rnf body to the tarball body the segfault disappears.
Given an alternative parallel implementation, no amount of copying or deepseq'ing can remove the bug. However, that code uses more low-level (but seemingly safe) bits.
Even more experimentation reveals that:
tarballReadFiles :: FilePath -> IO [(FilePath, LBS.ByteString)]
tarballReadFiles file = do
    x <- LBS.readFile file
    x <- return $ LBS.copy $ GZip.decompress x
    -- evaluate $ rnf x
    return $ f $ Tar.read x
    where
        f (Next e rest) | NormalFile body _ <- entryContent e =
            -- force the path and a copied body before returning the pair
            let v = (entryPath e, LBS.copy body)
            in rnf v `seq` v : f rest
        f (Next _ rest) = f rest
        f Done = []
        f (Fail e) = error $ "tarballReadFiles on " ++ file ++ ", " ++ show e
This segfaults. If I uncomment the rnf of the output of decompress it works. That suggests that decompress deals poorly with being called from multiple threads. Is the real issue here that you can't call decompress on multiple threads because there is thread-local state? Given that the LBS.copy is in place I can't see how there are two simultaneous calls into zlib, but I may well be wrong.
I should say, all failures are on Windows.
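For completeness, a minimal standalone sketch of that rnf workaround (decompressStrictly is just an illustrative name, not part of zlib or tar):

import Control.DeepSeq (rnf)
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as LBS
import qualified Codec.Compression.GZip as GZip

-- Fully force every chunk of the decompressed stream on the calling
-- thread, so no lazy zlib thunks survive to be forced concurrently later.
decompressStrictly :: LBS.ByteString -> IO LBS.ByteString
decompressStrictly gz = do
    let bs = GZip.decompress gz
    evaluate (rnf bs)  -- rnf walks all chunks; evaluate sequences it in IO
    return bs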
Below is a simpler test case which only depends on zlib. Using 64-bit GHC 8.0.1 on Windows, with the command line ghc --make parallel.hs -rtsopts -with-rtsopts=-N4 -threaded && parallel and the 00-index.tar.gz file from Hackage, I variously get segfaults (90% of the time) or errors about invalid bit length repeats etc. All point to data corruption.
import Data.IORef
import Control.Exception
import Control.Monad
import Control.Concurrent
import qualified Data.ByteString.Lazy as LBS
import Codec.Compression.GZip as GZip

main = do
    putStrLn "Starting..."
    xs <- LBS.toChunks . GZip.decompress <$> LBS.readFile "index.tar.gz"
    sequenceP 2 [do evaluate x; threadDelay 1 | x <- xs]

-- run the given actions from n worker threads pulling off a shared list,
-- so neighbouring lazy chunks get forced concurrently; the final delay
-- is a crude substitute for joining the threads
sequenceP :: Int -> [IO ()] -> IO ()
sequenceP n xs = do
    s <- newIORef xs
    threads <- replicateM n $ forkIO $ forever $
        join $ atomicModifyIORef s $ \todo ->
            case todo of
                [] -> ([], return ())
                t:odo -> (odo, t)
    threadDelay $ 10 * 1000000
@ndmitchell Thanks for the analysis.
So here is what I suspect is happening. I think it is indeed that we have parallel threads evaluating the output chunks at the same time, and hence running the state-mutating zlib code concurrently. So why would this happen, or what would we normally expect to stop it? We would normally expect unsafePerformIO or unsafeInterleaveIO to use the noDuplicate combinator to add mutual exclusion on evaluating the thunks in parallel. The current implementation does not use unsafePerformIO or unsafeInterleaveIO, however. It puts the zlib primitives in the strict ST monad, and provides a non-lazy incremental interface to run zlib streams. The lazy ByteString wrapper is implemented using the lazy ST monad, calling the strict ST operations via strictToLazyST.
So where did we cheat, doing something unsafe and missing out the noDuplicate? Are all lazy ST computations really safe to evaluate concurrently? Is it only in lifting IO to ST that we need to use noDuplicate?
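To make the hazard concrete, here is a minimal sketch (illustrative names, not zlib's actual code) of the same shape: strict ST steps mutating shared state, exposed as a lazily produced list via strictToLazyST, with nothing inserting noDuplicate:

import Control.Monad.ST.Lazy (runST, strictToLazyST)
import qualified Data.STRef as Strict

-- Every list element is a thunk whose evaluation runs a strict ST step
-- against the shared STRef. Nothing here calls noDuplicate, so two OS
-- threads forcing the same thunk may both run the mutation.
counted :: [Int]
counted = runST $ do
    ref <- strictToLazyST (Strict.newSTRef (0 :: Int))
    let go = do
            n <- strictToLazyST $ do
                    Strict.modifySTRef' ref (+1)
                    Strict.readSTRef ref
            rest <- go
            return (n : rest)
    go

Single-threaded this is fine (take 3 counted is [1,2,3]); forced from several threads in a -threaded program, individual steps can run twice, which in zlib's case means re-running stream steps against the same mutable zlib state.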
@ndmitchell since you've got everything set up to test, try this and see if it helps: in Codec/Compression/Zlib/Internal.hs, adjust these defs by adding noDuplicate:
runStreamST strm zstate = strictToLazyST (noDuplicateST >> Stream.runStream strm zstate)
runStreamIO strm zstate = stToIO (noDuplicate >> Stream.runStream strm zstate)

noDuplicateST :: ST s ()
noDuplicateST = unsafeIOToST noDuplicate
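For background on why noDuplicate is the right combinator: as far as I recall, base itself builds unsafePerformIO out of the duplicable variant plus exactly this guard, roughly like the sketch below (unsafePerformIO' is an illustrative name, not base's actual export):

import GHC.IO (noDuplicate, unsafeDupablePerformIO)

-- Roughly how base guards unsafePerformIO: noDuplicate ensures the
-- enclosing thunk evaluation is not duplicated across threads before
-- the effects run.
unsafePerformIO' :: IO a -> a
unsafePerformIO' m = unsafeDupablePerformIO (noDuplicate >> m)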
Bingo! With the fix described it works. Given imports, scoping, strict vs lazy, IO vs ST type clashes etc, I actually went with:
runStreamST strm zstate = strictToLazyST (Control.Monad.ST.Unsafe.unsafeIOToST noDuplicate >> Stream.runStream strm zstate)
runStreamIO strm zstate = noDuplicate >> stToIO (Stream.runStream strm zstate)
I have no real idea if the noDuplicate must be inside the stToIO or not, or even which branch I'm relying on, but as above it works for my particular test case.
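For anyone applying the same patch by hand, here is a sketch of the imports those two definitions rely on (module locations as I recall them from base; Stream is zlib's own internal module and is already in scope there; double-check against the actual file):

import Control.Monad.ST (stToIO)
import Control.Monad.ST.Lazy (strictToLazyST)
import qualified Control.Monad.ST.Unsafe
import GHC.IO (noDuplicate)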
See https://ghc.haskell.org/trac/ghc/ticket/11760#comment:6 - it's a known issue in GHC - but still worth fixing in zlib. Any chance of a release containing a fix in the near future?
I wonder if my lzma package is affected as well... :-/
I was planning to try the lzma package next anyway; it sounds like all lazy runST things might be vulnerable.
@hvr it follows the same pattern iirc, so yes. Note in fact that we don't need the noDuplicate in the IO version, just the ST version.
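That is because the IO version hands back an IO action whose effects run only when the action is executed in IO, not when a pure thunk is forced, so there is no lazily shared thunk for two threads to race on. Concretely, the fix then amounts to something like this (a sketch; the released patch may differ in detail):

runStreamST strm zstate = strictToLazyST (noDuplicateST >> Stream.runStream strm zstate)
runStreamIO strm zstate = stToIO (Stream.runStream strm zstate)

noDuplicateST :: ST s ()
noDuplicateST = unsafeIOToST noDuplicate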
This bug seems to have been reported in GHC's bug tracker previously, see https://ghc.haskell.org/trac/ghc/ticket/9494. After fixing this issue, the test case there should be run as well, to check the fix really does cover both.
Now just need a release
Any plans for an imminent release? If not, I'll take alternative actions on my side.
@ndmitchell it's been released as http://hackage.haskell.org/package/zlib-0.6.1.2