Problem with string decoding
zudov opened this issue · 26 comments
Hello. I met a bug which drives me crazy.
The problem is that in very specific conditions code commonMarkToNode [] (T.pack "«")
fails with this error:
Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream.
I guess that the problem is with string marshaling. Probably byte length is messed up with codepoint length somewhere.
Here is a sniplet which allows to get the same error in pure haskell.
Prelude> :m +Data.Text
Prelude Data.Text> :m +Data.Text.Encoding
Prelude Data.Text Data.Text.Encoding> let s = pack "«"
Prelude Data.Text Data.Text.Encoding> decodeUtf8 $ Data.ByteString.take (Data.Text.length s) $ encodeUtf8 $ s
"*** Exception: Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
Prelude Data.Text Data.Text.Encoding>
It happens only if the string contains non-ascii symbols.
To reproduce a whole problem a very specific conditions must be met. It never occurs in interactive evaluation (only when called from binary). Unimporting some modules from binary can solve the problem. And so on. I am trying to say that it's really hard to reproduce, but if it's reproduced it stably appears on every execution. In case if the error-line above wouldn't give you insight on a problem, I can capture it and ship for experiments.
+++ Konstantin Zudov [Apr 06 15 07:35 ]:
I guess that the problem is with strings marshaling.
Here is a sniplet which allows to reproduce the error in pure haskell.
Prelude> :m +Data.Text Prelude Data.Text> :m +Data.Text.Encoding Prelude Data.Text Data.Text.Encoding> let s = pack "«" Prelude Data.Text Data.Text.Encoding> decodeUtf8 $ Data.ByteString.take (Data.Text.length s) $ encodeUtf8 $ s "*** Exception: Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
I can see why you get the exception here.
Prelude Data.Text.Encoding Data.Text> encodeUtf8 s
"\194\171"
The utf-8 encoding for s is two bytes, 194 and 171. And, decodeUtf8 works fine on these two bytes:
Prelude Data.Text.Encoding Data.Text> decodeUtf8 $ encodeUtf8 s
"\171"
But you're interposing Data.ByteString.take (Data.Text.length s)
. The length of s is 1 -- this is the length of the text in characters (unicode code points), not bytes!
So,
Prelude Data.Text.Encoding Data.Text> Data.ByteString.take (Data.Text.length s) $ encodeUtf8 s
"\194"
And of course decodeUtf8 will raise an error on this single byte, which isn't a valid UTF-8 sequence!
@jgm Yes, this incorrect snipplet was intentionally written to demonstrate assumption about the nature of the bug. But I couldn't figure out why the same exception occurs inside commonmarkToNode
Based on the error message Cannot decode byte '\xc2'
that is exactly what happens inside commonmarkToNode
, decodeUtf8
tries to decode '\194' instead of "\194\171".
If I had to guess, I'd guess the problem might be with this definition:
fromtext :: Text -> CString
fromtext t = io $! withCString (unpack t) return
In other places we use Data.Text.Foreign.withCStringLen
-- it's worth a try rewriting this function to use that. (I can't recall why I didn't in the first place, but it may be that withCStringLen
doesn't return a pointer to a null terminated string?)
@jgm Unfortunately the latest commit didn't help. So I tried to make a demo for you.
No matter how I tried, I couldn't reproduce it with a standalone piece of code, so I had to prepare a full repo with a travis build that demonstrates a failure.
Primarily, there are three files that you will be interested in:
- commonmark-parser/tests/Bug.hs -- demonstrates a failure
- commonmark-parser/tests/BugDisappear1.hs, and
- commonmark-parser/tests/BugDisappear2.hs -- demonstrate strange cases in which bug disappears
A travis install script builds all dependencies and installs git version of cmark-hs
. It then executes a demonstration script.
The three files that I mentioned above, are compiled as test-suites (standalone binaries).
Then, travis.sh
runs Bug.hs
's binary (bug
). This fails with "Cannot decode byte..." error message. However if the same file is interpreted with runhaskell
the commonmarkToNode
works OK and produces expected result.
Then travis.sh
demonstrates this small crazy things that scare the bug away. Just look into BugDisappear*.hs
files and lines 500-514 of a build log. Removing call to hspec
fixes the bug. Removing redundant import fixes the bug. It's crazy.
If you are using docker I can prepare a container for you.
One thing I notice in Inline.hs is that there is no explicit
exports list:
module Text.CommonMark.Inline where
will export everything. You might try just exporting the
definitions you need to use elsewhere. (this is good practice
anyway).
module Text.CommonMark.Inline (parseInlines, ...) where
This will give you more fine-grained control over what is
imported when you do
import Inline
Alternatively, you can use qualified imports. (It's good
practice to do both.)
@jgm Sure. I was going to do some cleanup around there before releasing, and it was included in the plan. The code is quite dirty now :(
Of course. Well, it will be interesting to see if you can
get more fine-grained data about what triggers the bug, by
explicitly restricting imports/exports.
+++ Konstantin Zudov [Apr 07 15 09:38 ]:
[1]@jgm Sure. I was going to do some cleanup around there, before
releasing. The code is quite dirty now :(—
Reply to this email directly or [2]view it on GitHub.References
@jgm Oh. I see now. Didn't consider it from this perspective. Will try to experiment.
@jgm Good news. (At least it made me happy).
I've managed to reproduce the bug with a much smaller example.
This basic code is working fine:
-- file: Fine.hs
module Main where
import Criterion.Main (defaultMain)
import qualified Data.Text.IO as T
import CMark (commonmarkToHtml)
main :: IO ()
main = do
file <- T.readFile "../benchinput.md"
putStrLn "1. Convert and print last"
print $ last $ show $ commonmarkToHtml [] file
zudov@x200 /tmp/ $ ghc Fine.hs && ./Fine
[1 of 1] Compiling Main ( Fine.hs, Fine.o )
Linking Fine ...
1. Convert and print last
'"'
However adding an empty criterion benchmark makes it fail
-- file: Bug.hs
module Main where
import Criterion.Main (defaultMain)
import qualified Data.Text.IO as T
import CMark (commonmarkToHtml)
main :: IO ()
main = do
file <- T.readFile "../benchinput.md"
putStrLn "1. Convert and print last"
print $ last $ show $ commonmarkToHtml [] file
putStrLn "2. Empty Benchmark"
defaultMain []
zudov@x200 /tmp/ $ ghc Bug.hs && ./Bug
[1 of 1] Compiling Main ( Bug.hs, Bug.o )
Linking Bug ...
1. Convert and print last
Bug: Cannot decode byte '\x3c': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
This time a mere presence of defaultMain
from criterion
triggers it.
In my case benchinput
is a file which is created by make bench
from cmark
, but it could be any file with unicode characters.
And again, running it in interpreted mode scares the bug away
zudov@x200 /tmp/ $ runhaskell Bug.hs
1. Convert and print last
'"'
2. Empty Benchmark
Well, now we at least have more or less reproducible testcase without any uncommon dependencies. It would be great if I would manage to replicate it in a module is only using base
, but it's already a way to go.
Excellent, this gives us something to work with!
Can you replace the line
file <- T.readFile "../benchinput.md"
with
let file = "ABC"
with some minimal string in place of ABC? That would help
narrow things down even more.
Interestingly, I can't reproduce the bug. I compiled Bug.hs (using ghc 7.10.1) and ran it, and got:
% ./Bug
1. Convert and print last
'"'
2. Empty Benchmark
The only change I made to your program was "../benchninput.md"
-> "benchinput.md"
(since I have it in a different directory).
Hmm. First, I dropped the file reading part:
module Main where
import Criterion.Main (defaultMain)
import Data.Text (pack)
import qualified Data.Text.IO as T
import CMark (commonmarkToHtml)
main :: IO ()
main = do
let file = pack "«"
putStrLn "1. Convert and print last"
print $ last $ show $ commonmarkToHtml [] file
putStrLn "2. Empty Benchmark"
defaultMain []
This still reproduces the bug.
I will now try to build it with ghc 7.10.1.
I've set up a travis build which builds Bug.hs
in multiple environments. .travis.yml
is pretty self-explanatory.
The build shows that bug disappears in GHC 7.8.4.
+++ Konstantin Zudov [Apr 09 15 22:22 ]:
The build shows that bug disappears in GHC 7.8.4.
Interesting! Any idea why?
https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_guide/release-7-8-4.html
I suppose there might also be relevant changes in base that are associated with this ghc release?
+++ Konstantin Zudov [Apr 10 15 02:41 ]:
[1]#9390 attracted my attention but I am not sure is that it.
—
Reply to this email directly or [2]view it on GitHub.References
Can we close this?