jgm/cmark-hs

Problem with string decoding

zudov opened this issue · 26 comments

zudov commented

Hello. I met a bug which drives me crazy.

The problem is that in very specific conditions code commonMarkToNode [] (T.pack "«") fails with this error:

Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream.

I guess that the problem is with string marshaling. Probably byte length is messed up with codepoint length somewhere.

Here is a sniplet which allows to get the same error in pure haskell.

Prelude> :m +Data.Text
Prelude Data.Text> :m +Data.Text.Encoding
Prelude Data.Text Data.Text.Encoding> let s = pack "«"
Prelude Data.Text Data.Text.Encoding> decodeUtf8 $ Data.ByteString.take (Data.Text.length s) $ encodeUtf8 $ s
"*** Exception: Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
Prelude Data.Text Data.Text.Encoding> 

It happens only if the string contains non-ascii symbols.

To reproduce a whole problem a very specific conditions must be met. It never occurs in interactive evaluation (only when called from binary). Unimporting some modules from binary can solve the problem. And so on. I am trying to say that it's really hard to reproduce, but if it's reproduced it stably appears on every execution. In case if the error-line above wouldn't give you insight on a problem, I can capture it and ship for experiments.

jgm commented

+++ Konstantin Zudov [Apr 06 15 07:35 ]:

I guess that the problem is with strings marshaling.

Here is a sniplet which allows to reproduce the error in pure haskell.

Prelude> :m +Data.Text
Prelude Data.Text> :m +Data.Text.Encoding
Prelude Data.Text Data.Text.Encoding> let s = pack "«"
Prelude Data.Text Data.Text.Encoding> decodeUtf8 $ Data.ByteString.take (Data.Text.length s) $ encodeUtf8 $ s
"*** Exception: Cannot decode byte '\xc2': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

I can see why you get the exception here.

Prelude Data.Text.Encoding Data.Text> encodeUtf8 s
"\194\171"

The utf-8 encoding for s is two bytes, 194 and 171. And, decodeUtf8 works fine on these two bytes:

Prelude Data.Text.Encoding Data.Text> decodeUtf8 $ encodeUtf8 s
"\171"

But you're interposing Data.ByteString.take (Data.Text.length s). The length of s is 1 -- this is the length of the text in characters (unicode code points), not bytes!

So,

Prelude Data.Text.Encoding Data.Text> Data.ByteString.take (Data.Text.length s) $ encodeUtf8 s
"\194"

And of course decodeUtf8 will raise an error on this single byte, which isn't a valid UTF-8 sequence!

zudov commented

@jgm Yes, this incorrect snipplet was intentionally written to demonstrate assumption about the nature of the bug. But I couldn't figure out why the same exception occurs inside commonmarkToNode

zudov commented

Based on the error message Cannot decode byte '\xc2' that is exactly what happens inside commonmarkToNode, decodeUtf8 tries to decode '\194' instead of "\194\171".

jgm commented
zudov commented

@jgm Thanks. I am going to bed now, but after I wake up I will prepare some code.

jgm commented

If I had to guess, I'd guess the problem might be with this definition:

fromtext :: Text -> CString
fromtext t = io $! withCString (unpack t) return

In other places we use Data.Text.Foreign.withCStringLen -- it's worth a try rewriting this function to use that. (I can't recall why I didn't in the first place, but it may be that withCStringLen doesn't return a pointer to a null terminated string?)

zudov commented

@jgm Unfortunately the latest commit didn't help. So I tried to make a demo for you.

No matter how I tried, I couldn't reproduce it with a standalone piece of code, so I had to prepare a full repo with a travis build that demonstrates a failure.

Primarily, there are three files that you will be interested in:

A travis install script builds all dependencies and installs git version of cmark-hs. It then executes a demonstration script.

The three files that I mentioned above, are compiled as test-suites (standalone binaries).

Then, travis.sh runs Bug.hs's binary (bug). This fails with "Cannot decode byte..." error message. However if the same file is interpreted with runhaskell the commonmarkToNode works OK and produces expected result.

Then travis.sh demonstrates this small crazy things that scare the bug away. Just look into BugDisappear*.hs files and lines 500-514 of a build log. Removing call to hspec fixes the bug. Removing redundant import fixes the bug. It's crazy.

zudov commented

If you are using docker I can prepare a container for you.

jgm commented

One thing I notice in Inline.hs is that there is no explicit
exports list:

module Text.CommonMark.Inline where

will export everything. You might try just exporting the
definitions you need to use elsewhere. (this is good practice
anyway).

module Text.CommonMark.Inline (parseInlines, ...) where

This will give you more fine-grained control over what is
imported when you do

import Inline

Alternatively, you can use qualified imports. (It's good
practice to do both.)

zudov commented

@jgm Sure. I was going to do some cleanup around there before releasing, and it was included in the plan. The code is quite dirty now :(

jgm commented

Of course. Well, it will be interesting to see if you can
get more fine-grained data about what triggers the bug, by
explicitly restricting imports/exports.

+++ Konstantin Zudov [Apr 07 15 09:38 ]:

[1]@jgm Sure. I was going to do some cleanup around there, before
releasing. The code is quite dirty now :(


Reply to this email directly or [2]view it on GitHub.

References

  1. https://github.com/jgm
  2. #1 (comment)
zudov commented

@jgm Oh. I see now. Didn't consider it from this perspective. Will try to experiment.

zudov commented

@jgm Good news. (At least it made me happy).

I've managed to reproduce the bug with a much smaller example.
This basic code is working fine:

-- file: Fine.hs
module Main where

import           Criterion.Main (defaultMain)
import qualified Data.Text.IO as T
import           CMark (commonmarkToHtml)

main :: IO ()
main = do
    file <- T.readFile "../benchinput.md"
    putStrLn "1. Convert and print last"
    print $ last $ show $ commonmarkToHtml [] file
zudov@x200 /tmp/ $ ghc Fine.hs && ./Fine 
[1 of 1] Compiling Main             ( Fine.hs, Fine.o )
Linking Fine ...
1. Convert and print last
'"'
zudov commented

However adding an empty criterion benchmark makes it fail

-- file: Bug.hs
module Main where

import           Criterion.Main (defaultMain)
import qualified Data.Text.IO as T

import CMark (commonmarkToHtml)

main :: IO ()
main = do
    file <- T.readFile "../benchinput.md"
    putStrLn "1. Convert and print last"
    print $ last $ show $ commonmarkToHtml [] file
    putStrLn "2. Empty Benchmark"
    defaultMain []
zudov@x200 /tmp/ $ ghc Bug.hs && ./Bug
[1 of 1] Compiling Main             ( Bug.hs, Bug.o )
Linking Bug ...
1. Convert and print last
Bug: Cannot decode byte '\x3c': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

This time a mere presence of defaultMain from criterion triggers it.

In my case benchinput is a file which is created by make bench from cmark, but it could be any file with unicode characters.

zudov commented

And again, running it in interpreted mode scares the bug away

zudov@x200 /tmp/ $ runhaskell Bug.hs 
1. Convert and print last
'"'
2. Empty Benchmark
zudov commented

Well, now we at least have more or less reproducible testcase without any uncommon dependencies. It would be great if I would manage to replicate it in a module is only using base, but it's already a way to go.

jgm commented

Excellent, this gives us something to work with!

Can you replace the line

file <- T.readFile "../benchinput.md"

with

let file = "ABC"

with some minimal string in place of ABC? That would help
narrow things down even more.

jgm commented

Interestingly, I can't reproduce the bug. I compiled Bug.hs (using ghc 7.10.1) and ran it, and got:

% ./Bug
1. Convert and print last
'"'
2. Empty Benchmark

The only change I made to your program was "../benchninput.md" -> "benchinput.md" (since I have it in a different directory).

zudov commented

Hmm. First, I dropped the file reading part:

module Main where

import           Criterion.Main (defaultMain)
import           Data.Text (pack)
import qualified Data.Text.IO as T

import CMark (commonmarkToHtml)

main :: IO ()
main = do
    let file = pack "«"
    putStrLn "1. Convert and print last"
    print $ last $ show $ commonmarkToHtml [] file
    putStrLn "2. Empty Benchmark"
    defaultMain []

This still reproduces the bug.

I will now try to build it with ghc 7.10.1.

zudov commented

I've set up a travis build which builds Bug.hs in multiple environments. .travis.yml is pretty self-explanatory.

The build shows that bug disappears in GHC 7.8.4.

jgm commented

+++ Konstantin Zudov [Apr 09 15 22:22 ]:

The build shows that bug disappears in GHC 7.8.4.

Interesting! Any idea why?
https://downloads.haskell.org/~ghc/7.8.4/docs/html/users_guide/release-7-8-4.html

zudov commented

#9390 attracted my attention but I am not sure is that it.

jgm commented

I suppose there might also be relevant changes in base that are associated with this ghc release?

+++ Konstantin Zudov [Apr 10 15 02:41 ]:

[1]#9390 attracted my attention but I am not sure is that it.


Reply to this email directly or [2]view it on GitHub.

References

  1. https://ghc.haskell.org/trac/ghc/ticket/9390
  2. #1 (comment)
zudov commented

@jgm It could be base. Hard to guess.

jgm commented

Can we close this?

zudov commented

@jgm It's up to you.

The issue is still present (although it's only with an oldish compiler version).

As there are no ideas at the moment, I think it would be OK to close the issue. If some additional information pops up in the future, we always can reopen it.