purescript-deprecated/purescript-strongcheck

Arbitrary Strings Not Properly Encodable

Closed this issue · 3 comments

Environment

  • PureScript 0.12
  • Pulp 12.2.0
  • purescript-strongcheck 4.1.1

Hi folks, I have ongoing work on string encoding to support cryptography in PureScript, and I am trying to isolate the cause of a problem I have encountered.

I currently suspect that the cause is actually the way QuickCheck generates strings via Arbitrary. That is, generating randomly ordered code units (PureScript Char) does not comply with how UTF-16 strings should be encoded. If we want to build strings by joining randomly generated pieces, those pieces should be the (potentially) multi-code-unit strings, aka code points.
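To make the failure mode concrete, here is a minimal sketch in TypeScript (chosen only because JavaScript strings, like PureScript strings on the JS backend, are sequences of UTF-16 code units). The helper names `isWellFormedUtf16` and `canUriEncode` are made up for illustration and are not part of strongcheck or quickcheck.

```typescript
// Returns true when every surrogate code unit in `s` belongs to a valid
// high+low surrogate pair, i.e. the string is well-formed UTF-16.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed directly by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate that was just validated
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// encodeURIComponent throws a URIError on lone surrogates, which makes it
// a convenient stand-in for "can this string be encoded at all?".
function canUriEncode(s: string): boolean {
  try {
    encodeURIComponent(s);
    return true;
  } catch (e) {
    return false;
  }
}

const paired = String.fromCharCode(0xd83d, 0xde00); // U+1F600 as a surrogate pair
const lone = String.fromCharCode(0xd83d);           // the high surrogate on its own

isWellFormedUtf16(paired); // true
isWellFormedUtf16(lone);   // false
canUriEncode(lone);        // false
```

A generator that picks code units uniformly at random will regularly produce strings like `lone`, which is consistent with the encoding failures described above.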

I think the best solution would be to redefine arbitrary for String so that it is generated from code points, not code units. None of those code points should fall in the surrogate range, U+D800 to U+DFFF.
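A rough sketch of what such a generator could look like, again in TypeScript rather than PureScript, and not the actual change proposed for the library. The names `randomScalarValue`, `encodeCodePoint`, and `randomWellFormedString` are hypothetical, and a uniform distribution over all non-surrogate code points is an assumption, not something this thread settles.

```typescript
// Draws one code point in U+0000..U+10FFFF, skipping the surrogate
// block U+D800..U+DFFF. `rand` must return a number in [0, 1).
function randomScalarValue(rand: () => number = Math.random): number {
  // 0x110000 code points in total, minus the 0x800 surrogates.
  const n = Math.floor(rand() * (0x110000 - 0x800));
  // Values at or above U+D800 are shifted past the surrogate block.
  return n < 0xd800 ? n : n + 0x800;
}

// Encodes a single non-surrogate code point as one or two UTF-16 code units.
function encodeCodePoint(cp: number): string {
  if (cp < 0x10000) return String.fromCharCode(cp);
  const v = cp - 0x10000;
  const high = 0xd800 + (v >> 10);   // top 10 bits
  const low = 0xdc00 + (v & 0x3ff);  // bottom 10 bits
  return String.fromCharCode(high, low);
}

// Builds a string of `count` random code points. The result is well-formed
// UTF-16 because surrogates only ever appear as complete pairs.
function randomWellFormedString(
  count: number,
  rand: () => number = Math.random
): string {
  let out = "";
  for (let i = 0; i < count; i++) {
    out += encodeCodePoint(randomScalarValue(rand));
  }
  return out;
}
```

Note that the resulting string's length in code units lies anywhere between `count` and `2 * count`, so any size parameter would have to be interpreted as a number of code points.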

I will be happy to write the PR over the weekend, should I find time and no objections are raised here. It would also let me verify my theory by using it with the existing text encoding work I've done.

Short explanation: current definition of arbitrary for string generates unpaired code units.

Thank you for reaching out and I'm certainly interested to see this! Does the same issue affect QuickCheck as well? They're quite similar.

The primary thing I'd watch out for is performance. Working with code points is much less efficient than working with code units, and while correctness is usually preferable (and especially so for a testing library!), if performance drops off a cliff we'll have to figure out how to work around that.

It does in fact also affect QuickCheck, and I filed an issue there as well.

I've been thinking about this a little today though...

I feel an argument could be made that QuickCheck -should- throw malformed strings at things for the sake of comprehensiveness, rather than only well-formed text.

Generating "clean, well-formed" strings is considerably more complicated, given the UTF-16 standard.

Perhaps a decision should be made about whether arbitrary strings should be well-formed, or simply a series of arbitrary code units.