purescript/purescript-strings

Data.String.CodePoints.uncons probably isn't constant-time

hdgarrood opened this issue · 4 comments

Here's the documentation for uncons:

-- | Returns a record with the first code point and the remaining code points
-- | of the string. Returns Nothing if the string is empty. Operates in
-- | constant space and time.
-- |
-- | ```purescript
-- | >>> uncons "𝐀𝐀 c 𝐀"
-- | Just { head: CodePoint 0x1D400, tail: "𝐀 c 𝐀" }
-- | >>> uncons ""
-- | Nothing
-- | ```
-- |
uncons :: String -> Maybe { head :: CodePoint, tail :: String }

I would have expected this to be O(n), because you need to copy almost the entirety of the argument string into the tail field of the result. We should probably verify this before changing it.

cc @michaelficarra, I'd appreciate any input you might have here.

@hdgarrood I don't work on any of the big JavaScript engines, but I'm pretty sure any reasonably-performant JavaScript engine will implement .slice(...) on strings without copying. Remember, strings are immutable in JavaScript. I don't think it's unreasonable to impose these complexity constraints on other backends either. If their language's native/convenient strings don't have this property (likely because they're mutable), they should back PureScript strings by a library implementation that does.

Oh of course, that’s good to hear. Thanks!

I just tried to verify this with the following program:

module Main where

import Prelude
import Data.Array as Array
import Data.Foldable (for_, foldMap)
import Data.Int as Int
import Data.Maybe (fromMaybe, maybe, Maybe(..))
import Data.String as String
import Data.Unfoldable (replicateA)
import Effect (Effect)
import Effect.Console (log, logShow)
import Effect.Random (randomInt)
import Performance.Minibench (bench)

main :: Effect Unit
main = do
  log "testing (should print 15):"
  logShow (digitSum "lol 1234 hahaha 5")

  let inputLengths = (map (_ * 100_000) (Array.range 1 10))
  for_ inputLengths \l -> do
    log ""
    log ("string of length " <> show l)
    s <- genRandomString l
    bench \_ -> digitSum s

sampler :: String
sampler = "abcdefg1234567890"

genRandomString :: Int -> Effect String
genRandomString resultLength =
  let
    len = String.length sampler
  in
    map (foldMap (maybe "" String.singleton) :: Array _ -> _)
      $ replicateA resultLength do
          ix <- randomInt 0 (len - 1)
          pure $ Array.index (String.toCodePointArray sampler) ix

-- | Sum all digits appearing in the input string.
digitSum :: String -> Int
digitSum = go 0
  where
  go acc s =
    case String.uncons s of
      Just { head, tail } ->
        let
          value =
            fromMaybe 0
              $ Int.fromString
              $ String.singleton head
        in
          go (value + acc) tail
      Nothing ->
        acc

Here are the results on Node v12.4.0:

testing (should print 15):
15

string of length 100000
mean   = 11.67 ms
stddev = 495.70 μs
min    = 11.37 ms
max    = 25.32 ms

string of length 200000
mean   = 23.62 ms
stddev = 438.12 μs
min    = 23.00 ms
max    = 31.32 ms

string of length 300000
mean   = 35.22 ms
stddev = 594.21 μs
min    = 34.35 ms
max    = 50.18 ms

string of length 400000
mean   = 53.78 ms
stddev = 3.28 ms
min    = 47.85 ms
max    = 61.54 ms

string of length 500000
mean   = 75.39 ms
stddev = 779.25 μs
min    = 73.91 ms
max    = 82.14 ms

string of length 600000
mean   = 86.23 ms
stddev = 5.49 ms
min    = 76.61 ms
max    = 108.24 ms

string of length 700000
mean   = 113.51 ms
stddev = 1.19 ms
min    = 110.47 ms
max    = 123.14 ms

string of length 800000
mean   = 114.60 ms
stddev = 7.41 ms
min    = 100.95 ms
max    = 149.26 ms

string of length 900000
mean   = 137.33 ms
stddev = 8.73 ms
min    = 121.62 ms
max    = 156.14 ms

string of length 1000000
mean   = 155.63 ms
stddev = 10.15 ms
min    = 137.15 ms
max    = 175.06 ms

which seems close enough to linear, which is what we'd expect with a constant-time uncons.

Here's a graph:
Screenshot from 2020-02-23 02-33-05

So I'm happy to close this.