Parsing and unparsing has super-linear runtime

Question

Parsing and unparsing has super-linear runtime

thurstond opened this issue 4 years ago · 6 comments

Parsing and unparsing have super-linear runtime, largely because OCaml list append and string concatenation are not optimized.

Parsing: parse_tilde quadratic (/cubic?) behavior

for n in 1000 2000 4000 8000 16000 32000 64000; do perl -e "print '~'; print 'a' x $n;" > /tmp/t; time ./parse_to_json.native /tmp/t > /dev/null; done

user    0m0.015s
user    0m0.034s
user    0m0.153s
user    0m0.830s
user    0m4.540s
user    0m27.661s
user    3m4.591s

(real/sys runtime omitted for brevity)

Notes:

This is likely because of ast.ml: | c::s' -> parse_tilde (acc @ [c]) s'. It's not obvious to me why it seems to be cubic (rather than quadratic) runtime though.
- An alternative would be to prepend to the list, and then reverse the result at the end.
The parse_to_json.native test application is from https://github.com/andromeda/pash/tree/main/compiler/parser
To avoid conflating the libdash runtime with the JSON serialization, I disabled the JSON serialization code (print_ast) in the test app

Parsing: to_assign cubic (?) behavior

for n in 1000 2000 4000 8000 16000 32000 64000; do perl -e "print 'f' x $n; print '=pay_respects;'" > /tmp/t1; time ./parse_to_json.native /tmp/t1 > /dev/null; done

user    0m0.011s
user    0m0.021s
user    0m0.098s
user    0m0.571s
user    0m3.749s
user    0m25.097s
user    2m52.969s

This also has an expensive list append operation: ast.ml | C c :: a -> to_assign (v @ [c]) a.

Unparsing: ^ string concatenation considered harmful

OCaml's "^" string operator is not optimized; concatenating n strings one at a time can take O(n^2) runtime (https://discuss.ocaml.org/t/whats-the-fastest-way-to-concatenate-strings/2616/7). This is arguably a compiler issue e.g., CPython optimizes for common cases (https://mail.python.org/pipermail/python-dev/2013-February/124031.html).

ast.ml's unparsing is essentially a series of "^" operations, hence everything is going to have a worst-case runtime that's super-linear.

Here's an example of a long pipeline, showing quadratic runtime for json_to_shell:

for n in 2000 4000 8000 16000 32000 64000 128000; do (echo -n "echo 'Hello '"; for f in `seq 1 $n`; do echo -n "| cat "; done; echo) > /tmp/s1; cat /tmp/s1 | ./parse_to_json.native > /tmp/j1; time ./json_to_shell.native /tmp/j1 | md5sum; done

user    0m0.035s
user    0m0.112s
user    0m0.508s
user    0m1.920s
user    0m13.909s
user    0m52.966s
user    4m26.274s

I also tried removing the Ast.to_string operation in json_to_shell.ml, to show that the JSON deserialization by itself is fast (linear); thus, the quadratic runtime is due to the libdash core.

Unparsing: fresh_marker for heredocs is slow on adversarial inputs

fresh_marker tries to find increasingly long-variants of {EOF, EOFF, EOFFF ..}, until it can find a marker that is not contained in the heredoc. This is slow for adversarial inputs that deliberately use all those markers:

cat <<CTHULHU
EOF
EOFF
EOFFF
EOFFFF
EOFFFFF
CTHULHU

mgree commented 4 years ago

😬 😅

Answer 1 · 2021-01-28T19:35:38.000Z

These are amazing (a/k/a awful) performance issues, thanks for finding them!

Answer 2 · 2021-01-29T19:05:02.000Z

Another source of inefficiency in parse_tilde is that OCaml by default has eager evaluation, which means implode will be called at every per-character iteration of parse_tilde, instead of just once when the ret value is actually needed
i.e., for an n-character input, implode will be about n times, leading to O(n^2) runtime.

Code snippet where this happens in ast.ml:

parse_tilde acc = 
  let ret = if acc = [] then None else Some (implode acc) in
  function
  | [] -> (ret , [])
  ...
  | c::s' -> parse_tilde (acc @ [c]) s'

To show this, I changed dash.ast to add a printf to the implode function:

let implode l =
  let s = Bytes.create (List.length l) in
  let rec imp i l =
    match l with
    | []  -> ()
    | (c::l) -> (Bytes.set s i c; imp (i+1) l)
  in
  imp 0 l;
  Printf.printf "implode created: %s\n" s;
  Bytes.unsafe_to_string s

Output:

/pash/compiler/parser# echo "~lovecraft" | ./parse_to_json.native
implode created: l
implode created: lo
implode created: lov
implode created: love
implode created: lovec
implode created: lovecr
implode created: lovecra
implode created: lovecraf
implode created: lovecraft
["Command",[1,[],[[["T",["Some","lovecraft"]]]],[]]]

There is a "Lazy" module in OCaml, or just use Haskell 🙃

Answer 3 · 2021-01-29T19:42:26.000Z

Linear-time versions of parse_tilde and to_assign (changing list append to list prepend + one-off reverse, and removing the eager implode in parse_tilde):

and maybe_implode_rev acc =
  if acc = [] then None else Some (implode (List.rev acc))

and parse_tilde acc =
  function
  | [] -> (maybe_implode_rev acc, [])
  (* CTLESC *)
  | '\129'::_ as s -> None, s
  (* CTLQUOTEMARK *)
  | '\136'::_ as s -> None, s
  (* terminal: CTLENDVAR, /, : *)
  | '\131'::_ as s -> maybe_implode_rev acc, s
  | ':'::_ as s -> maybe_implode_rev acc, s
  | '/'::_ as s -> maybe_implode_rev acc, s
  (* ordinary char *)
  (* TODO 2019-01-03 only characters from the portable character set *)
  | c::s' -> parse_tilde (c :: acc) s'

The copy-and-pasted maybe_implode_rev acc is not as pretty as having ret (or a lazy version of it), but it's fast.

and to_assign v = function
  | [] -> failwith ("Never found an '=' sign in assignment, got " ^ implode (List.rev v))
  | C '=' :: a -> (implode (List.rev v),a)
  | C c :: a -> to_assign (c :: v) a
  | _ -> failwith "Unexpected special character in assignment"

Answer 4 · 2021-02-18T23:11:36.000Z

Relatedly, parse_arg and arg_char are using the call stack when they shouldn't need to.

Answer 5 · 2022-07-10T21:38:39.000Z

After #17 merges, the only remaining issue is the pervasively expensive string concatenation. Converting to_string to use Buffer should resolve this last issue.