typelevel/paiges

Expose `Doc#renderIterator`

lihaoyi opened this issue · 11 comments

There already is an iterator available internally via Chunk.best, but for some reason it's wrapped in a Stream or PrintWriter-based API when exposed outside the package. Neither of those allows for early-termination without accumulating the whole blob in memory, which Iterator does, so I think the library should expose it

User-land workaround:

package org.typelevel.paiges

object Bridge {
  def apply(doc: Doc, w: Int) = {
    Chunk.best(w, doc)
  }
}

This is intentional.

Why can't you stop early with renderStream(w).iterator

That should not accumulate in memory (or an optimal Stream would not and I assumed that scala was safe in this regard).

The reason, by the way, is that most docs are small, and the strings are already in memory. The safety of an immutable object seemed a small price to pay.

Keep in mind: you have to keep the strings in memory anyway.

Good point, I guess the heads of the stream will get GCed as they become no longer needed by the iterator? In that case, I guess the only penalty to using renderStream(w).iterator is just a bit of GC churn, but that's probably fine.

While using iterators is unsafe because they're mutable, I suppose using Stream is unsafe in a different way: if you think it's immutable and treat it like any other immutable object, it's really easy to accidentally go

val s = renderStream(w)
s.iterator

And have that blow up on you, memory-wise?

I guess for my purposes calling renderStream(w).iterator works, but it's not immediately obvious to me that it's safer, or that you need to intentionally not-keep-a-reference to the result of renderStream in order for it to be memory safe.

Anyway, your call if you want to do this or not; if not I'll just add a "// warning: do not assign this stream to a local val" in my code 😛

Yes, that's true holding the head can prevent GC, but keep in mind: we only point to the strings already allocated. So, the size of the Stream in memory should be smaller than the Doc.

Do you imagine there will be an actual issue?

It has been an issue for me in the past, but I do funny things with my pretty-printing.

One of my use cases, for example, was lazily/streaming-ly pretty-printing large (100s of mb) in-memory data structures for debugging, and holding the whole pretty-printed output in memory was enough to make my JVM run out of memory.

The way I see it

  • If you wanted to provide a streaming/lazy renderStream method in the first place, it's probably because the things your printing are rather large? Otherwise just render: String would do.

  • And given that they're rather large, it makes sense to me to then to say you don't want to hold the whole thing in memory at once.

  • And if you don't want to hold the whole thing in memory at once, having an Iterator-based API seems more obvious to me than a Stream-based API. To me, Stream says "Lazy, memoized List", despite the fact that as-you-mentioned you can avoid memoization by not keeping a reference to the tail.

Please open an issue if this fails in that use case.

Since the Stream should only point to less memory than the Doc, I think either you won't be able to hold the Doc or the Stream will be fine.

The reason the stream is better than a String is that we would have to allocate a copy of everything to build it (which we do expose as the render method). The Stream is what allows us to never do a String allocation while rendering, so it is also a speed optimization.

Hmm do you mean that Doc isn't lazy? The readme said it had Lazy, O(1) concatenation, which I thought meant you could create a doc such that it wouldn't even build/allocate parts of it until it came time to render. Or am I mis-understanding what it means by "Lazy"?

Ok, I'll see if I bump into anything