contour-terminal/terminal-good-image-protocol

When to do the cell rasterization?

Opened this issue · 8 comments

jerch commented

As already indicated in our talk I am a bit concerned about applying image transformations late in the draw sequence, as your current draft implies. There are several reasons I find this an unfortunate choice, mainly:

  • high memory needs on terminal buffer - effectively any draw command would need to store the transformation rules for a single cell (if referring to a single uploaded image resource), or would have to create interim image resources from diverging transformation rules multiplying the image storage needs (hard to limit in a comprehensive way so appside can work with the limits)
  • complicated draw sequence with tons of arguments - error-prone, as the cropping/clipping is done for every single invocation
  • higher image processing needs during every draw command

Ofc I see the benefits in doing it this way, it would open the terminal to much finer grained image output from just one upload. Still the reasons above make we wonder if we should go with a much simpler abstraction and leave complicated image transformations to the appside. Mainly the pixel notion in the draw command bugs me - to be used successfully the appside always needs to know the correct image dimensions, which is not always the case (e.g. from curl <some_image_url>).

My proposal would basically separate the image-to-terminal-cells-clipping (cell rasterization) from the later cropping:

  • do rasterization during upload:
    • determine total cell coverage to be used in terms of CellRange(width, height)
    • apply basic alignment/resize pattern to fit the whole image into that CellRange

The interim state of this upload command would be an image reference, that has a width and a height in terminal cells, not pixels anymore. It makes any follow-up drawing rather simple - just draw CellRange(width, height) starting at Cell(sx, sy) of the image to Cell(dx, dy) of the display buffer. (We basically replaced pixels with cells as metrics, thus do the cropping within that new metrics.)

Furthermore changing font size in a terminal and thus different pixel cell coverage can easily be implemented by resizing the image resource behind the reference. (Imho it is important not to allow an image to change cell coverage from different font sizes.)

This would also work easily with multiplexers - they can just forward the upload and the drawing commands and expect them to work if supported.

* high memory needs on terminal buffer - effectively any draw command would need to store the transformation rules for a single cell (if referring to a single uploaded image resource), or would have to create interim image resources from diverging transformation rules multiplying the image storage needs (hard to limit in a comprehensive way so appside can work with the limits)

In my WIP PoC implementation the only part that actually stores the image data is the upload command, which puts the image into the storage pull.

When a render-image command is received, the image is being rasterized based on the properties the render-image command has been associated with (number of columns / lines, alignment/resize policy). There is no need for the actual screen cell to know anything about the image's data contents except a pointer (reference) to the rasterized image (holding the above line/col/align/resize properties) and a coordinate into the rasterized image that this screen cell likes to draw - no image data needs to be extracted nor held.

When the terminal renderer kicks in and iterates over each screen cell and then hits a cell that contains an image fragment, it then starts fetching the actual data from the rasterized image, and only then the actual rasterization takes place.
Ideally (also in xterm.js, well, at least in my implementation) that extracted image pixel data is then uploaded to the GPU, so that on future invocations there is no need for any rasterization or GPU-upload actions, only referencing the already uploaded image fragment into your GPU texture atlas manager. So you only rasterize once - at the very first GUI render iteration of that image display.

I think that is not to resource wasty, in fact, on the host you only keep the image once in memory and rasterize and GPU-upload also only once per displayed cell, everything else are index operations.

* complicated draw sequence with tons of arguments - error-prone

That is interesting. I think that is exactly what people wanted. splitting upload from render, and be able to reuse the uploaded image. Forcing the upload to also know how to rasterize might be protocol-wise possible but would be a layer-violation if you ask me and also limit the possibilities of reusing the uploaded image in different scenarios.
I don't yet see any added complexity or error-proneness here. :)

as the cropping/clipping is done for every single invocation
I hope I have cleared that argument in the top comment already :-)

* higher image processing needs during every draw command

It's only done on the first operation (as mentioned above), I don't think that makes it more expensive.

Ofc I see the benefits in doing it this way, it would open the terminal to much finer grained image output from just one upload. Still the reasons above make we wonder if we should go with a much simpler abstraction and leave complicated image transformations to the appside.

When you here refer to image alignment and image resize policy, I actually agree. However the client app does not necessarily know how to crop/resize to fill every pixel as it does not necessarily know the aspect ratio and pixel dimension of the grid cells. I've taken that part also from the orginal forum post that explains the here formalized "good image protocol". :)

Mainly the pixel notion in the draw command bugs me - to be used successfully the appside always needs to know the correct image dimensions, which is not always the case (e.g. from curl <some_image_url>).

Maybe I do not understand your point here exactly, but I was attempting very hard to not have any notion of pixels for the client app side.

The other good point should be, a proxying terminal emulator does not even need to understand the image format, it can simply treat it as a blob, which hopefully should satisfy the tmux author - I hope :)

My proposal would basically separate the image-to-terminal-cells-clipping (cell rasterization) from the later cropping:

👍 (i remember I have read that, will do again and see what we can incoorperate)

* do rasterization during upload:

I mentioned above why I am against that (and I think the original GIP post didn't see it that way either, I have to reread while getting into the subject again though)

Furthermore changing font size in a terminal and thus different pixel cell coverage can easily be implemented by resizing the image resource behind the reference. (IMHO it is important not to allow an image to change cell coverage from different font sizes.)

In the current protocol draft, when a font change / font resize would take place (at least in my implementation), the GPU cache is being flushed and therefore the rasterized image has to be refreched which would then realign/resize based on the given rasterization properties.

This would also work easily with multiplexers - they can just forward the upload and the drawing commands and expect them to work if supported.

I took special care of that in the spec to make it as simple as possible for tmux-like proxy terminals (as mentioned above). If I have missed something, I of course have to adapt then :)

p.s: sorry for the late reply (you know), I had busy times in other areas I had to close down first before resuming my work on this draft spec and PoC impl. :)

jerch commented

Ah sorry, my post is kinda old and I have to refresh my memory first, why I wrote it that way.

I think we are not that far from each other, as it might sound above. Lemme quickly iterate, which steps I do it in my playground branch and found to be efficient in the browser env:

  1. upload sequence tells TE, that the image in the payload should be of size rows x cols independent of the pixel size
  2. impl detail (not needed to be done like that, e.g. a multiplexer could just keep the payload data here):
    a. xterm.js receives & decodes file to pixel data
    b. image storage saves original data unaltered + a fitted one perfectly matching the rows x cols size as actual
  3. onDraw (either directly for a non ref sequence, or later from clip-drawing sequence):
    a. add tile IDs to x * y terminal buffer cells starting at offset (default current cursor pos), size transformation is not possible here, the tile IDs are simply indexed from static clipping of actual as xy extend of image_pixel_size / cell_pixel_size
  4. onRender - read tile ID, clip actual from (ID / image_rows)/(ID % image_rows) to dx/dy (viewport target)
  5. onFontResize - recreate actual from original with stable cell coverage in mind

Thats basically all I have to do. There is just one size transformation needed during upload, and later if the user changes font/cell size. The reason why I do the size transformation on first sight is simple - it is a single action per image and speeds up later rendering in 4. roughly 2 times (the real rendering gets reduced to a simple 1:1 bitmap clipping). But for that to work, I need the later ref-clip-drawing sequence to be transformation free - thats basically the idea behind the post.

This is directly linked to the GPU here - I see where you come from, having full access to OpenGL primitives gives you much more transformation freedom. But thats not the case for many TEs (framebuffer TEs, kernel consoles), thus should not influence the sequence shapes. I think pretagging a fixed grid size to the image resource and clipping tiles 1:1 is the better general purpose interface.

Or to put it more simple:

  • the upload sequence tags the image to a fixed rows * cols size, the stored image resource basically carries that as its size metrics for the time being
  • later on the ref drawing sequence just pulls tile areas from that "grid-fixed" image resource

Ofc this all is implementation detail, I could also do late transformation all the time (I actually started with that). But by reshaping the sequences as above, a TE without OpenGL access can decide to optimize image resource handling the way I did in xterm.js. Thats not possible, if the size transformation is hard tagged to the later ref-drawing sequence.
And to not just name the advantages - it acually comes with a downside: any later ref-drawing sequence cannot leave the fixed grid alignment anymore, which is possible with your sequences.

okay. this is were the major difference lies, however, your rasterization step (creating the "actual" image) in GIP is currently done in the Render command. So also only once. I think that should be okay to move from upload to render (the rasterization step), as this command is also only invoked exactly once per image display; in the end you only keep one rasterized image, it's just happening on the Render command, and not on the Upload command. I hope that's fine with your thinking :-)

wrt OpenGL, you are absolutely right, that is one of the reasons for me to be strictly introducing a Z-axis (see kitty protocol - we should have a separate discussion on that).
However, while I was mentioning "uploading to the GPU", I think what I call the texture atlas (containing all mini-tiles for each cell) can also be kept on the host so that there should be no additional impact to your current way of drawing text (where -IMHO- you also keep some kind of texture cache for your rasterized text glyphs). I hope that would not hold TEs like yours off the road :)

jerch commented

... as this command is also only invoked exactly once per image display; in the end you only keep one rasterized image, it's just happening on the Render command, and not on the Upload command.

For me thats the crucial point. With your sequence an app can freely resize on every clip-drawing sequence (thus I need to store several resized versions), while I think we should limit it to one grid size for an image ref resource as long as it lives in the storage.

... that is one of the reasons for me to be strictly introducing a Z-axis (see kitty protocol - we should have a separate discussion on that).

Thats basically the idea behind my layering issue #11. Guess we should start over from scratch, my issue is convoluted with image layering/blending and whether a layer should follow text grid mechanics and where it should appear relative to other TE layers. That are in fact multiple issues in one.

For me thats the crucial point. With your sequence an app can freely resize on every clip-drawing sequence (thus I need to store several resized versions), while I think we should limit it to one grid size for an image ref resource as long as it lives in the storage.
So basically you request a command for uploading the image with the following parameters:

  • name (to identify the image later on)
  • image-format (of the image data)
  • image-data (actual pixel data, may include PNG etc)
  • image-size (width and height in pixels, this is optional as PNG etc do have that info inline)
  • screen-size (that is the new part, how to pre-rasterize)
  • screen-alignment
  • screen-resize (both have therefore to be moved here, too).

To be clear, I am not against that, I just want to be super sure about.
One might even put that into a separate step

  1. upload named image blob
  2. rasterize named image blob (with given screen-* attribs; can be called multiple times on the same image)
  3. render image (as often as you want, referencing a rasterized image)
  4. releasing names referencing (rasterized) image.

Currently in GIP point 2 and 3 are merged. In your request, point 1 and 2 are merged.
I now wonder if it would make sense to just have point 2 as its separate command then (still providing the ONESHOT render command of course, which merges all of these commands into a single one, for cat-like image displays)

I wouldn't try to propose yet another solution (even more VT GIP commands to satisfy all), because then you'd still need to implement the current proposed commands.

I think I'm fine in moving point 2 up into the upload command then (maybe there isn't that much usecase for that afterall, and if, it could be added as an improved protocol version (1.1 or so) and builds upon the existing but adds more commands if necessary.

... that is one of the reasons for me to be strictly introducing a Z-axis (see kitty protocol - we should have a separate discussion on that).

Thats basically the idea behind my layering issue #11. Guess we should start over from scratch, my issue is convoluted with image layering/blending and whether a layer should follow text grid mechanics and where it should appear relative to other TE layers. That are in fact multiple issues in one.

I'll post my reply on that matter in #11 then.

jerch commented

👍 Yes, thats basically what I had in mind (beside some more params like the real bytelength to do at least basic sanity check in the sequence parser).

  1. upload named image blob

Interesting that you bring that up now, I also have thought about that (as mentioned earlier in the image xterm.js PR) up to generalize that further as a universal blob upload sequence. As I wrote over there, I think we will see more upload demand for other data types as well - next hottest candidates are vector graphics (we dont even have a good format for that yet), and some reduced PDF things. With broader image support we basically open a door for terminals - limited document support. Not sure how far that could be generalized in the end, I didnt went further down that road yet.

note that i am already displaying many PDFs in Notcurses via sixel/kitty as a result of ffmpeg cheerfully decoding them into RGBA. IMHO this kind of thing ought be client-side, rather than adding formats to decode to the terminal (and how will that work -- will you have mime-type discovery available to the client?).

jerch commented

@dankamongmen Nah nothing concrete yet into the document direction, and in general I agree with you - we should not try to reshape terminals into a text stream based browser/rich client. It is beyond their scope as IO "extension" of cmdline apps.

Still just providing capable basic output interfaces vs. going document style is a thin line we already crossed here with image formats. Same gonna happen with vector graphics, as there is some demand for that from typical plotting REPLs. (If I ever find time for that I prolly will strip down SVG-tiny into a terminal secure variant as SVG-T.)