an-cabal/an-rope

Consider factoring out rope metrics to one type?

Closed this issue · 10 comments

hawkw commented

@twisted-pear, do we want to consider rewriting Rope metrics so that rather than having implementations for Lines and Graphemes, we factor them out to one type, like

Measure { lines: usize
        , graphemes: usize 
        , bytes: usize }

or similar?

That way, rather than having to walk children once to measure lines and then again to measure graphemes, we can do it all at once?

hawkw commented

@twisted-pear, I'm happy to make the change if we go with something like this, unless you want to write it. Just figured I should get your opinion first before breaking a bunch of your code.

hawkw commented

If we go this route, might wanna consider changing the way we represent nodes to something like:

struct Node { measure: Measure
           , value: NodeType
           }

enum NodeType { Leaf(String)
              , Branch { left: Box<Node>, right: Box<Node> }
              }

or even just

struct Node { lines: usize 
            , graphemes: usize 
            , bytes: usize 
            , value: NodeType
            }

enum NodeType { Leaf(String)
              , Branch { left: Box<Node>, right: Box<Node> }
              }

This also means we could start caching measurements for Leaf nodes as well as for Branch nodes, which might be good for performance (although string.len() is fast, counting the number of lines/graphemes in a string is Less So)...

do we actually walk the tree to measure them though? as far as i can tell both the grapheme count and the line count are cached in branch nodes.

i'm not averse to the idea, but if you do it, take a look at the latest stuff in the node-line-count branch first, it implements splitting on lines and also fixes some edge cases, where the code previously assumed that measure == 0 means that the node is empty (which is no longer true for lines).

oh, right, we don't do caching on leaves...

hawkw commented

do we actually walk the tree to measure them though? as far as i can tell both the grapheme count and the line count are cached in branch nodes.

I guess we don't actually walk very far, but as you mentioned we don't do caching on leaves, and we call measure() on a node's children twice when making new branches. We could do it just once, and have the whole Measure type implement Add<Self> so we could just say

Node { measures: left.measure() + right.measure()
     , value: Branch { left: box left, right: box right }
     }

when creating a new node

hawkw commented

It occurs to me just now that this would also let us reuse cached values when splitting a leaf into a new branch, by only mutating the value field and replacing it with a Branch and just leaving the measurements in place.

dunno, the performance gain sounds somewhat superficial, especially considering that a whole lot is still entirely unimplemented...

but you looked at the performance benchmarks, and maybe refactoring later would be harder so if you think it's a good idea go ahead

hawkw commented

dunno, the performance gain sounds somewhat superficial, especially considering that a whole lot is still entirely unimplemented...

but you looked at the performance benchmarks, and maybe refactoring later would be harder so if you think it's a good idea go ahead

yeah, part of the reason i'm bringing this up now is that i feel like waiting will only make it harder - we'll have to change more and more stuff. haven't done any profiling of our in-development branches, though, so i dunno where the perf hotspots are; the motivation for me is more that i think it'll make our code a lot clearer?

i'm not averse to the idea, but if you do it, take a look at the latest stuff in the node-line-count branch first, it implements splitting on lines and also fixes some edge cases, where the code previously assumed that measure == 0 means that the node is empty (which is no longer true for lines).

okay, i think if i do make this change, i'd like to start off with our dev branches for lines & graphemes merged together? they don't have to be merged into master, but i'd rather not have my unicode branch diverge any further from your line count stuff...

yeah agreed, should probably go into one branch, the lines tests that you wrote do not work yet (but i pulled them) because we don't split lines in from() yet and we don't have an iterator for lines yet

hawkw commented

because we don't split lines in from() yet and we don't have an iterator for lines yet

both of those should be pretty simple – do you wanna make those changes before i merge the dev branches? I can give you some pointers if you need them..