ANSI control characters not treated as zero width

Hey folks!

According to the README.md, control characters should be treated as zero width, but ANSI color sequences currently don't seem to be. Code like the following (using strip_ansi_escapes) fails for strings containing ANSI escape sequences:

use unicode_width::UnicodeWidthStr;

fn assert_width(s: String) {
    // Width after removing the ANSI escape sequences.
    let stripped_width = std::str::from_utf8(&strip_ansi_escapes::strip(s.as_bytes()).unwrap())
        .unwrap()
        .width() as u16;
    // Width of the raw string, escape sequences included.
    let unicode_width = s.width() as u16;
    assert_eq!(
        stripped_width, unicode_width,
        "Mismatched width ({} vs {}) for `{:?}`",
        stripped_width, unicode_width, s
    );
}

...such as:

"\u{1b}[1m========\u{1b}[0m"

Is this expected?

It's unclear to me where the README file says that: the control characters are marked as "neutral" in https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt and "neutral" is a narrow character in the spec (see http://www.unicode.org/reports/tr11/#ED7). The purpose of this crate is to follow the spec, not to implement rendered column width in terminals.

@Manishearth : Sorry, I guess it wasn't the README, but instead the docstring for width:

unicode-width/src/lib.rs, line 102 in b58e85b:

/// Control characters are treated as having zero width.

Hmmm, that's interesting. I'm not sure where that came from; I'll need to go through the spec a bit more when I have time.

It seems like the issue might be that colors are actually represented by sequences of characters, rather than with any individual character... strip_ansi_escapes uses the vte crate, which makes it clearer that a state machine is required to interpret the sequence.
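To make the state-machine point concrete, here is a minimal sketch using only the standard library. It is not how vte or strip_ansi_escapes are implemented: `State` and `strip_csi` are hypothetical names, and the sketch only recognizes CSI sequences (ESC, `[`, parameters, then a final byte in 0x40..=0x7e), which is enough for color codes like the ones in this issue.

```rust
/// States of a deliberately tiny escape-sequence recognizer.
/// It only understands CSI sequences (ESC '[' ... final byte);
/// a real parser such as vte handles many more cases.
#[derive(Clone, Copy)]
enum State {
    Ground, // ordinary text
    Escape, // just saw ESC (0x1b)
    Csi,    // inside ESC '[' ..., waiting for the final byte
}

/// Hypothetical helper: returns `s` with CSI sequences removed.
fn strip_csi(s: &str) -> String {
    let mut state = State::Ground;
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        state = match (state, c) {
            // ESC starts a sequence; nothing is emitted for it.
            (State::Ground, '\u{1b}') => State::Escape,
            // Ordinary text is copied through unchanged.
            (State::Ground, _) => {
                out.push(c);
                State::Ground
            }
            // ESC '[' opens a CSI sequence.
            (State::Escape, '[') => State::Csi,
            // Simplification: any other character after ESC is dropped
            // as if it completed a two-character escape.
            (State::Escape, _) => State::Ground,
            // A character in 0x40..=0x7e terminates the CSI sequence.
            (State::Csi, '\u{40}'..='\u{7e}') => State::Ground,
            // Parameter and intermediate characters are skipped.
            (State::Csi, _) => State::Csi,
        };
    }
    out
}
```

With this sketch, `strip_csi("\u{1b}[1m========\u{1b}[0m")` yields `"========"`. Whether a character is visible depends on the state the parser is in, which a per-character width lookup cannot know.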

The example from the description, rendered with:

format!("Widths: {:#?}", s.chars().map(|c| format!("{:?}: {:?}", c, c.width())).collect::<Vec<_>>())

...looks like:

Widths: [
    "'\\u{1b}': None",
    "'[': Some(1)",
    "'3': Some(1)",
    "'2': Some(1)",
    "'m': Some(1)",
    "'=': Some(1)",
    "'=': Some(1)",
    "'=': Some(1)",
    "'=': Some(1)",
    "'=': Some(1)",
    "'=': Some(1)",
    "'=': Some(1)",
    "'\\u{1b}': None",
    "'[': Some(1)",
    "'0': Some(1)",
    "'m': Some(1)",
]

Oh, yeah, this is behaving as expected then: the \u{1b} is width zero. That docstring is about Unicode control characters, not general control sequences. Terminals and other higher-level systems are welcome to define their own control sequences.
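The arithmetic behind the mismatch can be sketched without the crate. The lookup below is an ASCII-only stand-in that mirrors the per-character output shown in this thread (ESC counts as 0, every printable character as 1). `ascii_width` and `naive_width` are hypothetical names, not part of unicode-width, and the crate's own handling of control characters may differ between versions.

```rust
/// ASCII-only stand-in for a per-character width lookup (an assumption
/// for illustration, not the unicode-width implementation): control
/// characters count as 0 columns, every other character as 1.
fn ascii_width(c: char) -> usize {
    if c.is_ascii_control() { 0 } else { 1 }
}

/// Sums per-character widths without interpreting escape sequences,
/// so the printable tail of each sequence ('[', digits, 'm') is counted.
fn naive_width(s: &str) -> usize {
    s.chars().map(ascii_width).sum()
}
```

For `"\u{1b}[1m========\u{1b}[0m"` this gives 14: eight `=` characters plus three printable characters from each of the two escape sequences. The stripped string `"========"` gives 8, which is why the assertion in the description fails.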

Ok, thanks: that makes sense.

For whoever comes across this issue next, this is what I think measuring width with unicode-width looks like while skipping ANSI escape sequences using vte:

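use vte::{Parser, Perform};
use unicode_width::UnicodeWidthChar;

// Assumes a vte version in which the other `Perform` methods have default
// implementations and `Parser::advance` takes one byte at a time.
fn count_blocks(s: &str) -> usize {
    struct BlockCounter(usize);

    impl Perform for BlockCounter {
        // Called for each printable character outside an escape sequence.
        fn print(&mut self, c: char) {
            self.0 += c.width().unwrap_or(0);
        }

        // Called for C0/C1 control bytes; a newline is counted as one block.
        fn execute(&mut self, byte: u8) {
            if byte == b'\n' {
                self.0 += 1;
            }
        }
    }

    let mut block_counter = BlockCounter(0);
    let mut parser = Parser::new();
    // Feed the parser byte by byte; escape sequences never reach `print`.
    for &b in s.as_bytes() {
        parser.advance(&mut block_counter, b);
    }
    block_counter.0
}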