polm/fugashi

method for preserving half-width spaces?

garfieldnate opened this issue · 8 comments

Not sure if this is a MeCab thing or a Unidic thing, but full-width spaces are properly output while half-width spaces are simply swallowed:

>>> from fugashi import Tagger
>>> TAGGER = Tagger("-Owakati")
>>> TAGGER("ハロー ジャパン")
[ハロー,  , ジャパン]
>>> TAGGER("ハロー ジャパン")
[ハロー, ジャパン]

Do you know of any way to prevent this? Losing characters in the output means having to do extra processing to match input text spans against output text tokens.

polm commented

ASCII spaces (and I think a few other characters, like tabs or newlines) get special treatment and do not become tokens. However, they are saved: you can access the whitespace that comes before a token using node.white_space. For example:

import fugashi

tagger = fugashi.Tagger()
nodes = tagger("I like ginger")

# this has no spaces
print("".join([nn.surface for nn in nodes]))
# this is the original
print("".join([nn.white_space + nn.surface for nn in nodes]))

garfieldnate commented

That's very helpful, thank you!

Does this mean that whitespace at the end of an input is lost?

polm commented

I'd actually never thought about that before. It should be present on the EOS node, but fugashi omits that node from the output.
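In the meantime you could recover it by checking how much of the input the tokens cover - a minimal sketch:

text = "ginger  "
nodes = tagger(text)
covered = sum(len(nn.white_space) + len(nn.surface) for nn in nodes)
print(repr(text[covered:]))  # whatever wasn't covered is the trailing whitespace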

Do you have a use case where trailing white space is important?

I guess "important" is relative, but for annotation systems where you can't change the input text, you need a one-to-one mapping between the input and output so that you can determine where to assign the annotations. Therefore, though it might be overly cautious on my part, I always add an assert that the input and output texts are the same. Generally I prefer systems that output character ranges instead of the (possibly modified) surface strings, as this makes the step of mapping back to input unnecessary.

I noticed in testing that MeCab will also change some whitespace characters: \n and \t come back as plain spaces. Luckily, multiple whitespace characters don't seem to be collapsed into one, so the length is at least preserved.
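For example, a quick check with the same TAGGER as in my original report:

>>> nodes = TAGGER("ハロー\tジャパン")
>>> nodes[1].white_space  # the tab comes back as a plain space
' '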

Maybe it's rather niche of me to want a list of tokens whose surfaces exactly cover the input text, but for anyone else that needs something similar, here's what I've used:

class WhitespaceToken:
    # Dummy stand-in for a whitespace token; fugashi's node classes are
    # C classes that can't easily be instantiated from Python (see note below).
    def __init__(self, surface):
        self.surface = surface


def __repair_whitespace(text, analysis):
    # To make the input text and output token surfaces match exactly, we need
    # special handling for whitespace because MeCab/fugashi:
    # 1. assigns leading whitespace to the next token's white_space attribute
    #    instead of creating a dedicated token,
    # 2. does not assign trailing whitespace to any token, and
    # 3. converts newlines, etc. into spaces.
    analysis_with_ws = []
    surface_index = 0
    for token in analysis:
        # handle 1. above by emitting each token's white_space as a separate token
        if leading_ws := token.white_space:
            # handle 3. above by retrieving the original whitespace from the input text
            ws_len = len(leading_ws)
            true_ws = text[surface_index : surface_index + ws_len]
            analysis_with_ws.append(WhitespaceToken(true_ws))
            surface_index += ws_len

        analysis_with_ws.append(token)
        surface_index += len(token.surface)

    # handle 2. above by finding and re-appending trailing whitespace
    if surface_index != len(text):
        trailing_ws = text[surface_index:]
        assert trailing_ws.strip() == "", "Algorithm for re-appending trailing whitespace is trying to append non-whitespace"
        analysis_with_ws.append(WhitespaceToken(trailing_ws))

    return analysis_with_ws
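Used like this for a round-trip check (a hypothetical usage sketch):

text = "ハロー ジャパン\n"
tokens = __repair_whitespace(text, TAGGER(text))
assert "".join(t.surface for t in tokens) == text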

The WhitespaceToken class is just a dummy I made; I didn't want to try to subclass fugashi.fugashi.UnidicNode (a C class), and I couldn't reuse the output of a previous call to MeCab: after saving TAGGER(" ")[0] in a global variable, it seemed to be overwritten on successive calls to TAGGER (more C stuff, I presume).

Thanks again for your help!

polm commented

I think saving the trailing whitespace from the input in advance should solve this issue for most cases, so I won't take any action on this now, but if it's important I could add an option to fugashi to return the omitted BOS and EOS nodes.

garfieldnate commented

Curious: are the BOS/EOS markers for individual sentences, or just for the entire input? Because if there's a possibility of having fugashi segment sentences for me, I would definitely consider it important to keep these markers.

Otherwise, a note in the docs might be helpful. But the next person who searches will find this issue anyway, so probably no biggie :D

polm commented

The BOS/EOS nodes are for the whole input; MeCab doesn't do any kind of sentence segmentation. You can see the EOS node if you use mecab (or fugashi) on the CLI.
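For example, from Python (tagger.parse returns MeCab's raw output; with the default output format the last line should be EOS):

import fugashi

tagger = fugashi.Tagger()
print(tagger.parse("ハロー"))  # the printed output ends with an EOS line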

I'll see about making this clearer in the docs.