mcognetta/ThreeHotKoreanModeling

My two cents

Opened this issue · 1 comment

Copilot suggested this repository while I was adding additional tokens (James) to my tokenizer.

Here's my two cents:

I'm afraid this is basically character-level encoding, or the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding is already doing the same thing.
And three-hot encoding is exactly what the Unicode Korean (Hangul) table does, too.

Hi, thanks for the comments.

Copilot suggested this repository while I was adding additional tokens (James) to my tokenizer.

Interesting, this is my first time having that happen. I am flattered.

I'm afraid this is basically character-level encoding, or the same as one-hot encoding with every single Korean character in the vocabulary, because the embedding is already doing the same thing.

I think you are slightly misunderstanding what our work does. We are doing character-level (= syllable/음절-level) modeling, but we do it in a way that reduces the parameter count by using only subcharacter/자모 features. You can read about it here: https://aclanthology.org/2023.eacl-main.172/.

On the encoding side there are roughly 3 options:

One-hot syllable: requires ~11k embedding vectors (one per precomposed syllable)
One-hot jamo: requires ~70 embedding vectors, but triples the sequence length
Three-hot syllable: requires ~70 embedding vectors, with syllable-level sequence length

Ours is three-hot syllable: we produce a single syllable-level encoding for each syllable in the text, but it's made from a combination of the component jamo parts.
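
To make the encoding side concrete, here is a minimal sketch (not the repo's actual code) of how a precomposed Hangul syllable decomposes into three jamo indices via the standard Unicode formula, and how three small embedding tables can produce one syllable-level vector per timestep. Combining the jamo embeddings by summation, and the embedding dimension, are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def decompose(syllable: str):
    """Decompose a precomposed Hangul syllable into (initial, medial, final) indices."""
    code = ord(syllable) - 0xAC00          # Hangul Syllables block starts at U+AC00
    assert 0 <= code < 11172, "not a precomposed Hangul syllable"
    initial, rest = divmod(code, 21 * 28)  # 19 initial consonants (초성)
    medial, final = divmod(rest, 28)       # 21 vowels (중성), 28 finals incl. "no final" (종성)
    return initial, medial, final

class ThreeHotEmbedding(nn.Module):
    """Sketch: 19 + 21 + 28 = 68 (~70) embedding vectors instead of ~11k."""
    def __init__(self, dim: int):
        super().__init__()
        self.initial = nn.Embedding(19, dim)
        self.medial = nn.Embedding(21, dim)
        self.final = nn.Embedding(28, dim)

    def forward(self, i, m, f):
        # One syllable-level vector per timestep, built from three jamo lookups
        # (summation here is an assumption; the paper may combine them differently).
        return self.initial(i) + self.medial(m) + self.final(f)

# Example: "한" decomposes to (initial=18, medial=0, final=4).
i, m, f = decompose("한")
emb = ThreeHotEmbedding(dim=128)
vec = emb(torch.tensor([i]), torch.tensor([m]), torch.tensor([f]))
```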

However, our work mainly focuses on the output side, where there is a fourth option: independent three-hot syllable decoding (https://koreascience.kr/article/CFKO201832073079068.pdf). We show that this one doesn't properly model syllables, and we propose conditional three-hot syllable decoding, which also requires only ~70 embedding vectors and outputs a full syllable in a single timestep.
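
To illustrate the distinction, here is a rough, hypothetical sketch of a conditional three-hot output head, where each jamo prediction within a single timestep is conditioned on the previously decoded jamo of that same syllable. The actual architecture in the paper may differ; this only illustrates the factorization P(syllable | h) = P(initial | h) · P(medial | h, initial) · P(final | h, initial, medial), as opposed to predicting the three jamo independently.

```python
import torch
import torch.nn as nn

class ConditionalThreeHotHead(nn.Module):
    """Sketch of conditional three-hot decoding for one output timestep (greedy)."""
    def __init__(self, hidden: int, dim: int = 64):
        super().__init__()
        self.initial_head = nn.Linear(hidden, 19)
        self.medial_head = nn.Linear(hidden + dim, 21)
        self.final_head = nn.Linear(hidden + 2 * dim, 28)
        self.initial_emb = nn.Embedding(19, dim)  # feeds the chosen initial forward
        self.medial_emb = nn.Embedding(21, dim)   # feeds the chosen medial forward

    def forward(self, h):
        # h: (batch, hidden) decoder state for this timestep.
        i = self.initial_head(h).argmax(-1)                                   # P(initial | h)
        m = self.medial_head(torch.cat([h, self.initial_emb(i)], -1)).argmax(-1)   # P(medial | h, initial)
        f = self.final_head(
            torch.cat([h, self.initial_emb(i), self.medial_emb(m)], -1)
        ).argmax(-1)                                                          # P(final | h, initial, medial)
        return i, m, f
```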

So, to summarize, we are doing character-level encoding, but with a reduced parameter count (~11k -> ~70 embedding vectors) and no increase in sequence length.