
Byte Pair Encoding with Pointwise Mutual Information (PMI)

Luke Song (Notre Dame NLP Group)

Overview

This code explores a novel approach to pre-processing a corpus before training.

Mechanism

If a corpus is compressed using a Shannon-optimal code, the compressed size would be

    −∑σ c(σ) log (c(σ) / N)

where c(σ) is the number of occurrences of wordpiece σ in the corpus and N = ∑σ c(σ) is the total number of wordpiece tokens.
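As a rough illustration (a sketch, not the repository's code), this quantity can be computed directly from a table of wordpiece counts; counts here is a hypothetical mapping from wordpiece to frequency:

    from math import log

    def compressed_size(counts):
        # N: total number of wordpiece tokens in the corpus.
        n = sum(counts.values())
        # Shannon-optimal size in nats: -sum of c(s) * log(c(s) / N).
        return -sum(c * log(c / n) for c in counts.values())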

For this code, if wordpieces σ1 and σ2 are merged, then let δ be the count of the merged wordpiece. The code updates all of the variables as follows:

    c(σ1σ2) ← δ
    c(σ1) ← c(σ1) − δ
    c(σ2) ← c(σ2) − δ
    N ← N − δ
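A minimal sketch of these updates (illustrative names, and assuming σ1 ≠ σ2):

    def apply_merge(counts, s1, s2, delta):
        # The merged wordpiece s1+s2 occurs delta times.
        counts[s1 + s2] = delta
        # Each merge consumes one occurrence of s1 and one of s2.
        counts[s1] -= delta
        counts[s2] -= delta
        # N = sum(counts.values()) has implicitly decreased by delta.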

The code merges the two wordpieces that lead to the greatest decrease in compressed size, that is, the two wordpieces that maximize:

    c(σ1σ2) · log (c(σ1σ2) · N / (c(σ1) · c(σ2)))

Standard BPE chooses the two wordpieces that maximize c(σ1σ2). The formula above multiplies this count by a correction factor, the pointwise mutual information (PMI) of σ1 and σ2, which measures how strongly σ1 and σ2 are associated with each other. It therefore favors wordpiece pairs with high PMI; a word and an adjacent punctuation mark, for example, would be expected to have low PMI. The formula also suggests a natural stopping criterion: stop merging when its maximum becomes negative.
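Putting the pieces together, a hedged sketch of this selection rule and stopping criterion (pair_counts is a hypothetical mapping from each adjacent wordpiece pair to its count δ; counts and n are as above):

    from math import log

    def best_merge(pair_counts, counts, n):
        def score(pair):
            s1, s2 = pair
            delta = pair_counts[pair]
            # delta * PMI(s1, s2) = delta * log(delta * N / (c(s1) * c(s2)))
            return delta * log(delta * n / (counts[s1] * counts[s2]))
        best = max(pair_counts, key=score)
        # Stop merging once even the best pair would increase the compressed size.
        return best if score(best) > 0 else None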

Usage

bpe_modified.py -s <number of operations> [-orig]¹ < text > codes_file

apply_bpe.py² -c codes_file < text > out_file

¹ Standard BPE mode from subword_nmt

² apply_bpe.py adapted from subword_nmt