alvations/gachalign

How was the bead_cost derived?

At https://github.com/alvations/gachalign/blob/master/gale-church.py#L24, there's a dictionary of pattern-to-integer entries called BEAD_COSTS.

BEAD_COSTS = {(1, 1): 0, (2, 1): 230, (1, 2): 230, (0, 1): 450,
              (1, 0): 450, (2, 2): 440 }

I wonder how you arrived at the values 0, 230, 450, and 440. What do these figures mean?

Would it make any difference to have values of a smaller scale, like 0.23 and 0.45?

I wonder how you arrived at the values 0, 230, 450, and 440. What do these figures mean?

The bead_costs come from the original paper; there's a C implementation in it: https://www.aclweb.org/anthology/J93-1004.pdf (page 96).

The description of the penalty is on page 83; it's used in the dynamic programming alignment to penalize the less probable alignments.

Would it make any difference to have values of a smaller scale, like 0.23 and 0.45?

The penalties 0, 230, 450 and 440 are on a -100 * log scale, and they affect the length cost at https://github.com/alvations/gachalign/blob/master/gale-church.py#L27, so raw probabilities won't work there.
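
For intuition, here is a minimal sketch (not the repo's actual code) of how a penalty on this -100 * log scale would combine with a length-based cost when scoring a candidate bead; length_cost and bead_score below are hypothetical stand-ins for the code around line 27, and the probability value is made up for illustration.

import math

# Hypothetical stand-in for the length-based cost: -100 * log of an
# assumed probability that the source/target lengths match this well.
def length_cost(prob_lengths_match):
    return -100 * math.log(prob_lengths_match)

BEAD_COSTS = {(1, 1): 0, (2, 1): 230, (1, 2): 230, (0, 1): 450,
              (1, 0): 450, (2, 2): 440}

def bead_score(prob_lengths_match, bead_type):
    # Both terms are -100 * log probabilities, so adding them corresponds
    # to multiplying the underlying probabilities; the DP minimizes the sum.
    return length_cost(prob_lengths_match) + BEAD_COSTS[bead_type]

print(bead_score(0.5, (1, 1)))  # ~69.3  (likely 1-1 bead, low cost)
print(bead_score(0.5, (2, 2)))  # ~509.3 (unlikely 2-2 bead, high cost)

Plugging a raw probability like 0.23 into that sum would mix units, which is why the table values stay on the -100 * log scale.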


To get the penalties (bead_cost) for other probability distributions, you have to work backwards a little; here's an example. Assume you have an aligned corpus, you count the alignment patterns, and you find that:

89% of the time it's a 1 source to 1 target sentence alignment
8.9% of the time it's (i) a 2 source to 1 target sentence alignment or (ii) a 1 source to 2 target sentence alignment

So the probabilities are:

prob( event(1) ) = 0.89
prob( event(2) ) = 0.089

Then we take the ratio event(2) / event(1):

event(2) / event(1) = 0.089 / 0.89 = 1 / 10

Finally, we take -100 * log of that ratio, and that's the cost of aligning 2 source sentences to 1 target sentence (or vice versa):

>>> import math
>>> math.log(0.1) 
-2.3025850929940455
>>> -100 * math.log(0.1) 
230.25850929940455
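
To build the whole table the same way, you could wrap that computation in a small function. This is just a sketch: only the 0.89 and 0.089 figures come from the example above, the other pattern probabilities are placeholders you'd replace with counts from your own corpus.

import math

# Alignment-pattern probabilities; (1, 1) and (2, 1)/(1, 2) are from the
# example above, the rest are placeholder values, not real estimates.
pattern_probs = {
    (1, 1): 0.89,
    (2, 1): 0.089, (1, 2): 0.089,
    (0, 1): 0.01,  (1, 0): 0.01,   # placeholders
    (2, 2): 0.011,                 # placeholder
}

def bead_costs_from_probs(probs):
    # Penalties are -100 * log of each pattern's probability relative to
    # the most common (1, 1) pattern, so (1, 1) gets a cost of 0.
    p11 = probs[(1, 1)]
    return {pattern: round(-100 * math.log(p / p11))
            for pattern, p in probs.items()}

print(bead_costs_from_probs(pattern_probs))
# (1, 1) -> 0 and (2, 1)/(1, 2) -> 230, as derived above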

It's -100 because the original match() function returns a distance of -100 * log probability; see match() on page 95 of https://www.aclweb.org/anthology/J93-1004.pdf

Log space is used because, as described in Section 4 of the paper:

"The log is introduced here so that adding distances will produce desirable results."

I guess adding log probabilities is preferable to multiplying raw probabilities.

The -100 might be there because of stability issues when working with floats, but then again the original authors used floats elsewhere, so I'm not sure what the real motivation was. If you would like to use a log scale without the -100, you would have to refactor the code and remove the -100 multiplier in various places.
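
For example (a sketch, not a drop-in patch), the table itself could be rescaled to plain -log(prob) units by dividing out the 100, as long as every cost it is added to (the length cost) is rescaled the same way:

BEAD_COSTS = {(1, 1): 0, (2, 1): 230, (1, 2): 230, (0, 1): 450,
              (1, 0): 450, (2, 2): 440}

# Same penalties on a plain -log(prob) scale: 230 becomes 2.30, i.e. -log(0.1).
# Any cost these are summed with must use the same scale, or the units mix.
BEAD_COSTS_NATURAL_LOG = {k: v / 100.0 for k, v in BEAD_COSTS.items()}
print(BEAD_COSTS_NATURAL_LOG)
# {(1, 1): 0.0, (2, 1): 2.3, (1, 2): 2.3, (0, 1): 4.5, (1, 0): 4.5, (2, 2): 4.4}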

Hope the explanation helps!