keras-team/keras-preprocessing

skipgrams negative sampling should not sample from the whole vocabulary

Closed · 1 comment

By definition, negative samples should be drawn from word indices outside the current window.

However, in tf.keras.preprocessing.sequence.skipgrams, when a negative [center word index, context word index] couple is sampled, the context word index is drawn from the whole index range, which includes the context words that fall inside the window (line 225).

    if negative_samples > 0:
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)
        couples += [[words[i % len(words)],
                     random.randint(1, vocabulary_size - 1)]
                    for i in range(num_negative_samples)]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

As a result, a [center word index, within-window context word index] couple can appear with two opposing labels (0: negative, 1: positive).
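
One possible fix, as a rough sketch only: reject any sampled negative that already exists as a positive couple. The function name `sample_negatives` and the `max_tries` and `rng` parameters are my own, not part of Keras. Note that this rejects against every positive couple in the sequence, which is slightly stricter than "outside the current window".

```python
import random


def sample_negatives(couples, vocabulary_size, num_negative_samples,
                     max_tries=100, rng=random):
    """Draw negative couples, rejecting any that is also a positive couple.

    Hypothetical helper, not a Keras API. `couples` is the list of
    positive [center, context] pairs built by the windowing pass.
    """
    if not couples:
        return []
    positives = {tuple(c) for c in couples}
    words = [c[0] for c in couples]
    rng.shuffle(words)

    negatives = []
    for i in range(num_negative_samples):
        center = words[i % len(words)]
        for _ in range(max_tries):
            context = rng.randint(1, vocabulary_size - 1)
            if (center, context) not in positives:
                negatives.append([center, context])
                break
        # If every try collided, this negative is skipped, so fewer than
        # num_negative_samples couples may be returned.
    return negatives
```

Rejection sampling keeps the distribution of contexts uniform over the allowed indices, at the cost of extra draws when most of the vocabulary is already in the window. The caller would have to build the label list from `len(negatives)` rather than from `num_negative_samples`.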

I could verify this issue with the following simple code.

from tensorflow.keras.preprocessing.sequence import skipgrams

seq = [1, 2, 3, 4]

sgns = skipgrams(seq, 5, window_size=3, negative_samples=1)
sgps  = skipgrams(seq, 5, window_size=3, negative_samples=0)

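def find_mislabeled(sg_arr):
    """Return the couples that appear with more than one label."""
    couples, labels = sg_arr
    hmap = {}
    for couple, label in zip(couples, labels):
        # Lists are unhashable, so use a tuple as the dictionary key.
        hmap.setdefault(tuple(couple), set()).add(label)
    return {k: v for k, v in hmap.items() if len(v) > 1}


print(len(find_mislabeled(sgps)) == 0)  # True
print(len(find_mislabeled(sgns)) == 0)  # False (almost always; the negatives are random)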
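
Since the negatives are random, the second result is not strictly guaranteed. The sketch below estimates how likely a clean run is. It re-implements the positive pass in plain Python based on my reading of the snippet quoted above, and assumes negative_samples=1, so that every positive center is reused exactly once. `positive_couples` and `prob_no_collision` are names I made up for this check.

```python
from fractions import Fraction


def positive_couples(sequence, window_size):
    """List the (center, context) pairs inside each window, skipping index 0."""
    couples = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j]:
                couples.append((wi, sequence[j]))
    return couples


def prob_no_collision(sequence, vocabulary_size, window_size):
    """Probability that no negative couple equals a positive couple.

    Assumes one negative per positive, with the context drawn uniformly
    from 1..vocabulary_size - 1 as in the quoted snippet.
    """
    couples = positive_couples(sequence, window_size)
    positives = set(couples)
    p = Fraction(1)
    for center, _ in couples:
        hits = sum((center, ctx) in positives
                   for ctx in range(1, vocabulary_size))
        p *= 1 - Fraction(hits, vocabulary_size - 1)
    return p


print(prob_no_collision([1, 2, 3, 4], 5, 3))  # 1/16777216
```

For this example every other word is inside the window, so 3 of the 4 possible context indices collide on each of the 12 draws, and the chance of seeing no mislabeled couple is (1/4)^12.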

I was logged in to the wrong Git account. I will repost this from another one.