skip-grams negative sampling should not sample from the whole vocabulary
Closed this issue · 1 comments
GSSHOP-KimKC commented
When doing negative sampling, the context indices should, by definition, be sampled from outside the current window.
However, in tf.keras.preprocessing.sequence.skipgrams, when a negative couple [center word index, context word index] is sampled, the context word index is drawn from the whole index range, including the context words that lie inside the window of that center word (line 225).
keras-preprocessing/keras_preprocessing/sequence.py
Lines 219 to 230 in 4538765
As a result, a couple [center word index, within-window context word index] may appear with two opposing labels (0: negative, 1: positive).
I could verify this issue with the following simple code.
from tensorflow.keras.preprocessing.sequence import skipgrams
seq = [1, 2, 3, 4]
sgns = skipgrams(seq, 5, window_size=3, negative_samples=1)
sgps = skipgrams(seq, 5, window_size=3, negative_samples=0)
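def find_mislabeled(sg_arr):
    # Map each couple to the set of labels it was assigned.
    hmap = {}
    for couple, label in zip(*sg_arr):
        key = str(couple)
        if key in hmap:
            hmap[key].add(label)
        else:
            hmap[key] = {label}
    # Keep only couples that received more than one distinct label.
    return {k: v for k, v in hmap.items() if len(v) > 1}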
print(len(find_mislabeled(sgps)) == 0) # True
print(len(find_mislabeled(sgns)) == 0) # False
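For illustration, here is a minimal pure-Python sketch of what window-aware negative sampling could look like. It is not the library's implementation: the function name `skipgrams_excluding_positives` and its `seed` parameter are hypothetical, and it assumes that index 0 is a non-word and that negatives are drawn from 1..vocabulary_size-1. It rejects any candidate that was already observed as a positive couple anywhere in the sequence, which is slightly stricter than "outside the current window", and it also skips the center word itself (an assumption).

```python
import random


def skipgrams_excluding_positives(sequence, vocabulary_size, window_size=4,
                                  negative_samples=1.0, seed=None):
    """Hypothetical skip-gram sampler whose negatives never match a positive couple."""
    rng = random.Random(seed)
    couples, labels = [], []
    positives = set()

    # Collect positive couples: every word within `window_size` of the center.
    for i, wi in enumerate(sequence):
        if not wi:  # assumption: index 0 is a non-word and is skipped
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            wj = sequence[j]
            if j == i or not wj:
                continue
            couples.append([wi, wj])
            labels.append(1)
            positives.add((wi, wj))

    # Draw negatives only from words never seen as a positive for this center.
    num_negative = int(len(labels) * negative_samples)
    centers = [c[0] for c in couples]
    rng.shuffle(centers)
    for k in range(num_negative):
        wi = centers[k % len(centers)]
        candidates = [w for w in range(1, vocabulary_size)
                      if w != wi and (wi, w) not in positives]
        if not candidates:
            continue  # no valid negative exists for this center word
        couples.append([wi, rng.choice(candidates)])
        labels.append(0)

    return couples, labels
```

With this sketch, a call such as `skipgrams_excluding_positives([1, 2, 3, 4], 5, window_size=3)` returns no negatives at all, because every other word in the vocabulary is already a positive context of each center word. A larger vocabulary leaves room for valid negatives.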
GSSHOP-KimKC commented
I was logged in to the wrong GitHub account. I will repost this from another one.