cisnlp/simalign

Error in examples/align_files.py when token-type is word

Closed · 1 comment

Lines 172-200 of examples/align_files.py are shown below. embed_loader.get_embed_list(...) returns torch tensors, whereas SentenceAligner.get_similarity requires NumPy arrays.

		vectors = embed_loader.get_embed_list(list(sent_pair))
		if convert_to_words:
			w2b_map = []
			cnt = 0
			w2b_map.append([])
			for wlist in l1_tokens:
				w2b_map[0].append([])
				for x in wlist:
					w2b_map[0][-1].append(cnt)
					cnt += 1
			cnt = 0
			w2b_map.append([])
			for wlist in l2_tokens:
				w2b_map[1].append([])
				for x in wlist:
					w2b_map[1][-1].append(cnt)
					cnt += 1
			new_vectors = []
			for l_id in range(2):
				w_vector = []
				for word_set in w2b_map[l_id]:
					w_vector.append(vectors[l_id][word_set].mean(0))
				new_vectors.append(np.array(w_vector))
			vectors = np.array(new_vectors)

		all_mats = {}
		sim = SentenceAligner.get_similarity(vectors[0], vectors[1])
		sim = SentenceAligner.apply_distortion(sim, args.distortion)

This fails when --token-type is word, because sklearn.metrics.pairwise.cosine_similarity cannot convert the tensors to NumPy arrays directly: they still require gradients.

This is the exact error:

  File "/home/ishan/simalign/simalign/simalign.py", line 110, in get_similarity
    return (cosine_similarity(X, Y) + 1.0) / 2.0
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 1179, in cosine_similarity
    X, Y = check_pairwise_arrays(X, Y)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 134, in check_pairwise_arrays
    X, Y, dtype_float = _return_float_dtype(X, Y)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 45, in _return_float_dtype
    X = np.asarray(X)
  File "/home/ishan/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/home/ishan/.local/lib/python3.6/site-packages/torch/tensor.py", line 492, in __array__
    return self.numpy()
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

A quick workaround is to add an else clause to if convert_to_words containing the line vectors = np.array(vectors.detach()).
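For anyone hitting the same traceback, here is a minimal sketch of the failure and the detach-based workaround, independent of simalign. The tensor `emb` is a made-up stand-in for one sentence's embeddings, and `to_numpy` is a hypothetical helper, not a function from the repository.

```python
import numpy as np
import torch


def to_numpy(t: torch.Tensor) -> np.ndarray:
    """Hypothetical helper: detach from the autograd graph, then convert."""
    return t.detach().cpu().numpy()


# Made-up stand-in for one sentence's embeddings (4 tokens, 8 dimensions).
emb = torch.randn(4, 8, requires_grad=True)

# Direct conversion raises RuntimeError, as in the traceback above.
try:
    np.asarray(emb)
    direct_conversion_failed = False
except RuntimeError:
    direct_conversion_failed = True

# After detaching, the conversion succeeds.
arr = to_numpy(emb)
```

Another option, assuming the embeddings are only used for inference, would be to run the model inside a torch.no_grad() block so the outputs never require gradients in the first place.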

I believe this issue is solved with the new commit now. Could you please check again to make sure?