cisnlp/simalign

similarity alignment of sentences

Closed this issue · 10 comments

I know this might sound irrelevant, but can the logic of aligning words in two sentences be used to align sentences in two articles?

Hi @jiangweiatgithub, if I understand you correctly, this sounds like parallel sentence mining or sentence retrieval. You can use BERT for such a task, but I guess there are alternatives that work much better (maybe check out Sentence-BERT, classical approaches like tf-idf, or the methods proposed in the BUCC shared task for parallel sentence mining).
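To make the tf-idf suggestion concrete, here is a minimal, self-contained sketch of tf-idf based sentence retrieval: for each source sentence, pick the target sentence with the highest cosine similarity between tf-idf vectors. All function names here are illustrative (this is not SimAlign's API), and a real setup would use a proper tokenizer and, for cross-lingual mining, embeddings rather than surface-form tf-idf.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Compute a tf-idf vector (as a dict) for each whitespace-tokenized sentence."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(w for doc in docs for w in doc)          # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}      # smoothed idf
    return [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def mine_pairs(src, trg):
    """For each source sentence, return the index of the most similar target sentence."""
    sv, tv = tfidf_vectors(src), tfidf_vectors(trg)
    return [max(range(len(trg)), key=lambda j: cosine(u, tv[j])) for u in sv]
```

This is essentially what the classical BUCC-style baselines do before any neural reranking.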

Thank you for your prompt response, @pdufter! You got me right. I had checked BUCC, but it seems that none of the entries' sentence-similarity calculations really consider the context in which a specific sentence is located. I guess such context info is used by SimAlign, right?

SimAlign does not consider any cross-sentence context. Is that what you meant?

I mean, when SimAlign tries to align a word in Sentence A with two or more possible words in Sentence B, it will give more weight to the candidate located within a context (the word before and/or the word after) that is already aligned for sure. For example:
Sentence A in English : I like buying books, not reading books.
Sentence B in Chinese, words segmented by space: 我 喜欢 买 书,而非 读 书 。

As you might see or guess, both instances of "books" are translated into "书". Assuming "buying" and "," as well as "reading" and "." have already been aligned, both instances of "books" should be aligned as well.

Yes, given that mBERT computes contextualized embeddings and SimAlign just uses them directly, this context is considered. Also, for this kind of alignment, positional embeddings have a big (not always good) influence.

Is this related to the distortion correction parameter?

The distortion parameter can be used to push alignments more to the diagonal (i.e., towards similar relative positions in the two sentences). But mBERT already has position embeddings which yield a similar effect, so the distortion parameter does not have a big impact when using mBERT.
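The idea behind such a distortion correction can be sketched as reweighting the similarity matrix by distance from the diagonal. This is an illustrative formula, not SimAlign's exact implementation: `distortion` controls how strongly off-diagonal similarities are damped.

```python
def apply_distortion(sim, distortion=0.5):
    """Damp entries of an m x n similarity matrix (list of lists) by their
    distance from the diagonal; distortion=0 leaves the matrix unchanged."""
    m, n = len(sim), len(sim[0])
    return [[sim[i][j] * (1.0 - distortion * abs(i / max(m - 1, 1) - j / max(n - 1, 1)))
             for j in range(n)]
            for i in range(m)]
```

With `distortion=1.0`, a cell in the top-right or bottom-left corner is zeroed out entirely, while cells on the diagonal keep their original similarity.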

Recently I revisited this sentence alignment by passing in two lists of whole sentences, and got some meaningful alignment results. Is that intentional or just accidental?

Hm, not sure about this. There is generally no intention of supporting sentence alignment in SimAlign. Can you share more details about what you did exactly?

Here you go with the code:

from simalign import SentenceAligner

def read_first_n_lines(file_path, n):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = [file.readline().strip() for _ in range(n)]
    return lines

# Make an instance of the aligner.
# You can specify the embedding model and all alignment settings in the constructor.
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# get_word_aligns expects the source and target sentences pre-tokenized into words;
# here each list element is a whole sentence, so every sentence is treated as one "token".
src_sentences = read_first_n_lines(r'X:\repos\similarity_analysis\man_065.txt', 65)
trg_sentences = read_first_n_lines(r'X:\repos\similarity_analysis\woman_051.txt', 51)
print(len(src_sentences))
print(len(trg_sentences))

# The output is a dictionary with one entry per matching method.
# Each method maps to a list of pairs of zero-based indexes of aligned "tokens"
# (here: aligned sentences).
alignments = myaligner.get_word_aligns(src_sentences, trg_sentences)

for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])