cisnlp/simalign

similarity alignment of sentences

Closed this issue · 10 comments

I know this might sound irrelevant, but can the logic of aligning words in two sentences be used to align sentences in two articles?

Hi @jiangweiatgithub, if I understand you correctly, this sounds like parallel sentence mining or sentence retrieval. You can use BERT for such a task, but I guess there are alternatives that work much better (maybe check out Sentence-BERT, classical approaches like tf-idf, or the methods proposed in the BUCC shared task for parallel sentence mining).
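To make the tf-idf suggestion concrete, here is a minimal, self-contained sketch of tf-idf based sentence retrieval: for each source sentence, pick the target sentence with the highest cosine similarity between tf-idf vectors. All function names here are illustrative (this is not SimAlign's API), and a real setup would use a proper tokenizer and, for cross-lingual mining, embeddings rather than surface-form tf-idf.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Compute a tf-idf vector (as a dict) for each whitespace-tokenized sentence."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(w for doc in docs for w in doc)          # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}      # smoothed idf
    return [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def mine_pairs(src, trg):
    """For each source sentence, return the index of the most similar target sentence."""
    sv, tv = tfidf_vectors(src), tfidf_vectors(trg)
    return [max(range(len(trg)), key=lambda j: cosine(u, tv[j])) for u in sv]
```

This is essentially what the classical BUCC-style baselines do before any neural reranking.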

Thank you for your prompt response, @pdufter! You got me right. I had checked BUCC, but it seems that none of the entries' sentence-similarity calculations really consider the context in which a specific sentence is located. I guess such context info is used by SimAlign, right?

SimAlign does not consider any cross-sentence context. Is that what you meant?

I mean, when SimAlign tries to align a word in Sentence A with two or more possible words in Sentence B, it will give more weight to the candidate located within a context (the word before and/or the word after) that is already aligned for sure. For example:
Sentence A in English : I like buying books, not reading books.
Sentence B in Chinese, words segmented by space: 我 喜欢 买 书,而非 读 书 。

As you might see or guess, both instances of "books" are translated into "书". Assuming "buying" and "," as well as "reading" and "." have already been aligned, both instances of "books" should be aligned as well.

Yes, given that mBERT computes contextualized embeddings and SimAlign just uses them directly, this context is considered. Also, for this kind of alignment, positional embeddings have a big (not always good) influence.

Is this related to the distortion correction parameter?

The distortion parameter can be used to push alignments more to the diagonal (i.e., towards similar relative positions in the two sentences). But mBERT already has position embeddings which yield a similar effect, so the distortion parameter does not have a big impact when using mBERT.
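The idea behind such a distortion correction can be sketched as reweighting the similarity matrix by distance from the diagonal. This is an illustrative formula, not SimAlign's exact implementation: `distortion` controls how strongly off-diagonal similarities are damped.

```python
def apply_distortion(sim, distortion=0.5):
    """Damp entries of an m x n similarity matrix (list of lists) by their
    distance from the diagonal; distortion=0 leaves the matrix unchanged."""
    m, n = len(sim), len(sim[0])
    return [[sim[i][j] * (1.0 - distortion * abs(i / max(m - 1, 1) - j / max(n - 1, 1)))
             for j in range(n)]
            for i in range(m)]
```

With `distortion=1.0`, a cell in the top-right or bottom-left corner is zeroed out entirely, while cells on the diagonal keep their original similarity.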

Recently I revisited this sentence alignment by passing in two lists of whole sentences, and got some meaningful alignment results. Is that intentional or just accidental?

Hm, not sure about this. There is generally no intention of supporting sentence alignment in SimAlign. Can you share more details about what you did exactly?

Here you go with the code:

from simalign import SentenceAligner

def read_first_n_lines(file_path, n):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = [file.readline().strip() for _ in range(n)]
    return lines

# Make an instance of the aligner.
# You can specify the embedding model and all alignment settings in the constructor.
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# get_word_aligns expects the source and target sentences pre-tokenized into words;
# here each list element is a whole sentence, so every sentence is treated as one "token".
src_sentences = read_first_n_lines(r'X:\repos\similarity_analysis\man_065.txt', 65)
trg_sentences = read_first_n_lines(r'X:\repos\similarity_analysis\woman_051.txt', 51)
print(len(src_sentences))
print(len(trg_sentences))

# The output is a dictionary with one entry per matching method.
# Each method maps to a list of pairs of zero-based indexes of aligned "tokens"
# (here: aligned sentences).
alignments = myaligner.get_word_aligns(src_sentences, trg_sentences)

for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])