microsoft/evodiff

Question on MaxHamming distance

BSharmi opened this issue · 1 comment

Hi there!

Could someone please help explain the max Hamming distance selection in https://github.com/microsoft/evodiff/blob/main/evodiff/data.py#L60?

curr_dist = cdist(random_seq, msa_subset, metric='hamming')
curr_dist = np.expand_dims(np.array(curr_dist), axis=0)  # shape is now (1,msa_num_seqs)
distance_matrix[i] = curr_dist
col_min = np.min(distance_matrix, axis=0)  # (1,num_choices)
max_ind = np.argmax(col_min)

Why do we take the minimum Hamming distance for the random sequence with respect to the MSA, rather than the maximum, and then do an argmax?
I may have missed some details, but as far as I understand, if we want diversity we need the sequences that have the most mutations with respect to the anchor sequences in the MSA?

Thank you!

We want a set of sequences that are diverse not just from the query sequence but also from each other (note that random_seq is updated at each iteration of the algorithm). If we maximized the Hamming distance without first taking the minimum over the sequences already selected, the resulting set could have a large overall spread in Hamming distance yet still contain pairs of sequences that are quite similar to each other. Taking the column-wise minimum and then the argmax means each newly selected sequence is the one whose closest already-selected sequence is as far away as possible, so it is as different as possible from everything already in the set. I hope this answers your question!
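
To make the min-then-argmax step concrete, here is a minimal sketch of greedy max-min selection on a toy integer-encoded alignment. It is not the evodiff implementation: the function name greedy_maxmin_select, the toy_msa array, and the choice of row 0 as the starting sequence are all assumptions made for illustration. Only cdist with metric='hamming', np.min/np.minimum, and np.argmax mirror the calls quoted in the question.

```python
import numpy as np
from scipy.spatial.distance import cdist


def greedy_maxmin_select(msa, n_select, first_ind=0):
    """Hypothetical helper: greedily pick rows of `msa` so that each new pick
    maximizes its minimum Hamming distance to the rows already picked."""
    msa = np.asarray(msa)
    selected = [first_ind]
    # min_dist[j] = smallest Hamming distance from row j to any selected row
    min_dist = cdist(msa[[first_ind]], msa, metric="hamming")[0]
    for _ in range(n_select - 1):
        min_dist[selected] = -1.0  # make sure selected rows are never re-picked
        next_ind = int(np.argmax(min_dist))  # farthest from its nearest pick
        selected.append(next_ind)
        # Fold the new pick in: keep the smaller distance for every candidate
        new_dist = cdist(msa[[next_ind]], msa, metric="hamming")[0]
        min_dist = np.minimum(min_dist, new_dist)
    return selected


# Toy alignment (assumed data): each row is a sequence, each value a residue id
toy_msa = np.array([
    [0, 0, 0, 0],  # row 0: starting sequence
    [0, 0, 0, 1],  # row 1: 1 mismatch from row 0
    [1, 1, 1, 1],  # row 2: 4 mismatches from row 0
    [1, 1, 1, 0],  # row 3: 3 mismatches from row 0, but only 1 from row 2
    [0, 0, 1, 1],  # row 4: 2 mismatches from row 0 and 2 from row 2
])

picked = greedy_maxmin_select(toy_msa, n_select=3)
print(picked)
```

In this toy case the second pick is row 2, the farthest row from row 0. For the third pick, row 3 has the largest distance to row 0 among the remaining rows, yet it differs from row 2 at only one position, so selecting by maximum distance would add a near-duplicate. Taking the minimum over the already-selected rows first favors row 4, which is two mismatches away from both row 0 and row 2.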