poke1024/pyalign

Using vectors instead of characters

pseudo-rnd-thoughts opened this issue · 2 comments

I would like to use this outside of bioinformatics, where each "character" is a vector (np.ndarray) and a distance function computes the "distance" between two vectors.
All your examples use strings, so I was wondering whether this is possible with pyalign?

Yes, this is possible. Using pyalign.problems.general you can pass in any distance or similarity function. Here is an example code snippet that computes an alignment between words, where each word is represented through an embedding vector and word similarity is computed through cosine similarity between those vectors:

import numpy as np
import pyalign
import spacy

# compute some word embeddings (one vector per token)
nlp = spacy.load("en_core_web_md")
a = np.array([x.vector for x in nlp("old books and newer manuscripts")])
b = np.array([x.vector for x in nlp("recent writings")])

# solve alignment
from numpy.linalg import norm

def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

pf = pyalign.problems.general(
    cosine_sim,
    direction="maximize")

solver = pyalign.solve.GlobalSolver(
    gap_cost=pyalign.gaps.LinearGapCost(0.2),
    codomain=pyalign.solve.Solution)

problem = pf.new_problem(a, b)

solver.solve(problem)
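
One caveat with cosine_sim as written: if either vector is all zeros (which can happen, for example, for tokens that have no embedding), the division produces NaN. A minimal sketch of a guarded variant follows; the name safe_cosine_sim and the eps threshold are assumptions for illustration and not part of pyalign:

```python
import numpy as np
from numpy.linalg import norm

def safe_cosine_sim(u, v, eps=1e-12):
    # Hypothetical helper, not part of pyalign: cosine similarity that
    # falls back to 0.0 when either vector has (near-)zero length.
    denom = norm(u) * norm(v)
    if denom < eps:
        return 0.0
    return float(np.dot(u, v) / denom)
```

It should be usable in place of cosine_sim in the call to pyalign.problems.general above, since it takes the same two arguments and returns a single number.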

If you pass in a distance function (instead of a similarity function as above), you would use:

pf = pyalign.problems.general(
    some_distance_func,
    direction="minimize")
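
As an illustration of what some_distance_func could look like for np.ndarray inputs, here is a minimal sketch using Euclidean distance. The name euclidean_distance is an assumption for this example; any function that takes two vectors and returns a number should fit the same slot:

```python
import numpy as np

def euclidean_distance(u, v):
    # Hypothetical example of a distance function: the L2 norm of the
    # difference between two vectors. Smaller values mean "more alike".
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
```

Note that the gap cost of 0.2 in the earlier snippet was chosen for cosine similarities, which lie in [-1, 1]; with an unbounded distance like this one, a different gap cost is probably appropriate.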

Amazing, thanks