ronboger/conformal-protein-retrieval

How can I implement it for protein-text retrieval?


Dear authors,

Thank you for proposing this cool methodology! I'm currently working on a cross-modality retrieval system named ProTrek, and I'm very interested in integrating this metric into our evaluation system. I'm not familiar with conformal risk control, so I would appreciate any suggestions on how to implement this pipeline.

Given a protein x, ProTrek calculates similarity scores against all natural language descriptions (denoted y1, y2, ..., yn) in the database and returns scores s1, s2, ..., sn. I want to assign a probability to each score representing how likely it is to be correct. I have a test set of 4,000 proteins that were not seen during ProTrek's training, which I think could be used for evaluation and for generalizing to proteins input by users. Could you outline a pipeline for how I can implement this?

Thank you in advance and I sincerely look forward to your reply!

Best regards,
Jin

Hi Jin!

Sorry for the delay; glad to hear you liked our work! I enjoyed reading ProTrek, and I think this methodology would be a really complementary fit for producing rigorous evaluations and guarantees.

Based on what you described, your setting seems analogous to the style of risk control we employ in the paper, except here the query q is the protein (x), and the lookup database is your set of natural language descriptions y = {y_1, y_2, ..., y_n}, yielding cosine similarity scores in [-1, 1] (or rescaled to [0, 1]). You should be able to use the exact same pipeline as we do, where:

  1. Some subset of the 4,000 proteins (which have ground-truth natural language descriptions, giving you a match function against the lookup set of all language descriptions) needs to go into your calibration set. Ideally, sample uniformly with respect to the natural language label abundances (i.e., so that your calibration set's distribution of language annotations is approximately the same as the test set's), and pick some n > 300-400 for calibration. Repeat this shuffling of the 4,000 into calibration and test over multiple trials.
  2. From there, you just need a similarity matrix of all scores for all proteins, with shape (4000, num_language_descriptions). Sort each row from highest to lowest similarity, and store the re-arranged indices in a separate matrix.
  3. Presuming you have the labels for each of the 4,000: score matches by sorting from greatest to least similarity, then find a threshold score $\lambda_s$ that controls the loss you wish to bound (e.g., "I want false discoveries of natural language descriptions to be no more than 10% of the retrieved set"). The sketches right after this list show one way to implement these steps.
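To make step 2 concrete, here's a rough numpy sketch of building the similarity matrix and the sorted-index matrix. The names `protein_embs` and `text_embs` are hypothetical; I'm assuming you can export ProTrek's protein and text embeddings as arrays:

```python
import numpy as np

def score_and_sort(protein_embs: np.ndarray, text_embs: np.ndarray):
    """protein_embs: (num_proteins, d); text_embs: (num_descriptions, d).
    Returns the cosine-similarity matrix plus per-row scores and indices
    sorted from highest to lowest similarity."""
    p = protein_embs / np.linalg.norm(protein_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = p @ t.T                           # (num_proteins, num_descriptions)
    sorted_idx = np.argsort(-scores, axis=1)   # descending order per row
    sorted_scores = np.take_along_axis(scores, sorted_idx, axis=1)
    return scores, sorted_scores, sorted_idx
```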
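And here's one hedged way to do the calibration in steps 1 and 3. One caveat: the false discovery proportion isn't monotone in the threshold, so instead of a plain conformal risk control bound this sketch uses Learn-then-Test-style fixed-sequence testing with Hoeffding p-values, which is a standard recipe for risks like FDR. Treat it as illustrative rather than our exact code: `scores` comes from the sketch above, `matches` is a hypothetical boolean matrix (see the match-function question at the end of this comment), and `alpha`/`delta` are your target risk level and error budget.

```python
import numpy as np

def fdp(score_row, match_row, lam):
    """False discovery proportion for one query at threshold lam."""
    retrieved = score_row >= lam
    if not retrieved.any():
        return 0.0
    return float((~match_row[retrieved]).mean())

def calibrate_lambda(cal_scores, cal_matches, alpha=0.10, delta=0.05):
    """Scan thresholds from strictest (lam = 1, nothing retrieved) downward,
    lowering lam while we can still reject H0: FDR(lam) > alpha at level
    delta (fixed-sequence testing with Hoeffding p-values)."""
    n = len(cal_scores)
    best = 1.0
    # Grid assumes scores rescaled to [0, 1]; widen to [-1, 1] for raw cosine.
    for lam in np.linspace(1.0, 0.0, 201):
        risk_hat = np.mean([fdp(s, m, lam) for s, m in zip(cal_scores, cal_matches)])
        if risk_hat > alpha:
            break
        p = np.exp(-2 * n * (alpha - risk_hat) ** 2)  # Hoeffding p-value for H0
        if p > delta:
            break
        best = lam
    return best

# Usage sketch: repeat the calibration/test split over multiple trials
# (step 1 above) with a hypothetical `matches[i, j] = True iff description j
# is correct for protein i`, then check the realized test risk.
rng = np.random.default_rng(0)
n_cal, n_trials = 500, 100
test_risks = []
for _ in range(n_trials):
    perm = rng.permutation(scores.shape[0])
    cal, test = perm[:n_cal], perm[n_cal:]
    lam_hat = calibrate_lambda(scores[cal], matches[cal])
    test_risks.append(np.mean([fdp(scores[i], matches[i], lam_hat) for i in test]))
print(f"mean test FDR over {n_trials} trials: {np.mean(test_risks):.3f}")
```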

In our code, we store lists where the length is the number of queries, and each element is a dictionary containing the sorted similarity scores, label matches, indices, etc., to make calibration and testing easy, but there are many ways to do it. Here is a notebook that shows how we pre-process the data for calibration in the Pfam study of the paper.
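As a toy illustration (these are not the exact keys our repo uses), that structure could look like:

```python
import numpy as np

# Assumes `scores` and `matches` from the sketches above; one dict per query.
results = []
for i in range(scores.shape[0]):
    order = np.argsort(-scores[i])           # highest -> lowest similarity
    results.append({
        "sorted_scores": scores[i][order],
        "sorted_indices": order,              # which description each score refers to
        "label_matches": matches[i][order],   # True where the description is correct
    })
```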

The main question to decide before using this is what constitutes a match - are there multiple acceptable descriptions for each protein? Happy to help brainstorm further and help you integrate this methodology into your models.
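If multiple descriptions are acceptable, the match function just becomes set membership. A hypothetical sketch, where `acceptable[i]` is whatever set of description indices you deem correct for protein i:

```python
import numpy as np

def build_match_matrix(acceptable, num_descriptions):
    """acceptable: list of sets of description indices, one set per protein,
    possibly with several correct descriptions each."""
    matches = np.zeros((len(acceptable), num_descriptions), dtype=bool)
    for i, idx_set in enumerate(acceptable):
        matches[i, list(idx_set)] = True
    return matches
```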