vishakhpk/iter-extrapolation

AAV2 - Questions about creating pairs and evaluation from the paper

Closed this issue · 4 comments

Hello,
Thank you for sharing the code. I have a few questions about the AAV2 experiment from the paper.

Questions about AAV2 data creation

I noticed that /aav/create_pairs.py uses scored-train.json to create 1M training pairs.
However, I'm confused about how the mutations in scored-train.json were created.
Your paper mentions that the synthetic mutations for AAV2 were generated with the same strategy as for ACE2 (Section 7.1 - Training the editor). The ACE2 section (Section 6.1 - Training the editor) states:

We sample token masks from a Bernoulli distribution with (p = 0.8)

I couldn't find the code that creates these masked and in-filled pairs. Could you point me to the relevant code and, if possible, share the scored-train.json file?

Questions about AAV2 evaluation

In Section 7.1 - Inference it is mentioned that

We start from the wild-type and run inference on the ICE model as per Section 3.3. When using the scorer, we sample 5 generations, score them with fs, select the best one, and repeat for 10 iterations. For the scorer-free setup, we generate with a beam size of 5 for 10 iterations.

This strategy would yield only 50 mutations after 10 iterations. However, in Section 7.1 - Evaluation, you mention evaluating 10K proteins. How are these generated?

Apologies, the code base is a work in progress. The masking script for AAV is similar to this script in ACE2. The token mask ratio is a hyperparameter that you can vary here. I'll look to have the AAV version up soon too!
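Until the AAV script is up, here is a minimal sketch of the Bernoulli token masking described in the paper quote above. It is not the repository's script. It assumes that p is the probability that a position is masked (the quote does not say whether p refers to masking or keeping a token), and the `<mask>` symbol, the function name `bernoulli_mask`, and the example sequence are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; the repo's actual token may differ


def bernoulli_mask(sequence, p=0.8, seed=None):
    """Mask each token independently with probability p.

    Assumption: p is the masking probability, not the keep probability.
    Returns a list of tokens with the same length as the input.
    """
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so p=0 masks nothing and p=1 masks all.
    return [MASK if rng.random() < p else token for token in sequence]


# Arbitrary example string (the 20 amino-acid letters), not AAV2 data.
masked = bernoulli_mask("ACDEFGHIKLMNPQRSTVWY", p=0.8, seed=0)
print(" ".join(masked))
```

Varying `p` here plays the role of the token mask ratio hyperparameter mentioned above; the in-filling step that turns a masked sequence into a mutated one is not shown.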

For inference, each run of the iterative process yields a fixed set of output sequences (i.e. 10 iterations with beam search of size 5 = 50 sequences). To obtain the library of 10k sequences, we run multiple instances of inference, each starting at the wild-type, so in this case 200 different inference runs give you the desired library size.
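The arithmetic can be checked with a short sketch. The function names are hypothetical, and the count is taken before any deduplication, since this thread does not say whether repeated sequences across runs are removed.

```python
import math


def sequences_per_run(iterations=10, beam_size=5):
    """Sequences produced by one inference run: one beam per iteration."""
    return iterations * beam_size


def runs_needed(library_size, iterations=10, beam_size=5):
    """Independent runs from the wild-type needed to reach library_size."""
    return math.ceil(library_size / sequences_per_run(iterations, beam_size))


print(sequences_per_run())   # 50
print(runs_needed(10_000))   # 200
```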

Thank you for your prompt response. I would like to recreate the experiment. Would it be possible to share the scored-train.json file that you used to create the training pairs?

I think so! Can you email me at vishakh@nyu.edu?

Thank you so much. I sent you an email.