Better alignment than "per-token" for SMC
alex-lew commented
When to resample in SMC? Currently particles are aligned by number-of-tokens, so when we resample, all particles have the same number of tokens (unless some have already hit EOS). But this isn't really fair. For example:
- When intersecting "My favorite physicist is" and "My favorite writer is", we end up comparing particles that say, e.g., " Richard Feynman. He was" and " Neil deGrasse Tyson" -- when we really want to compare " Richard Feynman" to " Neil deGrasse Tyson".
- When intersecting "A great personal finance tip is" and "A great tip for healthy living is", we end up comparing particles that say, e.g., " to avoid eating out" and " to make sure you're". The former loses out, intuitively because its weight already factors in the semantic constraints, whereas the constraints largely 'withhold judgment' on the vaguer latter particle.
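To make the current behavior concrete, here's a toy sketch of a token-aligned SMC loop (all names are hypothetical stand-ins; a real implementation would score each token under the intersected constraint models):

```python
import random

def resample(particles, weights):
    """Multinomial resampling: draw N particles proportional to weight."""
    total = sum(weights)
    return [list(p) for p in random.choices(
        particles, weights=[w / total for w in weights], k=len(particles))]

def token_aligned_smc(step_fn, n_particles, n_steps):
    """Toy SMC loop that resamples after every token (the current behavior).

    `step_fn(particle)` is a hypothetical one-token extender returning
    (new_particle, incremental_weight).
    """
    particles = [[] for _ in range(n_particles)]
    for _ in range(n_steps):
        extended, weights = [], []
        for p in particles:
            new_p, w = step_fn(p)
            extended.append(new_p)
            weights.append(w)
        # Resampling here compares particles with equal token counts,
        # even if they sit at different "semantic" positions -- which is
        # exactly the unfairness described above.
        particles = resample(extended, weights)
    return particles
```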
It would be great to find a clear theoretical framework for thinking about these intermediate distributions, and other heuristics (or principled strategies) for alignment.
One heuristic worth trying might be to resample at syntax-directed points -- at the end of each sentence, clause, or some other grammatical element.
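That heuristic might look something like the following toy sketch, where resampling happens only once every particle has reached a sentence boundary (the boundary-token set and step function are hypothetical stand-ins; a real version would detect clause or sentence boundaries properly):

```python
import random

SENTENCE_END = {".", "!", "?"}  # hypothetical boundary-token set

def syntax_aligned_smc(step_fn, n_particles, n_sentences):
    """Toy SMC loop that resamples at sentence boundaries instead of per
    token. `step_fn(particle)` is a hypothetical one-token extender
    returning (new_particle, incremental_weight).
    """
    particles = [[] for _ in range(n_particles)]
    for _ in range(n_sentences):
        weights = []
        for i in range(n_particles):
            # Advance this particle until it closes a sentence,
            # accumulating its incremental weight along the way.
            p, w = step_fn(particles[i])
            while p[-1] not in SENTENCE_END:
                p, dw = step_fn(p)
                w *= dw
            particles[i] = p
            weights.append(w)
        # All particles now sit at the same grammatical boundary,
        # even though their token counts may differ.
        total = sum(weights)
        particles = [list(p) for p in random.choices(
            particles, weights=[w / total for w in weights], k=n_particles)]
    return particles
```

One design question this raises: particles now do differing amounts of work between resampling steps, so the loop above advances each particle independently rather than in lockstep.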