pisa-engine/pisa

BERT tokens

JMMackenzie opened this issue · 5 comments

Describe the solution you'd like
Currently, PISA does not readily support BERT WordPiece tokens such as `exam ##ple`, because the `##` prefix gets eaten by the tokenizer.

We should support a command-line flag like `--pretokenized` (similar to Anserini) to tell the tokenizer to simply split on whitespace and do nothing more.
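For illustration, here is a minimal sketch of the intended behaviour in plain C++ (not PISA's actual tokenizer interface; the function name is just for the example): split only on whitespace and keep everything else intact, so wordpiece tokens like `##ing` survive.

```cpp
// Sketch only: a whitespace-only pass that keeps "##"-prefixed tokens intact.
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> whitespace_tokenize(std::string_view text)
{
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Skip any run of whitespace.
        while (pos < text.size() && std::isspace(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        }
        // Consume a maximal run of non-whitespace characters as one token.
        std::size_t start = pos;
        while (pos < text.size() && !std::isspace(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        }
        if (pos > start) {
            tokens.emplace_back(text.substr(start, pos - start));
        }
    }
    return tokens;
}

// whitespace_tokenize("fish ##ing locations") -> {"fish", "##ing", "locations"}
```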

Checklist

  • Implement whitespace tokenizer (#496)
  • Allow for choosing tokenizer at query time (#499)
  • Allow for choosing tokenizer at indexing time

@JMMackenzie Do you by any chance have some Anserini docs on how this is implemented? I'm not that familiar with BERT, and I'd love to understand it a bit more.

If you check this commit, you will see that they basically just instantiate a "whitespace analyzer", which does what it says on the tin: castorini/anserini@14b315d

This boils down to something like this: https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html

> A tokenizer that divides text at whitespace characters as defined by [Character.isWhitespace(int)](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html?is-external=true#isWhitespace-int-). Note: That definition explicitly excludes the non-breaking space. Adjacent sequences of non-Whitespace characters form tokens.

I think for our intents and purposes, we can just tokenize directly on whitespace. The only problem may be whether the lexicon tooling handles stored special characters correctly, but I don't see why it wouldn't work. Any thoughts?

Basically, this enhancement is for cases where we are ingesting a learned sparse index, either from JSONL or from another IR toolkit like Anserini/Terrier (perhaps via CIFF), whose vocabulary looks like:

```
##ing
...
fish
...
```

And then at query time we might see `101: fish ##ing locations` or something like that. This example is just made up, but it should explain what we need.

I think PISA would currently turn that query into `fish ing locations` and then either match `ing` against the wrong token or just not find it at all.
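To make that concrete, here is a rough, self-contained comparison; the `split_if` helper and the alphanumeric-only rule below are just stand-ins for illustration, not PISA's actual parser, but the effect on `##` prefixes is the same.

```cpp
// Rough illustration of the failure mode described above.
#include <cctype>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

template <typename Pred>
std::vector<std::string> split_if(std::string_view text, Pred is_delim)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char ch : text) {
        if (is_delim(static_cast<unsigned char>(ch))) {
            if (!current.empty()) {
                tokens.push_back(std::move(current));
                current.clear();
            }
        } else {
            current.push_back(ch);
        }
    }
    if (!current.empty()) {
        tokens.push_back(std::move(current));
    }
    return tokens;
}

int main()
{
    std::string_view query = "fish ##ing locations";
    // Default-style parsing drops the "##", leaving a bare "ing".
    auto standard = split_if(query, [](unsigned char c) { return std::isalnum(c) == 0; });
    // --pretokenized parsing breaks on whitespace only, keeping "##ing".
    auto pretokenized = split_if(query, [](unsigned char c) { return std::isspace(c) != 0; });
    for (auto const& t : standard) { std::cout << t << ' '; }      // fish ing locations
    std::cout << '\n';
    for (auto const& t : pretokenized) { std::cout << t << ' '; }  // fish ##ing locations
    std::cout << '\n';
}
```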

Ah, OK, so this would be an alternative parsing, correct? When `--pretokenized` is passed, we break on spaces; otherwise, business as usual?

As for the lexicon, I don't see why it wouldn't work either. There's really nothing special about "special" characters like #. It's all just bytes.
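A toy check along those lines, treating the lexicon as nothing more than a sorted list of byte strings (the real lexicon layout is different, but the byte-ordering argument is the same):

```cpp
// '#' is just byte 0x23, which sorts before letters and digits, so
// wordpiece terms simply cluster at the front of a byte-sorted lexicon.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> lexicon = {"fish", "##ing", "locations", "##s"};
    std::sort(lexicon.begin(), lexicon.end());  // byte-wise (lexicographic) order

    // Term IDs are just positions in the sorted lexicon.
    auto term_id = [&](std::string const& term) -> std::int64_t {
        auto it = std::lower_bound(lexicon.begin(), lexicon.end(), term);
        if (it == lexicon.end() || *it != term) { return -1; }
        return it - lexicon.begin();
    };

    std::cout << term_id("##ing") << '\n';  // found: 0
    std::cout << term_id("fish") << '\n';   // found: 2
    std::cout << term_id("ing") << '\n';    // not found: -1
}
```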

If you have access to, or can get your hands on, a CIFF file built this way (preferably not too large), it would be good to have it for some sanity checks beyond any unit/integration tests we may write.

Sure, I can generate a CIFF file if that would help!