pisa-engine/pisa

BERT tokens

JMMackenzie opened this issue · 5 comments

Describe the solution you'd like
Currently, PISA does not readily support BERT WordPiece tokens such as `exam ##ple`, because the `##` prefix gets eaten by the tokenizer.

We should support a command-line flag like `--pretokenized` (similar to Anserini) to tell the tokenizer to simply split on whitespace and do nothing more.
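For illustration, here is a minimal sketch of the intended behaviour in plain C++ (not PISA's actual tokenizer interface; the function name is just for the example): split only on whitespace and keep everything else intact, so wordpiece tokens like `##ing` survive.

```cpp
// Sketch only: a whitespace-only pass that keeps "##"-prefixed tokens intact.
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> whitespace_tokenize(std::string_view text)
{
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Skip any run of whitespace.
        while (pos < text.size() && std::isspace(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        }
        // Consume a maximal run of non-whitespace characters as one token.
        std::size_t start = pos;
        while (pos < text.size() && !std::isspace(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        }
        if (pos > start) {
            tokens.emplace_back(text.substr(start, pos - start));
        }
    }
    return tokens;
}

// whitespace_tokenize("fish ##ing locations") -> {"fish", "##ing", "locations"}
```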

Checklist

  • Implement whitespace tokenizer (#496)
  • Allow for choosing tokenizer at query time (#499)
  • Allow for choosing tokenizer at indexing time

@JMMackenzie Do you by any chance have some Anserini docs on how this is implemented? I'm not that familiar with BERT, and I'd love to understand it a bit more.

If you check this commit, you will see that they basically just instantiate a "whitespace analyzer", which does what it says on the tin: castorini/anserini@14b315d

This boils down to something like this: https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html

> A tokenizer that divides text at whitespace characters as defined by [Character.isWhitespace(int)](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html?is-external=true#isWhitespace-int-). Note: That definition explicitly excludes the non-breaking space. Adjacent sequences of non-Whitespace characters form tokens.

I think for our intents and purposes, we can just tokenize directly on whitespace. The only problem may be whether the lexicon tooling handles stored special characters correctly, but I don't see why it wouldn't work. Any thoughts?

Basically, this enhancement is for cases where we are ingesting a learned sparse index, either from JSONL or from another IR toolkit like Anserini/Terrier (perhaps via CIFF), whose vocabulary looks like:

```
##ing
...
fish
...
```

And then at query time we might see `101: fish ##ing locations` or something like that. This example is just made up, but it should explain what we need.

I think PISA would currently turn that query into `fish ing locations` and then either match `ing` against the wrong token or just not find it at all.
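To make that concrete, here is a rough, self-contained comparison; the `split_if` helper and the alphanumeric-only rule below are just stand-ins for illustration, not PISA's actual parser, but the effect on `##` prefixes is the same.

```cpp
// Rough illustration of the failure mode described above.
#include <cctype>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

template <typename Pred>
std::vector<std::string> split_if(std::string_view text, Pred is_delim)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char ch : text) {
        if (is_delim(static_cast<unsigned char>(ch))) {
            if (!current.empty()) {
                tokens.push_back(std::move(current));
                current.clear();
            }
        } else {
            current.push_back(ch);
        }
    }
    if (!current.empty()) {
        tokens.push_back(std::move(current));
    }
    return tokens;
}

int main()
{
    std::string_view query = "fish ##ing locations";
    // Default-style parsing drops the "##", leaving a bare "ing".
    auto standard = split_if(query, [](unsigned char c) { return std::isalnum(c) == 0; });
    // --pretokenized parsing breaks on whitespace only, keeping "##ing".
    auto pretokenized = split_if(query, [](unsigned char c) { return std::isspace(c) != 0; });
    for (auto const& t : standard) { std::cout << t << ' '; }      // fish ing locations
    std::cout << '\n';
    for (auto const& t : pretokenized) { std::cout << t << ' '; }  // fish ##ing locations
    std::cout << '\n';
}
```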

Ah, OK, so this would be an alternative parsing, correct? When `--pretokenized` is passed, we break on spaces; otherwise, business as usual?

As for the lexicon, I don't see why it wouldn't work either. There's really nothing special about "special" characters like #. It's all just bytes.
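A toy check along those lines, treating the lexicon as nothing more than a sorted list of byte strings (the real lexicon layout is different, but the byte-ordering argument is the same):

```cpp
// '#' is just byte 0x23, which sorts before letters and digits, so
// wordpiece terms simply cluster at the front of a byte-sorted lexicon.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> lexicon = {"fish", "##ing", "locations", "##s"};
    std::sort(lexicon.begin(), lexicon.end());  // byte-wise (lexicographic) order

    // Term IDs are just positions in the sorted lexicon.
    auto term_id = [&](std::string const& term) -> std::int64_t {
        auto it = std::lower_bound(lexicon.begin(), lexicon.end(), term);
        if (it == lexicon.end() || *it != term) { return -1; }
        return it - lexicon.begin();
    };

    std::cout << term_id("##ing") << '\n';  // found: 0
    std::cout << term_id("fish") << '\n';   // found: 2
    std::cout << term_id("ing") << '\n';    // not found: -1
}
```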

If you have access to, or can get your hands on, a CIFF file built this way (preferably not too large), it would be good to have it for some sanity checks beyond any unit/integration tests we may write.

Sure, I can generate a CIFF file if that would help!