This repo contains a Java tokenizer for the RoBERTa model. The implementation mainly follows the HuggingFace Python RoBERTa tokenizer, but we also drew on other implementations, as mentioned in the code and below:
The algorithm used is a byte-level Byte Pair Encoding.
https://huggingface.co/docs/transformers/tokenizer_summary#bytelevel-bpe
- Clone the repo for explicit usage.
- Add the Maven dependency to your pom.xml for usage in your project:
<dependency>
<groupId>cloud.genesys</groupId>
<artifactId>roberta-tokenizer</artifactId>
<version>1.0.7</version>
</dependency>
<distributionManagement>
<repository>
<id>ossrh</id>
<url>https://s01.oss.sonatype.org/service/local/staging/deploy/maven2/</url>
</repository>
...
</distributionManagement>
- Unit tests - Run on local machine.
To keep tokenizer initialization efficient, we use a factory that creates the relevant resource files "lazily".
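The idea of lazy creation can be sketched as follows. This is a minimal illustration only, not the repo's actual factory: the class names LazySketch and Lazy, the loadCount counter, and the placeholder "vocabulary contents" string are all hypothetical.

```java
import java.util.function.Supplier;

public class LazySketch {

    // Hypothetical helper: wraps a loader and runs it only on the first get().
    public static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> loader;
        private volatile T value;

        public Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        @Override
        public T get() {
            T result = value;
            if (result == null) {
                synchronized (this) {
                    result = value;
                    if (result == null) {
                        // First access: run the (possibly expensive) loader once.
                        result = loader.get();
                        value = result;
                    }
                }
            }
            return result;
        }
    }

    // Counts how often the loader ran, for demonstration only.
    public static int loadCount = 0;

    public static void main(String[] args) {
        // Assumption: in a real setup the loader would read a data file.
        Lazy<String> vocabulary = new Lazy<>(() -> {
            loadCount++;
            return "vocabulary contents";
        });

        System.out.println("loads before first get: " + loadCount); // 0
        vocabulary.get();
        vocabulary.get();
        System.out.println("loads after two gets: " + loadCount);   // 1
    }
}
```

Constructing the wrapper is cheap; the cost of loading is paid once, on first use, and later calls reuse the cached value.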
For this tokenizer we need 3 data files:
- base_vocabulary.json - a map of numbers ([0,255]) to symbols (Unicode characters). Only these symbols are known to the algorithm. For example, given a String s as input, it iterates over the bytes of s and replaces each byte with its mapped symbol. This way we control which characters are passed on.
- vocabulary.json - a file that holds all the words (sub-words) and their tokens according to training.
- merges.txt - describes the merge rules of words. The algorithm splits the given word into two sub-words, then decides the best split according to the rank of the sub-words. The higher those sub-words appear in the file, the higher their rank.
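How the byte mapping and the merge rules work together can be sketched as below. This is a simplified illustration under stated assumptions, not the repo's code: the class BpeSketch and its methods are hypothetical, the byte table is built with the GPT-2 style scheme used by the HuggingFace implementation (whether base_vocabulary.json holds exactly this table is an assumption), and the three merge rules are made up for the demo.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BpeSketch {

    // Stand-in for base_vocabulary.json: printable bytes map to themselves,
    // all other bytes map to code points starting at 256 (GPT-2 style).
    public static Map<Integer, String> buildByteToSymbol() {
        Map<Integer, String> table = new HashMap<>();
        int next = 0;
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~')
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            int codePoint = printable ? b : 256 + next++;
            table.put(b, new String(Character.toChars(codePoint)));
        }
        return table;
    }

    // Replaces every UTF-8 byte of the input with its mapped symbol.
    public static List<String> toSymbols(String word, Map<Integer, String> table) {
        List<String> symbols = new ArrayList<>();
        for (byte raw : word.getBytes(StandardCharsets.UTF_8)) {
            symbols.add(table.get(raw & 0xFF));
        }
        return symbols;
    }

    // Repeatedly merges the adjacent pair with the best (lowest) rank,
    // where rank 0 is the first rule in the merges list.
    public static List<String> applyMerges(List<String> symbols, Map<String, Integer> ranks) {
        List<String> current = new ArrayList<>(symbols);
        while (current.size() > 1) {
            int bestRank = Integer.MAX_VALUE;
            int bestIndex = -1;
            for (int i = 0; i < current.size() - 1; i++) {
                Integer rank = ranks.get(current.get(i) + " " + current.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) {
                break; // no known merge rule applies any more
            }
            String left = current.get(bestIndex);
            String right = current.get(bestIndex + 1);
            List<String> merged = new ArrayList<>();
            int i = 0;
            while (i < current.size()) {
                boolean pairHere = i < current.size() - 1
                        && current.get(i).equals(left)
                        && current.get(i + 1).equals(right);
                if (pairHere) {
                    merged.add(left + right);
                    i += 2;
                } else {
                    merged.add(current.get(i));
                    i++;
                }
            }
            current = merged;
        }
        return current;
    }

    public static void main(String[] args) {
        // Made-up merge rules in merges.txt line order (earlier line = better rank).
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("l o", 0);
        ranks.put("lo w", 1);
        ranks.put("e r", 2);

        List<String> symbols = toSymbols(" lower", buildByteToSymbol());
        System.out.println(applyMerges(symbols, ranks));
    }
}
```

With these demo rules, the input " lower" ends up as three sub-words: the symbol for the leading space (U+0120 in this table), "low", and "er". A real tokenizer would then look each sub-word up in vocabulary.json to obtain its token.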
Please note:
- All three files must be under the same directory.
- They must be named as mentioned above.
- The result of the tokenization depends on the vocabulary and merges files.
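Before constructing the tokenizer, it can help to verify that the directory satisfies the naming and location requirements above. The following is a small optional sketch; the class ResourceDirCheck and the method missingFiles are hypothetical helpers and not part of this library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResourceDirCheck {

    // The three file names the tokenizer expects, as listed above.
    public static final List<String> REQUIRED =
            Arrays.asList("base_vocabulary.json", "vocabulary.json", "merges.txt");

    // Returns the names of required files that are absent from baseDir.
    public static List<String> missingFiles(Path baseDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(baseDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Demo on an empty temporary directory instead of a real resource directory.
        Path dir = Files.createTempDirectory("roberta-resources");
        System.out.println(missingFiles(dir)); // all three names

        Files.createFile(dir.resolve("merges.txt"));
        System.out.println(missingFiles(dir)); // the two JSON files
    }
}
```

An empty result means all three files are present under the same directory with the expected names.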
String baseDirPath = "base/dir/path";
RobertaTokenizerResources robertaResources = new RobertaTokenizerResources(baseDirPath);
Tokenizer robertaTokenizer = new RobertaTokenizer(robertaResources);
...
String sentence = "this must be the place";
long[] tokenizedSentence = robertaTokenizer.tokenize(sentence);
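// Print the array contents; printing the array directly would only show its reference.
System.out.println(java.util.Arrays.toString(tokenizedSentence));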
An example output would be: [0, 9226, 531, 28, 5, 317, 2]
- The exact values depend on the given vocabulary and merges files.
- Use temporary branches for every issue/task.