/Tokenization

some tests with tokenization

Primary LanguagePython

Some tests with tokenization based on the *Improving the tokenisation of identifier names* paper

steps in the paper:

1. Identifier names are first tokenised using boundaries marked by separator
characters or the transitions between letters and digits.
2. The tokens from step 1 are investigated for the presence of changes from
lower case to upper case (the primary internal capitalisation boundary) and
split on those boundaries.
3. Tokens found to contain the UCLC boundary – as found in HTMLEditor –
are investigated using an oracle to determine whether splitting the token
following the penultimate upper case letter, or at the change from upper to
lower case results in a better tokenisation.
4. Each token is investigated using a recursive algorithm with the support of
an oracle to determine whether it can be divided further.

-----------------------------------------------------------------------------
1. Identifier names are tokenised using separator characters and the internal
capitalisation boundaries.
2. Any token containing the UCLC boundary is tokenised with the support of
an oracle.
3. Any identifier names with tokens containing digits are reviewed and to-
kenised using an oracle and a set of heuristics.
4. Any identifier name composed of a single token is investigated to determine
whether it is a recognised word or a neologism constructed from the simple
addition of known prefixes and suffixes to a recognised word.
5. Any remaining single token identifier names are tokenised by recursive al-
gorithms. Candidate tokenisations are investigated to reduce oversplitting,
before being scored with weight being given to tokens found in the project-
specific vocabulary.