Feature: Generic algorithm for word-tokens, Unicode, and many other encodings including non-text sequences.
qwertystop opened this issue · 6 comments
Word-tokens and Unicode were requested by @jcjohnson in #178. I see a potential algorithm by which to get both at once, as well as many other encodings (including but not limited to non-text encodings). It is as follows:
The preprocessor has a maximum token length, `M`. It consumes a fixed number of bytes at once, `B` (e.g. `B = 2` for UTF-16). It can be given a list of specific values or ranges of values, each associated with a different token length `M'`. When a listed value is encountered in input, `M` is reassigned to `M'`; non-listed values leave `M` at whatever it was last set to. Each time a new value `S` is read from input:
- Compare `M` to the length of the token `T` accumulated so far, plus the length of `S`.
- If `M = len(T+S)`, the token is over: store `T+S` in the numpy array and start a new, empty `T`.
- If `M > len(T+S)`, assign `T+S` to `T` and go on to the next byte.
- If `M < len(T+S)`, this is an encoding error in some cases (single-codepoint tokens in UTF-8 or UTF-16), but a character used as a separator in others (word tokens). We could have a flag for whether to raise an exception or treat it as the end of a token, or just always treat it as the end. If treating it as the end of a token, store `T` in the numpy array, store `S` in the array, and proceed.
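A minimal Python sketch of these rules, to make the loop concrete (the function and parameter names are only illustrative, not anything already in the repo; it takes the "always treat it as the end of a token" option, and because Python's `range()` excludes its stop value, the tables it expects list each upper bound plus one):

```python
def tokenize(data, unit_size, length_table, default_length=1, byteorder="big"):
    """Split `data` (bytes) into tokens following the rules above.

    `length_table` maps range(...) objects over unit values to token
    lengths, measured in units of `unit_size` bytes.
    """
    tokens = []                  # finished tokens
    current = b""                # the token accumulated so far (T)
    max_len = default_length     # the current maximum token length (M)

    for offset in range(0, len(data), unit_size):
        unit = data[offset:offset + unit_size]        # the new value (S)
        value = int.from_bytes(unit, byteorder)
        for values, length in length_table.items():   # reassign M if S is listed
            if value in values:
                max_len = length
                break
        combined = current + unit
        combined_len = len(combined) // unit_size     # len(T+S), in units
        if combined_len == max_len:
            tokens.append(combined)                   # token complete
            current = b""
        elif combined_len < max_len:
            current = combined                        # keep accumulating
        else:
            # len(T+S) > M: encoding error or separator; treat it as the end
            # of the previous token and store the separator on its own.
            if current:
                tokens.append(current)
            tokens.append(unit)
            current = b""
    if current:
        tokens.append(current)                        # flush a trailing partial token
    return tokens
```

Token lengths here are counted in units of `unit_size` bytes, so `M = 2` for a UTF-16 surrogate pair means four bytes.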
These rules, taken together in various configurations, allow the following (ranges are inclusive at both ends):
- UTF-8: `B = 1`, length table `{range(0x00, 0x7F): 1, range(0xC0, 0xDF): 2, range(0xE0, 0xEF): 3, range(0xF0, 0xFF): 4}`.
- UTF-16: `B = 2`, length table `{range(0x0000, 0xD7FF): 1, range(0xE000, 0xFFFF): 1, range(0xD800, 0xDFFF): 2}`.
- UTF-32: `B = 4`, empty length table, default `M = 1`.
- Whitespace-and-punctuation-separated words: `B` as appropriate for the encoding; the table maps each whitespace/punctuation character on which you split to 1, and all other characters to the length of the longest word you're willing to store (see the example after this list for a concrete table).
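For instance, the UTF-8 and word-splitting configurations could be fed to the `tokenize` sketch above like this (the 16-byte word limit and the particular separator set are arbitrary choices for the example):

```python
# Reuses the tokenize() sketch from above. Python's range() stop is
# exclusive, so each inclusive upper bound is written as bound + 1.
UTF8_TABLE = {
    range(0x00, 0x80): 1,    # ASCII byte: 1-byte token
    range(0xC0, 0xE0): 2,    # lead byte of a 2-byte sequence
    range(0xE0, 0xF0): 3,    # lead byte of a 3-byte sequence
    range(0xF0, 0x100): 4,   # lead byte of a 4-byte sequence
}

WORD_TABLE = {
    range(0x09, 0x0E): 1,    # tab, newline, etc.: 1-byte separator tokens
    range(0x20, 0x21): 1,    # space
    range(0x00, 0x100): 16,  # everything else: words up to 16 bytes (first match wins)
}

print(tokenize("héllo".encode("utf-8"), unit_size=1, length_table=UTF8_TABLE))
# [b'h', b'\xc3\xa9', b'l', b'l', b'o']

print(tokenize(b"the quick fox", unit_size=1, length_table=WORD_TABLE, default_length=16))
# [b'the', b' ', b'quick', b' ', b'fox']
```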
More generally, any fixed-width encoding, text or otherwise, can be handled in the same fashion as UTF-32. Any variable-width encoding which marks the length of a token with a fixed-length subsequence at its start can be handled in the same fashion as UTF-8 and UTF-16. Any variable-width encoding which marks the length of a token with a fixed-length subsequence at its end can be handled in the same manner as whitespace-and-punctuation-separated words.
That should be enough to process arbitrary sequential data, such as constant-bit-rate audio.
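As a toy illustration of the fixed-width case, again reusing the hypothetical `tokenize` sketch (the PCM framing here is made up purely for the example):

```python
import struct

# Toy stand-in for constant-bit-rate audio: eight 16-bit little-endian samples.
raw_pcm = struct.pack("<8h", 0, 300, -300, 1200, -1200, 500, -500, 0)

# Handled like UTF-32: fixed-width units, empty table, one unit per token.
samples = tokenize(raw_pcm, unit_size=2, length_table={}, default_length=1)
print(len(samples))  # 8 tokens of 2 bytes each
```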
I do intend to code this up myself soon, but thought it might be a good idea to put the algorithm out there first, in case I'm missing something that should have been obvious, and also in case I forget about it before I get the chance to do so.
I already have a reimplementation of the preprocessor, available in PR #132, that does some of this, including text culling. I am working on expanding that functionality as we speak, so maybe we could start with that as a base.
After a bit of examining my own code, it turns out that my preprocess.py script in PR #132 already handles UTF-16 and UTF-32 (both endiannesses).
Constant-bit-rate audio is a terrible input to feed directly to an NN; I would recommend using some sort of encoding scheme to extract features over time steps.
That was mostly just the first example that came to mind for a sequential data type.
I think there's a decent chance an NN could make something of it, though. You'd need to give it a much longer memory than for text, of course. If you extract features, you've basically switched over to using MIDIs or sheet music derived from the recording: an entirely valid target, and one on which fine work has been done, but not really the same.
The main issue is that you need a lot more long-term coherence for audio samples to form decent music than for text to form decent writing. But again, that might just be an issue of providing a long enough memory.
The script I wrote uses the Codec library to decode tokens, and it accepts extendable codecs. Therefore, all you would have to do is write a codec to decode whatever you want. I think that is a reasonable way to accept arbitrary data.
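Assuming "the Codec library" refers to Python's built-in `codecs` module, registering a custom codec looks roughly like this (the codec name and the transform are placeholders, not anything from the PR):

```python
import codecs

def _search(name):
    # Only answer for our illustrative codec name.
    if name != "my_raw_format":
        return None

    def encode(text, errors="strict"):
        # Placeholder transform: a real codec would map text to the target format.
        data = text.encode("utf-8")
        return data, len(text)

    def decode(data, errors="strict"):
        text = bytes(data).decode("utf-8")
        return text, len(data)

    return codecs.CodecInfo(encode, decode, name="my_raw_format")

codecs.register(_search)

print(codecs.decode(b"arbitrary bytes", "my_raw_format"))
```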
Entirely reasonable, yes. My algorithm seems to be redundant overkill compared to your script, then. Should I close this, or should that wait until your PR gets merged?
Whatever you want. I'm going to wait to merge until py3 support is added in.