Feature: Generic algorithm for word-tokens, Unicode, and many other encodings including non-text sequences.
qwertystop opened this issue · 6 comments
Word-tokens and Unicode were requested by @jcjohnson in #178. I see a potential algorithm by which to get both at once, as well as many other encodings (including but not limited to non-text encodings). It is as follows:
The preprocessor has a maximum token length, `M`. It consumes a fixed number of bytes at once, `B` (e.g. `B = 2` for UTF-16). It can be given a list of specific values or ranges of values, each associated with a different token length `M'`. When a listed value is encountered in input, `M` is reassigned to `M'`; non-listed values leave `M` at whatever it was last set to. Each time a new value `S` is read from input:
- Compare `M` to the length of the token `T` accumulated so far, plus the length of `S`.
- If `M = len(T+S)`, the token is over: store `T+S` in the numpy array and start a new, empty `T`.
- If `M > len(T+S)`, assign `T+S` to `T` and go on to the next byte.
- If `M < len(T+S)`, this is an encoding error in some cases (single-codepoint tokens in UTF-8 or UTF-16), but a character used as a separator in others (word tokens). We could have a flag for whether to raise an exception or treat it as the end of a token, or just always treat it as the end. If treating it as the end of a token, store `T` in the numpy array, store `S` in the array, and proceed.
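A minimal Python sketch of these rules, to make the loop concrete (the function and parameter names are only illustrative, not anything already in the repo; it takes the "always treat it as the end of a token" option, and because Python's `range()` excludes its stop value, the tables it expects list each upper bound plus one):

```python
def tokenize(data, unit_size, length_table, default_length=1, byteorder="big"):
    """Split `data` (bytes) into tokens following the rules above.

    `length_table` maps range(...) objects over unit values to token
    lengths, measured in units of `unit_size` bytes.
    """
    tokens = []                  # finished tokens
    current = b""                # the token accumulated so far (T)
    max_len = default_length     # the current maximum token length (M)

    for offset in range(0, len(data), unit_size):
        unit = data[offset:offset + unit_size]        # the new value (S)
        value = int.from_bytes(unit, byteorder)
        for values, length in length_table.items():   # reassign M if S is listed
            if value in values:
                max_len = length
                break
        combined = current + unit
        combined_len = len(combined) // unit_size     # len(T+S), in units
        if combined_len == max_len:
            tokens.append(combined)                   # token complete
            current = b""
        elif combined_len < max_len:
            current = combined                        # keep accumulating
        else:
            # len(T+S) > M: encoding error or separator; treat it as the end
            # of the previous token and store the separator on its own.
            if current:
                tokens.append(current)
            tokens.append(unit)
            current = b""
    if current:
        tokens.append(current)                        # flush a trailing partial token
    return tokens
```

Token lengths here are counted in units of `unit_size` bytes, so `M = 2` for a UTF-16 surrogate pair means four bytes.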
These rules, taken together in various configurations, allow the following (ranges are inclusive at both ends):
- UTF-8: `B = 1`, length table `{range(0x00, 0x7F): 1, range(0xC0, 0xDF): 2, range(0xE0, 0xEF): 3, range(0xF0, 0xFF): 4}`.
- UTF-16: `B = 2`, length table `{range(0x0000, 0xD7FF): 1, range(0xE000, 0xFFFF): 1, range(0xD800, 0xDFFF): 2}`.
- UTF-32: `B = 4`, empty length table, default `M = 1`.
- Whitespace-and-punctuation-separated words: `B` as appropriate for the encoding; the table maps each whitespace/punctuation character on which you split to 1, and all other characters to the length of the longest word you're willing to store (see the example after this list for a concrete table).
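For instance, the UTF-8 and word-splitting configurations could be fed to the `tokenize` sketch above like this (the 16-byte word limit and the particular separator set are arbitrary choices for the example):

```python
# Reuses the tokenize() sketch from above. Python's range() stop is
# exclusive, so each inclusive upper bound is written as bound + 1.
UTF8_TABLE = {
    range(0x00, 0x80): 1,    # ASCII byte: 1-byte token
    range(0xC0, 0xE0): 2,    # lead byte of a 2-byte sequence
    range(0xE0, 0xF0): 3,    # lead byte of a 3-byte sequence
    range(0xF0, 0x100): 4,   # lead byte of a 4-byte sequence
}

WORD_TABLE = {
    range(0x09, 0x0E): 1,    # tab, newline, etc.: 1-byte separator tokens
    range(0x20, 0x21): 1,    # space
    range(0x00, 0x100): 16,  # everything else: words up to 16 bytes (first match wins)
}

print(tokenize("héllo".encode("utf-8"), unit_size=1, length_table=UTF8_TABLE))
# [b'h', b'\xc3\xa9', b'l', b'l', b'o']

print(tokenize(b"the quick fox", unit_size=1, length_table=WORD_TABLE, default_length=16))
# [b'the', b' ', b'quick', b' ', b'fox']
```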
More generally, any fixed-width encoding, text or otherwise, can be handled in the same fashion as UTF-32. Any variable-width encoding which marks the length of a token with a fixed-length subsequence at its start can be handled in the same fashion as UTF-8 and UTF-16. Any variable-width encoding which marks the length of a token with a fixed-length subsequence at its end can be handled in the same manner as whitespace-and-punctuation-separated words.
That should be enough to process arbitrary sequential data, such as constant-bit-rate audio.
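As a toy illustration of the fixed-width case, again reusing the hypothetical `tokenize` sketch (the PCM framing here is made up purely for the example):

```python
import struct

# Toy stand-in for constant-bit-rate audio: eight 16-bit little-endian samples.
raw_pcm = struct.pack("<8h", 0, 300, -300, 1200, -1200, 500, -500, 0)

# Handled like UTF-32: fixed-width units, empty table, one unit per token.
samples = tokenize(raw_pcm, unit_size=2, length_table={}, default_length=1)
print(len(samples))  # 8 tokens of 2 bytes each
```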
I do intend to code this up myself soon, but thought it might be a good idea to put the algorithm out there first, in case I'm missing something that should have been obvious, and also in case I forget about it before I get the chance to do so.
I already have a reimplementation of the preprocessor, available in PR #132, that does some of this, including text culling. I am working on expanding that functionality as we speak, so maybe we could start with that as a base.
After a bit of examining my own code, it turns out that my preprocess.py script in PR #132 already handles UTF-16 and UTF-32 (both endiannesses).
Constant-bit-rate audio is a terrible input to feed directly to an NN; I would recommend using some sort of encoding scheme to extract features over time steps.
That was mostly just the first example that came to mind for a sequential data type.
I think there's a decent chance an NN could make something of it, though. You'd need to give it a much longer memory than for text, of course. If you extract features, you've basically switched over to using MIDIs or sheet music derived from the recording: an entirely valid target, and one on which fine work has been done, but not really the same.
The main issue is that you need a lot more long-term coherence for audio samples to form decent music than for text to form decent writing. But again, that might just be an issue of providing a long enough memory.
The script I wrote uses the Codec library to decode tokens, and it accepts extendable codecs. Therefore, all you would have to do is write a codec to decode whatever you want. I think that is a reasonable way to accept arbitrary data.
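Assuming "the Codec library" refers to Python's built-in `codecs` module, registering a custom codec looks roughly like this (the codec name and the transform are placeholders, not anything from the PR):

```python
import codecs

def _search(name):
    # Only answer for our illustrative codec name.
    if name != "my_raw_format":
        return None

    def encode(text, errors="strict"):
        # Placeholder transform: a real codec would map text to the target format.
        data = text.encode("utf-8")
        return data, len(text)

    def decode(data, errors="strict"):
        text = bytes(data).decode("utf-8")
        return text, len(data)

    return codecs.CodecInfo(encode, decode, name="my_raw_format")

codecs.register(_search)

print(codecs.decode(b"arbitrary bytes", "my_raw_format"))
```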
Entirely reasonable, yes. My algorithm seems to be redundant overkill compared to your script, then. Should I close this, or should that wait until your PR gets merged?
Whatever you want. I'm going to wait to merge until py3 support is added in.