Sequence-Domain Encompassed Correlations (SDEC)

Official Repo v0.1.2

Notes

Yo! Sorry, this is currently a mess. SDEC was created in response to a challenge from @MarkRober to figure out some sequential steal encodings. I was really just running & gunning under a deadline, so a lot of the code here is ugly & confusing, but I'll be updating this repo from here on out.

So What is SDEC?

SDEC stands for Sequence-Domain Encompassed Correlations, & its purpose is to take the entire domain of a given series of sequences & encode its encompassed correlations. This is done by defining what's called a "resolution" to limit how many correlations you want your model to learn from.
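
To make "resolution" a bit more concrete, here's a rough sketch. It assumes a resolution is a list of window sizes (like the default `[2, 3]` in the example script) and that every contiguous window of that size counts as one candidate correlation. That's an assumption for illustration only: the helper name `windows_by_resolution` is made up, and SDEC's actual encoding may work differently.

```python
from collections import Counter


def windows_by_resolution(sequence, resolution):
    """Count every contiguous window of each size listed in `resolution`.

    Hypothetical helper for illustration only, not part of the SDEC API.
    """
    counts = Counter()
    for size in resolution:
        # Slide a window of `size` symbols across the sequence.
        # If the sequence is shorter than `size`, the range is empty.
        for start in range(len(sequence) - size + 1):
            counts[sequence[start:start + size]] += 1
    return counts


# "dfp" is the Hadouken pattern used by the example dataset.
counts = windows_by_resolution("ddfp", [2, 3])
print(sorted(counts))
```

With a bigger resolution list you get more (and longer) windows, which is why the resolution acts as a cap on how much the model has to learn from.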

At the moment, this idea is a bit clunky & not yet foolproof enough to be a useful contribution to the machine learning space, but I will continue to develop, test, & improve this implementation over time.

What you will need:

Need	Description
ML / TensorFlow / Keras Understanding	This isn't strictly needed, & you can probably get by without it, but you may hit a point where it's hard to debug what's going on without this knowledge. Just be warned :D
A Dataset	There is a simple classification example dataset located at Example/data. The left column is a sequence of anything (in this case button presses), & the right column is the binary label: whether or not the sequence has a Hadouken (dfp) in it. You can also supply your own dataset.
Neural Network Architecture	SDEC is simply a natural language encoder, so you will still need to optimize your own neural network architecture. You can modify it in the script Example/train.py at line 61.

How to install Dependencies:

pip install -r requirements.txt

How to encode a Dataset:

SDEC is made to work with text only
If you can encode all features of a single step in your time series (or text) into a single representation, SDEC should work well for you.
Your text must be encoded using ASCII (untested on UTF-8 or UTF-16 for now, might still work)
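
For example, if each step in your sequence is a button press, you could map every distinct event to one printable ASCII character. Here's a minimal sketch of that idea. The event names and the helpers `build_codebook` / `encode_sequence` are assumptions for illustration, not part of SDEC.

```python
import string

# Printable, non-whitespace ASCII characters that can be used as symbols.
# Whitespace is left out so a symbol can never be mistaken for a separator.
SYMBOLS = string.ascii_letters + string.digits + string.punctuation


def build_codebook(events):
    """Assign one unique ASCII character to each distinct event."""
    unique = sorted(set(events))
    if len(unique) > len(SYMBOLS):
        raise ValueError("more distinct events than available ASCII symbols")
    return {event: SYMBOLS[i] for i, event in enumerate(unique)}


def encode_sequence(sequence, codebook):
    """Turn a list of events into one ASCII string, one character per step."""
    return "".join(codebook[event] for event in sequence)


# Hypothetical button presses; any hashable, sortable event works.
presses = ["down", "forward", "punch", "down", "kick"]
codebook = build_codebook(presses)
encoded = encode_sequence(presses, codebook)
print(codebook)
print(encoded)
```

If a single step has several features (say, a button plus a direction), you could use tuples as events so each combination gets its own character.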

How to Use SDEC:

There is an example that will give you an idea of how to use the SDEC library. You'll want to run Example/hub.py in the terminal. Use the line below to get further instructions on how to use it:

python hub.py -h

This will return the following:

usage: hub.py [-h] [-mn MODEL_NAME] [-f FILE] [-md MODELS_DIR] [-dd DATA_DIR]
              [-e EPOCHS] [-sr SAVE_RATE] [-T TOP] [-b BATCHES]
              [-spe STEPS_PER_E] [-rlf RL_FACTOR] [-rlp RL_PATIENCE]
              [-res RESOLUTION [RESOLUTION ...]] [-t] [-ho] [-hom] [-i] [-p]
              [-lm]

optional arguments:
  -h, --help            show this help message and exit
  -mn MODEL_NAME, --model_name MODEL_NAME
                        the name you want your model to be saved as (default:
                        model)
  -f FILE, --file FILE  the name you want your model to be saved as (default:
                        train.txt)
  -md MODELS_DIR, --models_dir MODELS_DIR
                        the location you want your models to be saved in
                        (default: SDEC_Model)
  -dd DATA_DIR, --data_dir DATA_DIR
                        the location containing your data (default: data)
  -e EPOCHS, --epochs EPOCHS
                        use -e to set the number of epochs for training
                        (default: 100)
  -sr SAVE_RATE, --save_rate SAVE_RATE
                        use -sr to set the save rate per x epochs (default:
                        100)
  -T TOP, --top TOP     use -t to (default: 3)
  -b BATCHES, --batches BATCHES
                        use -b to set the number to batch for training
                        (default: 2048)
  -spe STEPS_PER_E, --steps_per_e STEPS_PER_E
                        use -spe to set the number of steps per epochs for
                        training (default: 0)
  -rlf RL_FACTOR, --rl_factor RL_FACTOR
                        use -spe to set the number of steps per epochs for
                        training (default: 0.5)
  -rlp RL_PATIENCE, --rl_patience RL_PATIENCE
                        use -spe to set the number of steps per epochs for
                        training (default: 105)
  -res RESOLUTION [RESOLUTION ...], --resolution RESOLUTION [RESOLUTION ...]
                        use -res to set the resolution (default: [2, 3])
  -t, --train           add -t if you want to train (default: False)
  -ho, --handoff        add -ho if you want to the AI to plot a distr & hand
                        it off to you (default: False)
  -hom, --handoffmulti  add -hom if you want to the AI to plot a distr & hand
                        it off to you (default: False)
  -i, --init            add -i if you want to initilize from some data
                        (default: False)
  -p, --predict         add -p if you want to predict (default: False)
  -lm, --load_model     add -lm if you want to load the model for further
                        training (default: False)
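
The `-res RESOLUTION [RESOLUTION ...]` line in the usage means the flag takes one or more values. Here's a minimal sketch of how a few of these flags could be declared with `argparse`. It's reconstructed from the help text above, not copied from hub.py, so the types and the `-f` help string are assumptions.

```python
import argparse


def build_parser():
    """Rebuild a small subset of the hub.py flags from its help output."""
    parser = argparse.ArgumentParser(
        prog="hub.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Help string below is an assumption; only the default comes from the help text.
    parser.add_argument("-f", "--file", default="train.txt",
                        help="the data file to use")
    parser.add_argument("-e", "--epochs", type=int, default=100,
                        help="use -e to set the number of epochs for training")
    # nargs="+" collects one or more values into a list, e.g. -res 2 3 4
    parser.add_argument("-res", "--resolution", type=int, nargs="+",
                        default=[2, 3], help="use -res to set the resolution")
    # store_true flags default to False and flip to True when present.
    parser.add_argument("-t", "--train", action="store_true",
                        help="add -t if you want to train")
    return parser


args = build_parser().parse_args(["-t", "-res", "2", "3", "4", "-e", "10"])
print(args.resolution, args.epochs, args.train)
```

So on the command line, something like `-res 2 3 4` passes three resolution values at once.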

Clear Steps for using the Example

Understand your domain
- This includes every possible event that can occur in your time series or sequence
Encode every event in your domain so that each one has its own unique ASCII representation
Create a .txt file with every single sequence ASCII encoded, with a label next to it separated with a \t (tab), & new datapoints separated by \n (new line)
Use Example/hub.py with the -i flag to initialize a config file for your dataset
Use Example/hub.py with the -t flag to train on your dataset
- You must also point to the file that you want to train on
Use Example/hub.py with the -p flag to predict on a dataset
- You must also point to the file that you want to predict on, as well as the model that you want to use to predict
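
The file format from the steps above can be sketched like this. It's an assumption-heavy example: it labels a sequence `1` if it contains `dfp` (the Hadouken pattern from the example dataset) & `0` otherwise, and the helper names are made up for illustration. The exact label values used by the example dataset aren't shown here, so check Example/data before copying this.

```python
def format_dataset(sequences, pattern="dfp"):
    """Build the file contents: sequence<TAB>label, one datapoint per line."""
    lines = []
    for seq in sequences:
        label = 1 if pattern in seq else 0  # binary label (1/0 is an assumption)
        lines.append(f"{seq}\t{label}")
    return "\n".join(lines) + "\n"


def parse_dataset(text):
    """Read the same format back into (sequence, label) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        seq, label = line.split("\t")
        pairs.append((seq, int(label)))
    return pairs


text = format_dataset(["ddfpk", "kkpdf", "fdfp"])
pairs = parse_dataset(text)
print(repr(text))
print(pairs)
```

To use it for real, write `text` out to a .txt file (for example the default train.txt) inside your data directory, then point hub.py at it with -f.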

Roadmap

See the changelog for the full list of changes.

Task
Better error messages
Classification on more than binary output
Adding an "Unknown" representation so unknown inputs don't return errors.
Adding subprocesses to the hub.py script for less confusion with -h
SDEC can process sequences that are shorter than the encoding (this might be for gendataset.py only lol)
save `conf.mc` anytime you save a model
Clean up the scripts
Get Benchmarks

Usecases

Use	Details
Discover Hidden Codes	If you have a sequence that contains hidden codes, SDEC can be used to decipher them, so long as the code has some sort of sequential dependency.
Classify Sequences (only binary for now)	SDEC can be used to classify sequences.
Encoder	Can be used as the encoder step for other networks, like an autoencoder: SDEC to latent space, then decoded to whatever.

Jabrils/SDEC