json-1d-resnet

JSON sequence embedding

Given a JSON input such as:

{
  "input": {
    "title": "hello my name is",
    "subtitle": "another title",
    "address": {
      "road": "Corner Frome Road and, North Terrace",
      "city": "Adelaide",
      "state": "SA",
      "postcode": 5000
    }
  },
  "label": 1
}

To preprocess the input, the key-value ResNet model first flattens the JSON into a list of dotted keys and their values:

[
  { "title": "hello my name is" },
  { "subtitle": "another title" },
  { "address.road": "Corner Frome Road and, North Terrace" },
  { "address.city": "Adelaide" },
  { "address.state": "SA" },
  { "address.postcode": "5000" }
]
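
A minimal sketch of what this flattening step might look like (function name and signature are assumptions; leaf values are coerced to strings, as the postcode above suggests):

from typing import Any

def flatten(obj: dict[str, Any], prefix: str = "") -> list[dict[str, str]]:
    """Flatten nested JSON into a list of {dotted.key: value} pairs."""
    pairs: list[dict[str, str]] = []
    for key, value in obj.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs.extend(flatten(value, dotted))
        else:
            pairs.append({dotted: str(value)})  # leaf values are coerced to strings
    return pairs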

It then treats each whole key as a single token, and tokenises each value character by character:

tokens = ["title", "h", "e", "l", ..., "address.postcode", "5", "0", "0", "0"]
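
A rough sketch of that tokenisation step (the function name is hypothetical):

def to_tokens(pairs: list[dict[str, str]]) -> list[str]:
    """Emit the whole key as one token, then one token per value character."""
    tokens: list[str] = []
    for pair in pairs:
        for key, value in pair.items():
            tokens.append(key)    # the whole key is a single token
            tokens.extend(value)  # iterating a string yields its characters
    return tokens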

token_ids: torch.LongTensor = torch.tensor(tokeniser.convert_tokens_to_ids(tokens), dtype=torch.long)

logits = key_value_resnet(token_ids)
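
The model definition is not shown here; as a rough illustration only, a 1-D ResNet classifier over the token ids might look like the following (class names, layer choices, and hyperparameters are all assumptions, not the repository's actual architecture):

import torch
import torch.nn as nn

class ResidualBlock1d(nn.Module):
    """Two 1-D convolutions with a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class KeyValueResNet(nn.Module):
    """Embed token ids, apply residual conv blocks, pool, and classify."""
    def __init__(self, vocab_size: int, dim: int = 64, num_blocks: int = 3, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.Sequential(*[ResidualBlock1d(dim) for _ in range(num_blocks)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)  # (batch, dim, seq_len)
        x = self.blocks(x)
        x = x.mean(dim=-1)                         # global average pool over the sequence
        return self.head(x)                        # logits, shape (batch, num_classes)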

Pros

  • The model's vocabulary is set up to contain every key from the schema plus all printable characters from string.printable, i.e. digits, ascii_letters, punctuation, and whitespace.

Cons

  • Keys need to have been added to the model's vocabulary; any unknown key will be assigned the "UNK" token (see the tokeniser sketch below).
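
A hypothetical sketch of a tokeniser matching this description (vocabulary = special tokens + schema keys + string.printable, with an UNK fallback; the class name and constructor are assumptions):

import string

class KeyValueTokeniser:
    """Sketch only: maps key tokens and printable characters to integer ids."""
    def __init__(self, keys: list[str]):
        # dedupe while preserving order, in case a key collides with a printable character
        vocab_tokens = list(dict.fromkeys(["PAD", "UNK"] + keys + list(string.printable)))
        self.vocab = {tok: i for i, tok in enumerate(vocab_tokens)}
        self.unk_id = self.vocab["UNK"]

    def convert_tokens_to_ids(self, tokens: list[str]) -> list[int]:
        # unknown keys (or characters) fall back to the UNK id
        return [self.vocab.get(tok, self.unk_id) for tok in tokens]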

Usage

JSON dataset

# first generate a schema for the model's vocabulary & the model's nn.Embedding; we'll put the schema in the data folder
# this only needs to be run once
python -m src.dataset.json_files --train_data_path 'data' --test_data_path 'data' --schema_path 'data/schema.json' --write_to_path True

# then run the training script
python -m train --experiment_name "local_json_1d_resnet"
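
Putting the sketches above together, a single forward pass might look like this (names and shapes are illustrative; the real pipeline lives in the training script):

import torch

pairs = flatten({"title": "hello my name is", "address": {"city": "Adelaide"}})
tokens = to_tokens(pairs)

tokeniser = KeyValueTokeniser(keys=[k for pair in pairs for k in pair])
token_ids = torch.tensor([tokeniser.convert_tokens_to_ids(tokens)], dtype=torch.long)  # batch of 1

model = KeyValueResNet(vocab_size=len(tokeniser.vocab))
logits = model(token_ids)  # shape: (1, num_classes)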

TODO