shyamsn97/mario-gpt

Training from base model: output characters seem to be wrong

jamessha opened this issue · 4 comments

Hi, very interesting project! I'm trying to reproduce the results by training from base and I'm running into a problem. Using the training notebook with default parameters for 20k steps, the model converges to a loss of ~0.05. Sampling from the trained model gives reasonable-looking outputs, but the characters look wrong:

[Screenshot, 2023-10-18: a sampled level rendered with incorrect characters]

Any ideas on what's going wrong here?
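
For reference, I'm sampling roughly the way the README does; lm_path here is a placeholder for my trained checkpoint:

from mario_gpt import MarioLM

mario_lm = MarioLM(lm_path="path/to/trained/checkpoint")  # placeholder path

prompts = ["many pipes, many enemies, some blocks, high elevation"]
generated_level = mario_lm.sample(
    prompts=prompts,
    num_steps=1400,
    temperature=2.0,
    use_tqdm=True,
)
generated_level.img  # this render is where the characters come out wrong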

I have the same issue.

Hey! Do you have your full generation code? You actually need to construct the dataset to "train" the tokenizer.

from mario_gpt import MarioDataset, MarioLM
from mario_gpt.utils import view_level

# load your trained checkpoint ("path_to_trained" is a placeholder)
mario_lm = MarioLM(lm_path="path_to_trained")

# constructing the dataset fits the tokenizer to the level data
dataset = MarioDataset(mario_lm.tokenizer)

# now the tokenizer should be good
view_level(dataset.input_ids[:700], mario_lm.tokenizer)

I've been meaning to change this behavior, but for now this should help, I think.

Thanks for the response! It turns out the offending line was
mario_lm = MarioLM(lm_path=lm_path, tokenizer_path='distilgpt2')
I'm not totally sure why I thought that was a good idea 🙃. Using either the upstream tokenizer or saving the tokenizer after training works.

I also tried your suggestion; it works, but you also need to manually set the LM's tokenizer afterwards:
mario_lm.tokenizer = dataset.tokenizer
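
For anyone else hitting this, here's a minimal sketch of the "save the tokenizer after training" route. Paths are placeholders, and I'm assuming MarioLM exposes the underlying Hugging Face model as .lm, so both pieces can be saved with the usual save_pretrained convention:

from mario_gpt import MarioDataset, MarioLM

mario_lm = MarioLM()  # base model + tokenizer
dataset = MarioDataset(mario_lm.tokenizer)  # building the dataset "trains" the tokenizer

# ... run the training loop from the training notebook here ...

# save model and tokenizer side by side (placeholder directory)
mario_lm.lm.save_pretrained("checkpoints/mario_gpt")
mario_lm.tokenizer.save_pretrained("checkpoints/mario_gpt")

# later, load both from the same directory so they stay in sync
mario_lm = MarioLM(lm_path="checkpoints/mario_gpt", tokenizer_path="checkpoints/mario_gpt")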

Appreciate the help!