shyamsn97/mario-gpt

Training from base model: output characters seem to be wrong

jamessha opened this issue · 4 comments

Hi, very interesting project! I'm trying to reproduce the results by training from base and I'm running into a problem. Using the training notebook with default parameters for 20k steps, the model converges to a loss of ~0.05. Sampling from the trained model gives reasonable-looking outputs, but the characters look wrong:

[Screenshot, 2023-10-18: a sampled level rendered with incorrect characters]

Any ideas on what's going wrong here?
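
For reference, I'm sampling roughly the way the README does; lm_path here is a placeholder for my trained checkpoint:

from mario_gpt import MarioLM

mario_lm = MarioLM(lm_path="path/to/trained/checkpoint")  # placeholder path

prompts = ["many pipes, many enemies, some blocks, high elevation"]
generated_level = mario_lm.sample(
    prompts=prompts,
    num_steps=1400,
    temperature=2.0,
    use_tqdm=True,
)
generated_level.img  # this render is where the characters come out wrong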

I have the same issue.

Hey! Do you have your full generation code? You actually need to construct the dataset to "train" the tokenizer.

from mario_gpt import MarioDataset, MarioLM
from mario_gpt.utils import view_level

# load your trained checkpoint ("path_to_trained" is a placeholder)
mario_lm = MarioLM(lm_path="path_to_trained")

# constructing the dataset fits the tokenizer to the level data
dataset = MarioDataset(mario_lm.tokenizer)

# now the tokenizer should be good
view_level(dataset.input_ids[:700], mario_lm.tokenizer)

I've been meaning to change this behavior, but for now this should help, I think.

Thanks for the response! It turns out the offending line was
mario_lm = MarioLM(lm_path=lm_path, tokenizer_path='distilgpt2')
I'm not totally sure why I thought that was a good idea 🙃. Using either the upstream tokenizer or saving the tokenizer after training works.

I also tried your suggestion; it works, but you also need to manually set the LM's tokenizer afterwards:
mario_lm.tokenizer = dataset.tokenizer
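
For anyone else hitting this, here's a minimal sketch of the "save the tokenizer after training" route. Paths are placeholders, and I'm assuming MarioLM exposes the underlying Hugging Face model as .lm, so both pieces can be saved with the usual save_pretrained convention:

from mario_gpt import MarioDataset, MarioLM

mario_lm = MarioLM()  # base model + tokenizer
dataset = MarioDataset(mario_lm.tokenizer)  # building the dataset "trains" the tokenizer

# ... run the training loop from the training notebook here ...

# save model and tokenizer side by side (placeholder directory)
mario_lm.lm.save_pretrained("checkpoints/mario_gpt")
mario_lm.tokenizer.save_pretrained("checkpoints/mario_gpt")

# later, load both from the same directory so they stay in sync
mario_lm = MarioLM(lm_path="checkpoints/mario_gpt", tokenizer_path="checkpoints/mario_gpt")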

Appreciate the help!