karpathy/llama2.c

Trained and LoRA fine-tuned the models to follow instructions for writing tiny stories

cindysridykhan opened this issue · 8 comments

I trained and LoRA fine-tuned (inspired by wlamond's PR) the models to follow instructions and write tiny stories accordingly, with the prompt data available.
Repo
blogpost

Demo: story1080 (video attachment)

Any feedback is welcome!
(Please tell me if it is not appropriate to post this here)

Thanks for sharing. You'll probably get more visibility on Discord.

@cindysridykhan very good idea!

It would be nice to have something like this for chat as well, because currently the chat functionality only works with the very big 7B model, and people without a powerful PC cannot test/run it.

Please see this: #357

What do you think? Are you interested in training tinystories for chat as well?

Thanks Cindy! You got my claps on Medium :)
Really, please share this on the Discord channel. I'm sure the llama2.c community at large will be interested!

This is awesome and exactly the next step I want to look at, so your instructions will be helpful. Thank you!

Thank you so much for your comments and suggestions!
I am currently busy with another project but will think about the chat problem, @xefoci7612!

Thank you @cindysridykhan for your kind answer. Have a nice day!

This was definitely instructive and helped me figure out a nice training method.

So, you can use the prompting as Cindy does, but I was in the process of creating a synthetic dataset, so I had it write a summary for the story like this:
"{ "story" : "Once upon a time...",
"summary" : "a story about..." }

It was very important that the summary have that form to start, because then I prepend "Write " to it to get my instruction. This allows my instructions to be more natural, e.g. "Write a story about a boy going fishing and catching a lot of fish".

However, you can use the prompt reformatting that Cindy has provided directly; you will just need to stick to that template when instructing.

I wanted to maintain as much of the original llama2.c code as I could, so I went back to Cindy's first idea of just training a next-token predictor. The dataset is tokenized like:

prompt + [bos_id] + story + [eos_id]

This is, again, just as Cindy has done, except I don't pad, truncate, or mask.

I added this to process_shard in tinystories.py:

if instruct_train:
    # instruct training: prompt tokens, then BOS, then the story, then EOS
    prompt = "Write " + example["summary"].strip()
    tokenized_prompt = enc.encode(prompt, bos=False, eos=False)
    tokenized_story = enc.encode(text, bos=False, eos=False)
    tokens = tokenized_prompt + [enc.bos_id] + tokenized_story + [enc.eos_id]
else:
    tokens = enc.encode(text, bos=True, eos=False)  # encode the text, use BOS

Then I just train normally (using the default llama2.c train.py). I don't change the loss or modify cross-entropy, instead opting to clean it all up in post. So I just modify the sampling script to cut out the portions I don't want.

If I want to sample as a completion, the input prompt comes in as [bos_id] + "once upon a time ", and the model continues on to produce an eos_id. I then cut off anything after the eos_id.

If I want to sample as an instruction, the input prompt comes in as [eos_id] + prompt + [bos_id], and the model continues on to produce an eos_id. In this case, I cut off everything before bos_id (the original prompt), and everything after the generated [eos_id].

if not instruct_train:
    start_ids = enc.encode(start, bos=True, eos=False)
else:
    if follow_instruction:  # instruct mode: eos + prompt + bos, story follows
        start_ids = [enc.eos_id] + enc.encode(start, bos=False, eos=False) + [enc.bos_id]
    else:  # completion mode
        start_ids = [enc.bos_id] + enc.encode(start, bos=False, eos=False)

Then, when the generation comes out, you just need to use Python's list.index to find your bos/eos ids and cut them off before decoding. I'm not putting all of that in here, as it's straightforward but a bit long and ugly to capture the three modes.
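For reference, a minimal sketch of what the instruct-mode trim might look like (the function and variable names here are mine, not from the actual sampling script):

def trim_instruct(y, bos_id, eos_id):
    # y is the full generated token id list: [eos] + prompt + [bos] + story + [eos]
    start = y.index(bos_id) + 1        # drop the echoed prompt (everything up to bos)
    try:
        end = y.index(eos_id, start)   # first eos emitted after the story starts
    except ValueError:
        end = len(y)                   # model never produced eos; keep everything
    return y[start:end]

Completion mode is the same idea without the prompt cut: find the first eos_id and slice off everything after it.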

The nice thing about this is that you can use it in "completion mode" or "instruct mode" with minor changes to how you prompt it. The tokenization does not need to know what sequence length you are using, so that is free to vary as well. And you don't need to change the loss function to handle masking.

I want to again thank Cindy for posting this; it definitely sped up the work above to have someone already lighting the path and showing what possibilities existed.

A follow-up post on some of my higher-level learnings from this...

Generally, these models get trained in three passes:

  1. Pretraining
  2. Supervised Fine-Tuning (SFT)
  3. Reinforcement Learning from Human Feedback (RLHF)

I'm going to ignore (3) for the moment because I haven't gotten that far, and my insight is about phases (1) and (2).

While I knew this, it wasn't clear to me why each step was necessary. It became apparent on my first few instructions.

Write a story about a robot walking his dog and finding love along the way

Timmy loved his new pet, a fast, chute little pup named Max. One day, Timmy decided to take Max for a walk outside on a beautiful and peaceful day. As they strolled along the broad horizon, Timmy noticed a little bird perched on a branch.

"Hello there, little bird," Timmy said in a sweet chirp. The bird looked up and smiled at Timmy, so happy to see his furry friend. Timmy decided to let Max explore the village on his bike and ride his bike around all summer to see for days. The next day, Timmy saw a beautiful picture of Max and watched Rosalong. He was so excited to see the whole boy and show him all the beautiful things he had done as a bird.

There is some nonsense above, mainly because I cut training early and it is a 15M-parameter model, so likely a bit too small to generate consistent stories. However, also note... there is no mention of a robot. Maybe Timmy is a robot, but the story just never mentions it? Hmm... Let's try again.

Write a story about a boy crossing a bridge with a troll under it.

As soon as Jake stepped onto the bridge, he knew it was going to be the fastest climb in the park. The trolls glided through the trees, their colorful stripes and rusty tides teeming with uneven peaks. Jake was determined to build the most epic tower that the park had ever seen – a sandy-colored taller with glambled waves and tangled banners.

As he made his way through the londus, the trolls met a few cute kids playing outside. Jake decided to join them, his own shoelaces pounding in the wooden empty glow. Just as he reached the top of the rickety bridge, a deer darted onto the lawn, causing Jake to panic. But then, he remembered the moral his mother had taught him: "When life gives you lemons, make lemonade."

With newfound determination, Jake stood there, his eyes fixed on the treetops plastered around him. He knew that he could see for miles, and he had to keep running. He pushed aside his fear and focused on the towering trends that surrounded them. It was a beautiful afternoon, and Jake felt grateful for the adventure he had washed up on the ground. He knew that he could always return to the park and try to care for those who needed it most.

Again, a lot of nonsense, but that's not what catches my eye. It does mention trolls, but they don't seem like the trolls I'm alluding to. Then it hit me: this is all instruction-trained from scratch. What are the chances my stories talk about robots or trolls? Even if they do, how much coverage do I get on those two topics? This is where I realized the importance of the pretraining step. That step just teaches the network concepts. Otherwise, you are restricted to talking about only the topics in your dataset, and even if they appear, in a dataset like TinyStories the concepts are unlikely to be fleshed out. Phases 2 and 3 are essentially alignment steps that say "take the knowledge you have and respond like this".

That said, it seemed to do better if I stuck with topics that might have come up frequently in my dataset:

Write a story about a boy fishing and catching a lot of fish.

Timmy loved fishing. He spent hours each day watching the ducks by the pond. One day, while his parents were out, Timmy decided to go fishing. He set up his line and waited patiently. As he waited, he noticed that the ratio of fish he caught was equivalent to his fishing line.

Timmy was disappointed, but he decided to try again. He cast his line and waited patiently. After a few minutes, he felt a tug. He pulled and pulled, finally reeling a beautiful fish. It was a big trive, and Timmy was thrilled. His parents told him that fishing was a regular part of his hobbies and that he wanted to spend a few hours enjoying his time with them. Timmy was thrilled and couldn't wait to go back and fish again.

One final insight I think I'm developing is that the strength of TinyStories is not necessarily in its reduced vocabulary. I still need to train a net with a smaller vocab, as I'm using the default llama2 32,000-token vocab size, but what I'm seeing above seems consistent with the quality I see in the TinyStories 15M results. So we should be able to get coherent English in models meant even for adult users at 100-200M parameters. I'm waiting for more data to generate, then I'll train a 100-200M parameter model and see if consistency and quality improve.
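As an aside on the vocab point: a smaller tokenizer can be trained directly with sentencepiece (which is what the llama2.c tokenizer is built on). A rough sketch, where the input path, model prefix, and 4096 vocab size are placeholders of my own choosing rather than anything from the repo scripts:

import sentencepiece as spm

# Train a small BPE vocab on the raw story text (assumed to be one story per line).
spm.SentencePieceTrainer.train(
    input="data/tiny_stories_all.txt",
    model_prefix="tok4096",       # writes tok4096.model / tok4096.vocab
    model_type="bpe",
    vocab_size=4096,
)

# Quick sanity check of the new tokenizer.
sp = spm.SentencePieceProcessor(model_file="tok4096.model")
print(sp.encode("Once upon a time", out_type=int))

The pretokenization and training steps would then need to be pointed at this tokenizer model instead of the default 32,000-token Llama 2 one.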

So my next step is to see what types of datasets I can get access to for pretraining, then take my stories dataset, fine-tune on it, and see how it improves.