microsoft/vscode-ai-toolkit

After going through the whole process to generate the project, there are no checkpoints.

venturaEffect opened this issue · 17 comments

First of all, congrats, this looks promising.

But after trying five times to fine-tune the Mistral 7B model, I see there are no checkpoints. I've set batch_size to 4, fp16, gradient accumulation to 8, ... My JSON file is set up as key:value pairs like "instruction" and "output", except that I changed them to "phrase" and "tone". Everything went through with just one warning, nothing serious. I waited until the fine-tuning finished. I see the models folder with qlora and checkpoints. In qlora I have run_history_gpu-cpu.txt:

[Screenshot: contents of run_history_gpu-cpu.txt]

There is also another subfolder called, again, "qlora". Inside it are a .json file, "gpu-cpu_model.json", and another folder called "gpu-cpu_model". Inside that folder is a folder called "adapter" containing "adapter_model.json" and "adapter_model.bin".

But the Checkpoints folder is empty. As said, I've tried several times, with different .json datasets. No clue why it doesn't work.
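For reference, the key:value pairs look like this (the values here are placeholders, not my actual data):

```json
{"phrase": "The contract may be terminated with thirty days' written notice.", "tone": "formal"}
{"phrase": "You can cancel whenever, just give us a month's heads-up.", "tone": "casual"}
```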

I'm running it on Windows 11 with an NVIDIA RTX 4090, and it shows it is running on my GPU.

So, what is going on???

When I run python gradio_chat.py I get:

```
(mistral-7b-env) zasear@zaesarius:/mnt/c/Users/zaesa/OneDrive/Escritorio/AI/Lawyer Mistral Agent/inference$ python gradio_chat.py
Number of GPUs available: 1
Running on device: cuda
CPU threads: 16
Loading checkpoint shards: 100%|████████████████████████████████████████████| 2/2 [02:44<00:00, 82.49s/it]
Traceback (most recent call last):
  File "/mnt/c/Users/zaesa/OneDrive/Escritorio/AI/Lawyer Mistral Agent/inference/gradio_chat.py", line 40, in <module>
    usingAdapter = true
NameError: name 'true' is not defined
```

Appreciate any help!

This might be due to OneDrive interfering with file locks and stuff (I see it in the path).

If I were you, I might try moving everything to some folder that isn't tied to OneDrive.
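For instance, from a Windows command prompt (the destination folder is just an example):

```
:: move the project out of the OneDrive-synced Desktop to a plain local folder
move "C:\Users\zaesa\OneDrive\Escritorio\AI\Lawyer Mistral Agent" "C:\Users\zaesa\AI\Lawyer Mistral Agent"
```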

@venturaEffect, can you make sure the dataset file is getting copied to the dataset folder? Also, did you increase max_steps to match the needs of your dataset?

@vriveras Oh, I included it manually in the dataset folder. Does it do that automatically? No, I haven't increased max_steps. How do I know what value is required?

> This might be due to OneDrive interfering with file locks and stuff (I see it in the path).
>
> If I were you, I might try moving everything to some folder that isn't tied to OneDrive.

Well, it is on my Desktop. Should I maybe move it to my user folder?

@venturaEffect could it be model dependent? Using phi-2, after training there are checkpoints in the models folder:

```
(phi-2-env) elsaco@RIPPER:~/ai/test1/models/checkpoints$ ls -lR
drwxr-xr-x 2 elsaco elsaco 4096 Dec 16 12:56 checkpoint-1000
drwxr-xr-x 2 elsaco elsaco 4096 Dec 16 11:34 checkpoint-500

./checkpoint-1000:
total 822812
-rw-r--r-- 1 elsaco elsaco       464 Dec 16 12:56 README.md
-rw-r--r-- 1 elsaco elsaco       473 Dec 16 12:56 adapter_config.json
-rw-r--r-- 1 elsaco elsaco 167863690 Dec 16 12:56 adapter_model.bin
-rw-r--r-- 1 elsaco elsaco      1098 Dec 16 12:56 added_tokens.json
-rw-r--r-- 1 elsaco elsaco    456318 Dec 16 12:56 merges.txt
-rw-r--r-- 1 elsaco elsaco 671247290 Dec 16 12:56 optimizer.pt
-rw-r--r-- 1 elsaco elsaco     14180 Dec 16 12:56 rng_state.pth
-rw-r--r-- 1 elsaco elsaco      1064 Dec 16 12:56 scheduler.pt
-rw-r--r-- 1 elsaco elsaco       579 Dec 16 12:56 special_tokens_map.json
-rw-r--r-- 1 elsaco elsaco   2115105 Dec 16 12:56 tokenizer.json
-rw-r--r-- 1 elsaco elsaco      7534 Dec 16 12:56 tokenizer_config.json
-rw-r--r-- 1 elsaco elsaco     12015 Dec 16 12:56 trainer_state.json
-rw-r--r-- 1 elsaco elsaco      4472 Dec 16 12:56 training_args.bin
-rw-r--r-- 1 elsaco elsaco    798156 Dec 16 12:56 vocab.json

./checkpoint-500:
total 822812
-rw-r--r-- 1 elsaco elsaco       464 Dec 16 11:34 README.md
-rw-r--r-- 1 elsaco elsaco       473 Dec 16 11:34 adapter_config.json
-rw-r--r-- 1 elsaco elsaco 167863690 Dec 16 11:34 adapter_model.bin
-rw-r--r-- 1 elsaco elsaco      1098 Dec 16 11:34 added_tokens.json
-rw-r--r-- 1 elsaco elsaco    456318 Dec 16 11:34 merges.txt
-rw-r--r-- 1 elsaco elsaco 671247290 Dec 16 11:34 optimizer.pt
-rw-r--r-- 1 elsaco elsaco     14180 Dec 16 11:34 rng_state.pth
-rw-r--r-- 1 elsaco elsaco      1064 Dec 16 11:34 scheduler.pt
-rw-r--r-- 1 elsaco elsaco       579 Dec 16 11:34 special_tokens_map.json
-rw-r--r-- 1 elsaco elsaco   2115105 Dec 16 11:34 tokenizer.json
-rw-r--r-- 1 elsaco elsaco      7534 Dec 16 11:34 tokenizer_config.json
-rw-r--r-- 1 elsaco elsaco      6106 Dec 16 11:34 trainer_state.json
-rw-r--r-- 1 elsaco elsaco      4472 Dec 16 11:34 training_args.bin
-rw-r--r-- 1 elsaco elsaco    798156 Dec 16 11:34 vocab.json
```

As for the error on line 40 in gradio_chat.py: change true to True and it will work. Python has no lowercase true literal, which is why the bare name raises a NameError.
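For reference, the corrected line (the variable name is taken from the traceback above):

```python
usingAdapter = True  # Python booleans are capitalized; the bare name `true` is undefined
```

With that change, the script starts up: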

```
(phi-2-env) elsaco@RIPPER:~/ai/test1/inference$ python gradio_chat.py
Number of GPUs available: 1
Running on device: cuda
CPU threads: 3
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:05<00:00,  2.75s/it]
Number of GPUs available: 1
Model ../model-cache/microsoft/phi-2 loaded successfully on cuda
Running on local URL:  http://127.0.0.1:7860
```

Windows side:

[Screenshot: gradio_chat running in the browser]

> @venturaEffect, can you make sure the dataset file is getting copied to the dataset folder? Also, did you increase max_steps to match the needs of your dataset?
>
> @vriveras Oh, I included it manually in the dataset folder. Does it do that automatically? No, I haven't increased max_steps. How do I know what value is required?

The default max_steps is set for the default dataset and is low, which means you are not reaching the eval steps at which a checkpoint is created. Once training is done, the final adapter should still be available under [Project Path]\models\qlora\qlora\gpu-cpu_model\adapter\; the inference projects will load it automatically from there for testing.

Thanks for pointing out the bug with the inference on line 40; we have now fixed the templates.

OK, I'm following your steps, even though I don't think increasing max_steps matters for the little data I'm trying to fine-tune on.

Now, I don't know why, but when trying to select the path to the project folder (by the way, when clicking on the input the mouse pointer shows a forbidden icon?), it suddenly opens the search bar at the top of Visual Studio Code to select the path. I can't browse through folders; I have to type the path into the bar. Really bad UX. Anyway...

When relaunching the window in the workspace, it now tells me this:

[Screenshot: workspace path error]

There is clearly something wrong with how the path is resolved.

Thanks for the feedback on the UI; we haven't seen that before. We will try to repro it and get a fix out.

As for the fine-tuning: if you have little data there is no need to increase max_steps, and in that case why would you need intermediate checkpoints beyond the final adapter, which is generated at the end in the path I shared? The default save interval is calculated by the trainer; if you want to force checkpoint creation, you can set 'save_steps' after the project is generated, but you will need to use a small number of steps that matches your dataset.
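For illustration only, this is not the toolkit's generated config: a sketch of those knobs using Hugging Face's TrainingArguments, which QLoRA training scripts commonly build on; the generated project may name these fields differently, and all values are examples to adapt.

```python
from transformers import TrainingArguments

# Sketch only: a checkpoint is written every `save_steps` optimizer steps,
# so for a small dataset the interval must be smaller than max_steps.
args = TrainingArguments(
    output_dir="models/checkpoints",
    max_steps=200,                   # total optimizer steps for the run
    save_strategy="steps",           # save on a step interval, not per epoch
    save_steps=50,                   # force a checkpoint every 50 steps
    evaluation_strategy="steps",     # evaluate on the same cadence
    eval_steps=50,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
)
```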

Thanks for the answer, but it isn't clear to me what the next steps would be. I have no checkpoints. Given only the adapter, what is the next step? How do I interact with my fine-tuned Mistral model?

Appreciate it!

The project is not saved inside the WSL instance but under C:\home\<username>\; that's why you get the Workspace does not exist error. See #28.

To continue fine-tuning, copy the project from C:\home into your home directory inside WSL, then run conda activate mistral-7b-env and follow the README for more info.
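A minimal sketch of that workaround from a WSL shell (the project folder name is illustrative):

```bash
# Windows drives are auto-mounted under /mnt inside WSL
cp -r "/mnt/c/home/<username>/my-mistral-project" ~/my-mistral-project
cd ~/my-mistral-project
conda activate mistral-7b-env   # environment name from the generated project
```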

After the fine-tuning is done, just run one of the inference scripts, console_chat.py or gradio_chat.py; they will automatically load the model with the adapter. You can take a look here: https://github.com/microsoft/windows-ai-studio/blob/main/walkthrough-simple-dataset.md#inferencing-with-the-fine-tuned-model
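Under the hood, loading "the model with the adapter" amounts to something like the following sketch using the peft library (the paths and base model ID are assumptions, not the script's actual contents):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_path = "../model-cache/mistralai/Mistral-7B-v0.1"        # assumed cache layout
adapter_path = "../models/qlora/qlora/gpu-cpu_model/adapter"  # path from the comment above

tokenizer = AutoTokenizer.from_pretrained(base_path)
base = AutoModelForCausalLM.from_pretrained(base_path, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_path)         # applies the fine-tuned LoRA adapter
```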

OK, thanks, I will try. In any case, this seems to be a bug.

> The project is not saved inside the WSL instance but under C:\home\<username>\; that's why you get the Workspace does not exist error. See #28.
>
> To continue fine-tuning, copy the project from C:\home into your home directory inside WSL, then run conda activate mistral-7b-env and follow the README for more info.

I appreciate your response. I just can't understand why this new workaround is needed, or why the bar opens with a wrong default path. It makes no sense.

This has nothing to do with the checkpoint. Projects are always created in Windows unless you placed them in WSL when creating them. WSL mounts your drives automatically, and we use that as the location: C:\Users\ becomes /mnt/c/Users automatically on launch. You can manually open a project after it is created by opening the folder over a WSL remote: go to the folder in WSL and type 'code .' and it will open. You can follow the instructions after that to do inference.
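For example, from a WSL shell (the project path is illustrative):

```bash
cd "/mnt/c/Users/<username>/Desktop/my-project"   # the Windows folder, auto-mounted
code .                                            # reopens it in VS Code over the WSL remote
```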

This is a joke:

[Screenshot: error dialog]

Following your steps, the path is good (if it is how you say...). But no:

[Screenshot: another workspace error]

It has been like this since yesterday night. I don't know what you have done; it just doesn't find the workspace.

I have also done all the other things you said: fixed the code error (from true to True), set max_steps (even if not necessary), and tried to run it even though there are no checkpoints, because the adapter is there and it should work.

Nothing: errors, wrong paths, an icon showing forbidden when hovering over the path input. I've tried over 20 times. One run stayed up for about 24 hours (on Windows 11 with an NVIDIA 4090).

The promise was great, and it should be legit coming from Microsoft. But I'm starting to miss LangChain.

Sorry for the delayed response. Please make sure that you are disconnected from WSL before creating the workspace; the bottom-left corner of the VS Code window shows your connection status.

[Screenshot: VS Code status bar showing a WSL remote connection]

Once the workspace is created, then use WSL. It should look something like this while you are creating the workspace:

[Screenshot: VS Code status bar with no remote connection]