microsoft/archai

[BUG] crash in algos.ipynb when I try and run it on my cuda device...

lovettchris opened this issue · 2 comments

Describe the bug

I don't know if this is a Windows thing or not, but when I run PartialTrainingValAccuracy on my CUDA device, the parallel_partial_tr block crashes with:

error 18:24:45.920: Raw kernel process exited code: 3
error 18:24:45.922: Error in waiting for cell to complete Error: Canceled future for execute_request message before replies were done
    at t.KernelShellFutureHandler.dispose (c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:33213)
    at c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:52265
    at Map.forEach (<anonymous>)
    at y._clearKernelState (c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:52250)
    at y.dispose (c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:45732)
    at c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:17:139244
    at Z (c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:2:1608939)
    at Kp.dispose (c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:17:139221)
    at qp.dispose (c:\Users\clovett\.vscode\extensions\ms-toolsai.jupyter-2023.1.2000312134\out\extension.node.js:17:146518)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
warn 18:24:45.923: Cell completed with errors {
  message: 'Canceled future for execute_request message before replies were done'

I wonder if the snippet in your markdown description is missing the device="cuda" parameter on the PartialTrainingValAccuracy constructor?

RayParallelObjective(
    PartialTrainingValAccuracy(training_epochs=1),
    num_gpus=0.5, # 2 jobs per gpu available
    max_calls=1
)

Because this is what you have in the code a bit later on:

    RayParallelEvaluator(
        PartialTrainingValAccuracy(training_epochs=1, device='cuda'),
        num_gpus=0.5, # 2 jobs per gpu available
        max_calls=1
    ),

So you might want to mention here that this requires a machine with a GPU and a CUDA-enabled Python setup. I have both, so this worked on my machine, but a heads-up might be necessary for other readers. Also, is there a "first notebook" entry point to all these notebooks?
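For what it's worth, the heads-up could be paired with a small guard cell along these lines. This is only a sketch: `pick_device` is a hypothetical helper (not part of Archai), and it assumes the notebook runs on PyTorch, falling back to the CPU when PyTorch or a CUDA device is missing.

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return `preferred` if it looks usable, otherwise fall back to "cpu".

    Hypothetical helper for illustration; assumes a PyTorch backend.
    """
    if preferred != "cuda":
        # Nothing to verify for non-CUDA devices in this sketch.
        return preferred
    try:
        import torch  # imported lazily so the cell also runs without PyTorch
    except ImportError:
        return "cpu"
    # torch.cuda.is_available() reports whether a usable CUDA device exists.
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running partial training on: {device}")
```

The result could then be passed along as `PartialTrainingValAccuracy(training_epochs=1, device=device)`, so the cell does not hard-code `'cuda'` on machines without a GPU.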

I am tagging @piero2c on some issues because he implemented the discrete search-based notebooks, and the discussion will be better with him included 😄

Hi @chris, I agree. I'll add a disclaimer.