peiyunh/tiny

hr_res101('train') error : "Error using gpuDevice (line 26) Invalid CUDA device id"

niamul070 opened this issue · 2 comments

When I run hr_res101('train'), I get the error mentioned above. Can you tell me how to fix it? Below is the detailed output and error message:

hr_res101('train');

ans =

models/widerface-resnet-101-simple-sample256-posfrac0.5-N25-bboxreg-cluster-scaled

Trying to initialize the structure of resnet-101-simple
Unknown model: cannot initialize.
Loading pretrained weights from ./trained_models/imagenet-resnet-101-dag.mat
Loaded imdb from data/widerface/imdb.mat
cluster path: data/widerface/RefBox_N25_scaled.mat

opts =

struct with fields:

  keepDilatedZeros: 0
         inputSize: [500 500]
      learningRate: [1×30 double]
           trainFn: '@cnn_train_dag_hardmine'
     batchGetterFn: '@cnn_get_batch_hardmine'
      freezeResNet: 0
               tag: ''
        clusterNum: 25
       clusterName: 'scaled'
           bboxReg: 1
        skipLRMult: [0 1 0.1000]
        sampleSize: 256
       posFraction: 0.5000
         posThresh: 0.7000
         negThresh: 0.3000
            border: [0 0]
 pretrainModelPath: './trained_models/imagenet-resnet-101-dag.mat'
           dataDir: 'data/widerface'
         modelType: 'resnet-101-simple'
       networkType: 'dagnn'
batchNormalization: 1
  weightInitMethod: 'gaussian'
    minClusterSize: [10 10]
    maxClusterSize: [Inf Inf]
            expDir: 'models/widerface-resnet-101-simple-sample256-posf...'
         batchSize: 48
     numSubBatches: 1
         numEpochs: 50
              gpus: [1 2 3 4]
   numFetchThreads: 8
              lite: 0
          imdbPath: 'data/widerface/imdb.mat'
             train: [1×1 struct]

ans =

struct with fields:

            gpus: [1 2 3 4]
       batchSize: 48
   numSubBatches: 1
       numEpochs: 50
    learningRate: [1×30 double]
keepDilatedZeros: 0

Start using dagnn.DetLoss for loss
Starting parallel pool (parpool) using the 'local' profile ... Warning: The system time zone setting, 'US/Eastern', does not specify a single
time zone unambiguously. It will be treated as 'America/New_York'. See the datetime.TimeZone property for
details about specifying time zones.

In verifyTimeZone (line 23)
In datetime (line 503)
In parallel.internal.cluster.FileSerializer>iLoadDate (line 345)
In parallel.internal.cluster.FileSerializer/getFields (line 100)
In parallel.internal.cluster.CJSSupport/getProperties (line 252)
In parallel.internal.cluster.CJSSupport/getJobProperties (line 463)
In parallel.internal.cluster.CJSJobMixin/hGetProperty (line 70)
In parallel.internal.cluster.CJSJobMixin/hSetTerminalStateFromCluster (line 98)
In parallel.cluster.CJSCluster/hGetJobState (line 361)
In parallel.internal.cluster.CJSJobMixin/getStateEnum (line 136)
In parallel.Job/get.StateEnum (line 214)
In parallel.Job/get.State (line 206)
In parallel.internal.customattr.CustomGetSet>iVectorisedGetHelper (line 107)
In parallel.internal.customattr.CustomGetSet>@(a,b,c)iVectorisedGetHelper(obj,a,b,c) (line 89)
In parallel.internal.customattr.CustomGetSet/doVectorisedGet (line 90)
In parallel.internal.customattr.CustomGetSet/hVectorisedGet (line 64)
In parallel.internal.customattr.GetSetImpl>iAccessProperties (line 289)
In parallel.internal.customattr.GetSetImpl>iGetAllProperties (line 250)
In parallel.internal.customattr.GetSetImpl.getImpl (line 124)
In parallel.internal.customattr.CustomGetSet/get (line 30)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 464)
In parallel.internal.pool.InteractiveClient/start (line 311)
In parallel.Pool>iStartClient (line 567)
In parallel.Pool.hBuildPool (line 446)
In parallel.internal.pool.doParpool (line 15)
In parpool (line 89)
In cnn_train_dag_hardmine>prepareGPUs (line 604)
In cnn_train_dag_hardmine (line 132)
In cnn_widerface (line 212)
In hr_res101 (line 41)
connected to 4 workers.
cnn_train_dag_hardmine: resetting GPU
Error using cnn_train_dag_hardmine>prepareGPUs (line 616)
Error detected on worker 3.

Error in cnn_train_dag_hardmine (line 132)
prepareGPUs(opts, epoch == start+1) ;

Error in cnn_widerface (line 212)
[net, info] = trainFn(net, imdb, getBatchFn(batchGetter, opts, net.meta), ...

Error in hr_res101 (line 41)
cnn_widerface('inputSize', inputSize, ...

Caused by:
Error using gpuDevice (line 26)
Invalid CUDA device id: 3. Select a device id from the range 1:1.

When I run gpuDevice from the MATLAB prompt, this is what I get:

gpuDevice

ans =

CUDADevice with properties:

                  Name: 'Quadro M4000'
                 Index: 1
     ComputeCapability: '5.2'
        SupportsDouble: 1
         DriverVersion: 8
        ToolkitVersion: 7.5000
    MaxThreadsPerBlock: 1024
      MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
           MaxGridSize: [2.1475e+09 65535 65535]
             SIMDWidth: 32
           TotalMemory: 8.4922e+09
       AvailableMemory: 7.5519e+09
   MultiprocessorCount: 13
          ClockRateKHz: 772500
           ComputeMode: 'Default'
  GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
      CanMapHostMemory: 1
       DeviceSupported: 1
        DeviceSelected: 1

Never mind, I solved it. Thanks.

I'm facing the same problem. Can you please tell me how you resolved this issue?
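
The thread does not record how the original poster fixed this, so the following is only a likely explanation based on the log. The options show gpus: [1 2 3 4], while the error says "Select a device id from the range 1:1" and gpuDevice reports a single Quadro M4000 at index 1. That suggests the training options request four GPUs on a machine that has one. A minimal sketch of a check, assuming the Parallel Computing Toolbox is available (gpuDeviceCount is a real toolbox function; clampGpus and validGpus are hypothetical names used here for illustration and are not part of the repository):

```matlab
% Sketch only: keep just the GPU indices that exist on this machine.
% requestedGpus mirrors the 'gpus' value printed in the log above.
requestedGpus = [1 2 3 4];

% Hypothetical helper (not part of the repository): drop out-of-range indices.
clampGpus = @(requested, numAvailable) ...
    requested(requested >= 1 & requested <= numAvailable);

% gpuDeviceCount (Parallel Computing Toolbox) returns the number of GPUs present.
validGpus = clampGpus(requestedGpus, gpuDeviceCount);

if isempty(validGpus)
    warning('No usable GPU found; validGpus is empty.');
else
    fprintf('Usable GPU indices: %s\n', mat2str(validGpus));
end
```

On the single-GPU machine shown above, validGpus would be 1. Where exactly the gpus list is set in the training scripts is not shown in this thread, so you would need to find that setting and replace [1 2 3 4] with the indices your machine actually has. The batch size of 48 was presumably chosen with four GPUs in mind, so it may also need to be reduced to fit in one GPU's memory.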