unagiootoro/ruby-dnn

GPU Support

kojix2 opened this issue · 10 comments

Hi @unagiootoro

I ran the XOR sample and found that Cumo was slower than Numo.

If you don't mind my asking, do you have a GPU + CUDA environment?

If you don't have a GPU, someone in the Ruby community (including me) will support you with a donation...

I'm not good at English, so it takes me time to write long sentences.
May I speak with you in Japanese?

I'm not good at English, either.
Almost all of my English is written with "Mirai Translate".


Okay.
My environment is one where CUDA can be built.
But since the OS is Windows, I can't build Cumo.
So I want to be able to train with ruby-dnn on a GPU on Windows.
(I still don't know how to do it.)

My guess is that the XOR example involves very little parallel computation, so the fixed per-call GPU overhead (kernel launches and CPU-GPU transfers) outweighs any speedup, and Cumo ends up slower than Numo.
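
To make that concrete, here is a minimal sketch (assuming both the numo-narray and cumo gems are installed; the sizes and iteration count are arbitrary) contrasting a tiny, XOR-sized matrix product with a large one:

require "benchmark"
require "numo/narray"
require "cumo"

# Tiny arrays: kernel-launch and transfer overhead dominates, so the GPU loses.
# Large arrays: the GPU's parallelism pays off.
# (Synchronizing the device before stopping the clock, if your Cumo version
# exposes that, gives more precise GPU timings.)
[4, 2048].each do |n|
  a_cpu = Numo::SFloat.new(n, n).rand
  a_gpu = Cumo::SFloat.new(n, n).rand
  Benchmark.bm(14) do |bm|
    bm.report("Numo #{n}x#{n}") { 100.times { a_cpu.dot(a_cpu) } }
    bm.report("Cumo #{n}x#{n}") { 100.times { a_gpu.dot(a_gpu) } }
  end
end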

OK. I see.
Maybe you're right about why XOR is slow.

One more question. How do you install Ruby on Windows?

  • RubyInstaller + DevKit + MSYS2?
  • Windows Subsystem for Linux?
  • VirtualBox/VMware + Linux?
  • Docker for Windows?
  • None of the above

This may be important for building Cumo.

I'm using WSL, but WSL doesn't support GPU access, so I tried installing Cumo with RubyInstaller.
But the installation was not successful.

I think nvcc on Windows doesn't support gcc as a host compiler (it requires MSVC), so it's hard to run Cumo with RubyInstaller.

I compared the time using the mnist sample.

mnist_example_for_profiler.rb

require "dnn"

include DNN::Layers
include DNN::Activations
include DNN::Optimizers
include DNN::Losses
include DNN::Models

x_train = SFloat.cast Marshal.load(File.binread("x_train.dat"))
x_test  = SFloat.cast Marshal.load(File.binread("x_test.dat"))
y_train = SFloat.cast Marshal.load(File.binread("y_train.dat"))
y_test  = SFloat.cast Marshal.load(File.binread("y_test.dat"))

# A simple 784-256-256-10 multilayer perceptron
model = Sequential.new
model << InputLayer.new(784)
model << Dense.new(256)
model << ReLU.new
model << Dense.new(256)
model << ReLU.new
model << Dense.new(10)
model.setup(RMSProp.new, SoftmaxCrossEntropy.new)

# 20 epochs, mini-batches of 100
model.train(x_train, y_train, 20, batch_size: 100, test: [x_test, y_test], verbose: true)

Time

Numo

# real	1m44.682s
# user	7m4.630s
# sys	5m51.241s

Cumo

# real	1m35.018s
# user	1m17.208s
# sys	0m22.956s

(Numo's user time far exceeds its real time, presumably because Numo::Linalg's BLAS backend runs multi-threaded; Cumo's CPU time stays low because the heavy lifting happens on the GPU.)

stackprof

main.rb

require 'stackprof'
require 'optparse'

# -g / --gpu selects Cumo; --out sets the stackprof dump path
opt = ARGV.getopts("g", "gpu", "out:")

if opt['g'] || opt['gpu']
  puts "Use Cumo"
  require 'cumo'

  # Work around a Cumo issue with #mean on size-1 arrays:
  # https://github.com/sonots/cumo/issues/143
  SFloat = Cumo::SFloat
  class SFloat
    alias mean_original mean
    def mean(*args)
      if size == 1
        self[0]
      else
        mean_original(*args)
      end
    end
  end

else
  puts "Use Numo"
  require "numo/linalg"
  SFloat = Numo::SFloat
end

# Profile the entire training run in CPU mode
StackProf.run(mode: :cpu, out: opt["out"], raw: true) do
  load "./mnist_example_for_profiler.rb"
end

Generate and inspect the profiles:

ruby main.rb    --out profile/numo-mnist.dump
ruby main.rb -g --out profile/cumo-mnist.dump
stackprof profile/numo-mnist.dump
stackprof profile/cumo-mnist.dump
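
To drill into a hot frame, stackprof can also break a single method down by its callers and callees, e.g. (using one of the hot methods from the dumps below):

stackprof profile/numo-mnist.dump --method 'DNN::Optimizers::RMSProp#update_params'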

Numo:

Mode: cpu(1000)
Samples: 25320 (11.51% miss rate)
GC: 1298 (5.13%)

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
  9808  (38.7%)        9808  (38.7%)     DNN::Optimizers::RMSProp#update_params
  7195  (28.4%)        7075  (27.9%)     #<Module:0x0000556476340dc0>.call
  2094   (8.3%)        2094   (8.3%)     DNN::Activations::ReLU#backward
  1298   (5.1%)        1298   (5.1%)     (garbage collection)
  1010   (4.0%)        1010   (4.0%)     DNN::Activations::ReLU#forward
  8062  (31.8%)         802   (3.2%)     #<Module:0x0000556476340e88>.dot
  5717  (22.6%)         768   (3.0%)     DNN::Layers::Dense#backward
 10461  (41.3%)         646   (2.6%)     DNN::Optimizers::Optimizer#update
   411   (1.6%)         411   (1.6%)     DNN::Models::Model#evaluate
 24004  (94.8%)         355   (1.4%)     DNN::Models::Model#train
   183   (0.7%)         183   (0.7%)     DNN::Losses::SoftmaxCrossEntropy.softmax
   350   (1.4%)         167   (0.7%)     DNN::Losses::SoftmaxCrossEntropy#forward_loss
   120   (0.5%)         120   (0.5%)     #<Module:0x0000556476340e88>.blas_char
  3307  (13.1%)         101   (0.4%)     DNN::Layers::Dense#forward
  8155  (32.2%)          93   (0.4%)     Numo::NArray#dot
    65   (0.3%)          65   (0.3%)     Numo::NArray.asarray
    59   (0.2%)          56   (0.2%)     DNN::Models::Model#layers
    52   (0.2%)          33   (0.1%)     DNN::Iterator#next_batch
    29   (0.1%)          29   (0.1%)     DNN::Layers::Layer#built?
    26   (0.1%)          26   (0.1%)     DNN::Link#initialize
   381   (1.5%)          25   (0.1%)     DNN::Losses::Loss#forward
    23   (0.1%)          23   (0.1%)     DNN::Losses::SoftmaxCrossEntropy#backward_loss
    22   (0.1%)          22   (0.1%)     DNN::Iterator#reset_indexs
 22604  (89.3%)          18   (0.1%)     DNN::Models::Model#train_on_batch
    15   (0.1%)          15   (0.1%)     DNN::Layers::InputLayer#forward
  7827  (30.9%)          13   (0.1%)     DNN::Models::Model#backward
    10   (0.0%)          10   (0.0%)     DNN::Layers::Connection#regularizers
 24022  (94.9%)           9   (0.0%)     <top (required)>
    34   (0.1%)           7   (0.0%)     DNN::Losses::Loss#backward
     7   (0.0%)           7   (0.0%)     DNN::Layers::Connection#get_params

Cumo:

Mode: cpu(1000)
Samples: 22564 (9.41% miss rate)
GC: 164 (0.73%)

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
  8007  (35.5%)        8007  (35.5%)     DNN::Models::Model#evaluate
  7692  (34.1%)        7673  (34.0%)     Cumo::NArray#dot
  3259  (14.4%)        3259  (14.4%)     DNN::Activations::ReLU#backward
  1163   (5.2%)        1163   (5.2%)     DNN::Optimizers::RMSProp#update_params
  1017   (4.5%)        1017   (4.5%)     Cumo::NArray#to_f
   352   (1.6%)         340   (1.5%)     DNN::Iterator#next_batch
   280   (1.2%)         177   (0.8%)     DNN::Losses::SoftmaxCrossEntropy#forward_loss
   164   (0.7%)         164   (0.7%)     (garbage collection)
  1281   (5.7%)         115   (0.5%)     DNN::Optimizers::Optimizer#update
  4232  (18.8%)         105   (0.5%)     DNN::Layers::Dense#backward
   103   (0.5%)         103   (0.5%)     DNN::Losses::SoftmaxCrossEntropy.softmax
 22385  (99.2%)          85   (0.4%)     DNN::Models::Model#train
  3631  (16.1%)          66   (0.3%)     DNN::Layers::Dense#forward
    62   (0.3%)          62   (0.3%)     DNN::Activations::ReLU#forward
    32   (0.1%)          32   (0.1%)     DNN::Losses::SoftmaxCrossEntropy#backward_loss
    23   (0.1%)          22   (0.1%)     DNN::Models::Model#layers
    19   (0.1%)          19   (0.1%)     Cumo::NArray.asarray
   301   (1.3%)          17   (0.1%)     DNN::Losses::Loss#forward
    16   (0.1%)          16   (0.1%)     DNN::Layers::Layer#built?
    16   (0.1%)          16   (0.1%)     DNN::Link#initialize
    15   (0.1%)          15   (0.1%)     DNN::Iterator#reset_indexs
    13   (0.1%)          13   (0.1%)     Cumo::SFloat#mean
 22400  (99.3%)          11   (0.0%)     <top (required)>
    11   (0.0%)          11   (0.0%)     DNN::Layers::InputLayer#forward
  7502  (33.2%)          10   (0.0%)     DNN::Models::Model#backward
 12288  (54.5%)           9   (0.0%)     DNN::Models::Model#train_on_batch
     9   (0.0%)           9   (0.0%)     DNN::Layers::Connection#regularizers
    40   (0.2%)           3   (0.0%)     DNN::Losses::Loss#backward
    23   (0.1%)           3   (0.0%)     DNN::Layers::InputLayer#call
  8667  (38.4%)           3   (0.0%)     DNN::Models::Model#accurate

Maybe Cumo will be a little faster if you tune it: in the Cumo profile, DNN::Models::Model#evaluate (35.5%) and Cumo::NArray#dot (34.0%) dominate.

Thank you for benchmarking Numo and Cumo.

When using Cumo, it is necessary to reduce data transfers between the CPU and GPU, so I modified the evaluate method as follows. (Only the multi-class classification path is changed.)

private def evaluate(y, t)
  if y.shape[1..-1] == [1]
    # Binary classification: unchanged per-sample loop
    correct = 0
    y.shape[0].times do |i|
      if @loss_func.is_a?(Losses::SigmoidCrossEntropy)
        correct += 1 if (y[i, 0] < 0 && t[i, 0] < 0.5) || (y[i, 0] >= 0 && t[i, 0] >= 0.5)
      else
        correct += 1 if (y[i, 0] < 0 && t[i, 0] < 0) || (y[i, 0] >= 0 && t[i, 0] >= 0)
      end
    end
  else
    # Multi-class: compare argmaxes entirely on the device; only the final
    # count is transferred back to the host.
    correct = y.max_index(axis: 1).eq(t.max_index(axis: 1)).count
  end
  correct
end

I think this may make Cumo faster.
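
As a rough illustration of why this helps (a sketch, not the previous implementation; the shapes and random data are made up), element access on a Cumo array copies each value back to the host, while the eq + count pattern stays on the GPU:

require "cumo"

y = Cumo::SFloat.new(10_000, 10).rand  # fake network outputs
t = Cumo::SFloat.new(10_000, 10).rand  # fake targets (random here)

# Device-resident: one argmax per array, an element-wise eq, and a single
# scalar count transferred back to the CPU.
correct = y.max_index(axis: 1).eq(t.max_index(axis: 1)).count

# By contrast, a per-sample Ruby loop such as
#   y.shape[0].times { |i| correct += 1 if y[i, true].max_index == t[i, true].max_index }
# triggers a GPU-to-host copy on every element access, which is exactly the
# transfer cost the rewrite above avoids.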

ruby-dnn & Cumo got faster with version 0.13.0 !
Same benchmark code as above

Numo

real 1m47.222s
user 7m29.562s <- Numo::Linalg!! blazing performance
sys 6m4.522s

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
 12618  (51.0%)       12618  (51.0%)     DNN::Optimizers::RMSProp#update_params
  4510  (18.2%)        4414  (17.8%)     #<Module:0x000056193c6c03a8>.call
  1997   (8.1%)        1997   (8.1%)     DNN::Activations::ReLU#backward
  1279   (5.2%)        1279   (5.2%)     (garbage collection)
   895   (3.6%)         895   (3.6%)     DNN::Activations::ReLU#forward
  5410  (21.9%)         845   (3.4%)     #<Module:0x000056193c6c0718>.dot
  4490  (18.1%)         710   (2.9%)     DNN::Layers::Dense#backward
 13250  (53.6%)         625   (2.5%)     DNN::Optimizers::Optimizer#update
 23446  (94.8%)         316   (1.3%)     DNN::Models::Model#train
   260   (1.1%)         260   (1.1%)     DNN::Losses::SoftmaxCrossEntropy.softmax
   467   (1.9%)         207   (0.8%)     DNN::Losses::SoftmaxCrossEntropy#forward
  1800   (7.3%)         114   (0.5%)     DNN::Layers::Dense#forward
    96   (0.4%)          96   (0.4%)     #<Module:0x000056193c6c0718>.blas_char
  5466  (22.1%)          56   (0.2%)     Numo::NArray#dot
    55   (0.2%)          55   (0.2%)     Numo::NArray.asarray
    24   (0.1%)          24   (0.1%)     DNN::Losses::SoftmaxCrossEntropy#backward
    25   (0.1%)          22   (0.1%)     DNN::Iterator#next_batch
    21   (0.1%)          21   (0.1%)     DNN::Iterator#reset
    21   (0.1%)          21   (0.1%)     DNN::Layers::Layer#built?
    21   (0.1%)          21   (0.1%)     DNN::Link#initialize
    18   (0.1%)          17   (0.1%)     DNN::Models::Model#layers
    19   (0.1%)          17   (0.1%)     DNN::Losses::Loss#regularizers_backward
    15   (0.1%)          15   (0.1%)     DNN::Layers::InputLayer#forward
 22880  (92.5%)          13   (0.1%)     DNN::Models::Model#train_on_batch
    15   (0.1%)          10   (0.0%)     DNN::Losses::Loss#regularizers_forward
 23464  (94.8%)           9   (0.0%)     <top (required)>
     8   (0.0%)           8   (0.0%)     #<Module:0x000056193c890480>.learning_phase=
     7   (0.0%)           7   (0.0%)     DNN::Models::Model#evaluate
     7   (0.0%)           7   (0.0%)     DNN::Layers::Connection#get_params
     7   (0.0%)           7   (0.0%)     DNN::Layers::Connection#regularizers

Cumo

real 1m6.295s <- down from 1m35.018s
user 0m58.364s
sys 0m12.208s

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
  5954  (50.7%)        5940  (50.6%)     Cumo::NArray#dot
  3250  (27.7%)        3250  (27.7%)     DNN::Activations::ReLU#backward
   898   (7.6%)         898   (7.6%)     DNN::Optimizers::RMSProp#update_params
   526   (4.5%)         526   (4.5%)     Cumo::NArray#to_f
   200   (1.7%)         200   (1.7%)     DNN::Iterator#next_batch
   264   (2.2%)         163   (1.4%)     DNN::Losses::SoftmaxCrossEntropy#forward
  3858  (32.9%)         131   (1.1%)     DNN::Layers::Dense#backward
   101   (0.9%)         101   (0.9%)     DNN::Losses::SoftmaxCrossEntropy.softmax
  2321  (19.8%)          94   (0.8%)     DNN::Layers::Dense#forward
    71   (0.6%)          71   (0.6%)     DNN::Activations::ReLU#forward
    69   (0.6%)          69   (0.6%)     (garbage collection)
   966   (8.2%)          67   (0.6%)     DNN::Optimizers::Optimizer#update
 11660  (99.3%)          41   (0.3%)     DNN::Models::Model#train
   309   (2.6%)          32   (0.3%)     DNN::Losses::Loss#loss
    29   (0.2%)          29   (0.2%)     DNN::Losses::SoftmaxCrossEntropy#backward
    21   (0.2%)          21   (0.2%)     DNN::Models::Model#evaluate
    14   (0.1%)          14   (0.1%)     Cumo::NArray.asarray
    11   (0.1%)          11   (0.1%)     DNN::Iterator#reset
    11   (0.1%)          11   (0.1%)     DNN::Link#initialize
 11674  (99.4%)          10   (0.1%)     <top (required)>
    13   (0.1%)           9   (0.1%)     DNN::Losses::Loss#regularizers_forward
    10   (0.1%)           9   (0.1%)     DNN::Models::Model#layers
     9   (0.1%)           9   (0.1%)     DNN::Layers::Layer#built?
     7   (0.1%)           7   (0.1%)     Cumo::SFloat#mean
  2413  (20.5%)           4   (0.0%)     DNN::Layers::Layer#call
 10832  (92.2%)           4   (0.0%)     DNN::Models::Model#train_on_batch
     4   (0.0%)           4   (0.0%)     DNN::Layers::InputLayer#forward
     4   (0.0%)           4   (0.0%)     DNN::Layers::Connection#regularizers
     2   (0.0%)           2   (0.0%)     DNN::Losses::Loss#regularizers_backward
     3   (0.0%)           1   (0.0%)     <top (required)>

With Google Colab, you can benchmark ruby-dnn in your browser.
Please help yourself!
https://colab.research.google.com/drive/1RJ8HTNI6akqBYZgZWzFve9c6GTz_Tava