unagiootoro/ruby-dnn

GPU Support

kojix2 opened this issue · 10 comments

Hi @unagiootoro

I ran the XOR sample and found that Cumo was slower than Numo.

If you don't mind my asking, do you have a GPU + CUDA environment?

If you don't have a GPU, someone in the Ruby community (including me) will support you with a donation...

I'm not good at English, so it takes me time to write long sentences.
May I speak with you in Japanese?

I'm not good at English, either.
Almost all of my English is written with "Mirai Translate".


Okay.
My environment is one where CUDA can be built.
But since the OS is Windows, I can't build Cumo.
So I want to be able to train with ruby-dnn on a GPU on Windows.
(I still don't know how to do it.)

My guess is that the XOR example involves very little parallel computation, so the fixed per-call GPU overhead (kernel launches and CPU-GPU transfers) outweighs any speedup, and Cumo ends up slower than Numo.
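
To make that concrete, here is a minimal sketch (assuming both the numo-narray and cumo gems are installed; the sizes and iteration count are arbitrary) contrasting a tiny, XOR-sized matrix product with a large one:

require "benchmark"
require "numo/narray"
require "cumo"

# Tiny arrays: kernel-launch and transfer overhead dominates, so the GPU loses.
# Large arrays: the GPU's parallelism pays off.
# (Synchronizing the device before stopping the clock, if your Cumo version
# exposes that, gives more precise GPU timings.)
[4, 2048].each do |n|
  a_cpu = Numo::SFloat.new(n, n).rand
  a_gpu = Cumo::SFloat.new(n, n).rand
  Benchmark.bm(14) do |bm|
    bm.report("Numo #{n}x#{n}") { 100.times { a_cpu.dot(a_cpu) } }
    bm.report("Cumo #{n}x#{n}") { 100.times { a_gpu.dot(a_gpu) } }
  end
end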

OK. I see.
Maybe you're right about why XOR is slow.

One more question. How do you install Ruby on Windows?

  • RubyInstaller + DevKit + MSYS2?
  • Windows Subsystem for Linux?
  • VirtualBox/VMware + Linux?
  • Docker for Windows?
  • None of the above

This may be important for building Cumo.

I'm using WSL, but WSL doesn't support GPU access, so I tried installing Cumo with RubyInstaller.
But the installation was not successful.

I think nvcc on Windows doesn't support gcc as a host compiler (it requires MSVC), so it's hard to run Cumo with RubyInstaller.

I compared the time using the mnist sample.

mnist_example_for_profiler.rb

require "dnn"

include DNN::Layers
include DNN::Activations
include DNN::Optimizers
include DNN::Losses
include DNN::Models

x_train = SFloat.cast Marshal.load(File.binread("x_train.dat"))
x_test  = SFloat.cast Marshal.load(File.binread("x_test.dat"))
y_train = SFloat.cast Marshal.load(File.binread("y_train.dat"))
y_test  = SFloat.cast Marshal.load(File.binread("y_test.dat"))

# A simple 784-256-256-10 multilayer perceptron
model = Sequential.new
model << InputLayer.new(784)
model << Dense.new(256)
model << ReLU.new
model << Dense.new(256)
model << ReLU.new
model << Dense.new(10)
model.setup(RMSProp.new, SoftmaxCrossEntropy.new)

# 20 epochs, mini-batches of 100
model.train(x_train, y_train, 20, batch_size: 100, test: [x_test, y_test], verbose: true)

Time

Numo

# real	1m44.682s
# user	7m4.630s
# sys	5m51.241s

Cumo

# real	1m35.018s
# user	1m17.208s
# sys	0m22.956s

(Numo's user time far exceeds its real time, presumably because Numo::Linalg's BLAS backend runs multi-threaded; Cumo's CPU time stays low because the heavy lifting happens on the GPU.)

stackprof

main.rb

require 'stackprof'
require 'optparse'

# -g / --gpu selects Cumo; --out sets the stackprof dump path
opt = ARGV.getopts("g", "gpu", "out:")

if opt['g'] || opt['gpu']
  puts "Use Cumo"
  require 'cumo'

  # Work around a Cumo issue with #mean on size-1 arrays:
  # https://github.com/sonots/cumo/issues/143
  SFloat = Cumo::SFloat
  class SFloat
    alias mean_original mean
    def mean(*args)
      if size == 1
        self[0]
      else
        mean_original(*args)
      end
    end
  end

else
  puts "Use Numo"
  require "numo/linalg"
  SFloat = Numo::SFloat
end

# Profile the entire training run in CPU mode
StackProf.run(mode: :cpu, out: opt["out"], raw: true) do
  load "./mnist_example_for_profiler.rb"
end

Generate and inspect the profiles:

ruby main.rb    --out profile/numo-mnist.dump
ruby main.rb -g --out profile/cumo-mnist.dump
stackprof profile/numo-mnist.dump
stackprof profile/cumo-mnist.dump
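
To drill into a hot frame, stackprof can also break a single method down by its callers and callees, e.g. (using one of the hot methods from the dumps below):

stackprof profile/numo-mnist.dump --method 'DNN::Optimizers::RMSProp#update_params'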

Numo:

Mode: cpu(1000)
Samples: 25320 (11.51% miss rate)
GC: 1298 (5.13%)

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
  9808  (38.7%)        9808  (38.7%)     DNN::Optimizers::RMSProp#update_params
  7195  (28.4%)        7075  (27.9%)     #<Module:0x0000556476340dc0>.call
  2094   (8.3%)        2094   (8.3%)     DNN::Activations::ReLU#backward
  1298   (5.1%)        1298   (5.1%)     (garbage collection)
  1010   (4.0%)        1010   (4.0%)     DNN::Activations::ReLU#forward
  8062  (31.8%)         802   (3.2%)     #<Module:0x0000556476340e88>.dot
  5717  (22.6%)         768   (3.0%)     DNN::Layers::Dense#backward
 10461  (41.3%)         646   (2.6%)     DNN::Optimizers::Optimizer#update
   411   (1.6%)         411   (1.6%)     DNN::Models::Model#evaluate
 24004  (94.8%)         355   (1.4%)     DNN::Models::Model#train
   183   (0.7%)         183   (0.7%)     DNN::Losses::SoftmaxCrossEntropy.softmax
   350   (1.4%)         167   (0.7%)     DNN::Losses::SoftmaxCrossEntropy#forward_loss
   120   (0.5%)         120   (0.5%)     #<Module:0x0000556476340e88>.blas_char
  3307  (13.1%)         101   (0.4%)     DNN::Layers::Dense#forward
  8155  (32.2%)          93   (0.4%)     Numo::NArray#dot
    65   (0.3%)          65   (0.3%)     Numo::NArray.asarray
    59   (0.2%)          56   (0.2%)     DNN::Models::Model#layers
    52   (0.2%)          33   (0.1%)     DNN::Iterator#next_batch
    29   (0.1%)          29   (0.1%)     DNN::Layers::Layer#built?
    26   (0.1%)          26   (0.1%)     DNN::Link#initialize
   381   (1.5%)          25   (0.1%)     DNN::Losses::Loss#forward
    23   (0.1%)          23   (0.1%)     DNN::Losses::SoftmaxCrossEntropy#backward_loss
    22   (0.1%)          22   (0.1%)     DNN::Iterator#reset_indexs
 22604  (89.3%)          18   (0.1%)     DNN::Models::Model#train_on_batch
    15   (0.1%)          15   (0.1%)     DNN::Layers::InputLayer#forward
  7827  (30.9%)          13   (0.1%)     DNN::Models::Model#backward
    10   (0.0%)          10   (0.0%)     DNN::Layers::Connection#regularizers
 24022  (94.9%)           9   (0.0%)     <top (required)>
    34   (0.1%)           7   (0.0%)     DNN::Losses::Loss#backward
     7   (0.0%)           7   (0.0%)     DNN::Layers::Connection#get_params

Cumo:

Mode: cpu(1000)
Samples: 22564 (9.41% miss rate)
GC: 164 (0.73%)

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
  8007  (35.5%)        8007  (35.5%)     DNN::Models::Model#evaluate
  7692  (34.1%)        7673  (34.0%)     Cumo::NArray#dot
  3259  (14.4%)        3259  (14.4%)     DNN::Activations::ReLU#backward
  1163   (5.2%)        1163   (5.2%)     DNN::Optimizers::RMSProp#update_params
  1017   (4.5%)        1017   (4.5%)     Cumo::NArray#to_f
   352   (1.6%)         340   (1.5%)     DNN::Iterator#next_batch
   280   (1.2%)         177   (0.8%)     DNN::Losses::SoftmaxCrossEntropy#forward_loss
   164   (0.7%)         164   (0.7%)     (garbage collection)
  1281   (5.7%)         115   (0.5%)     DNN::Optimizers::Optimizer#update
  4232  (18.8%)         105   (0.5%)     DNN::Layers::Dense#backward
   103   (0.5%)         103   (0.5%)     DNN::Losses::SoftmaxCrossEntropy.softmax
 22385  (99.2%)          85   (0.4%)     DNN::Models::Model#train
  3631  (16.1%)          66   (0.3%)     DNN::Layers::Dense#forward
    62   (0.3%)          62   (0.3%)     DNN::Activations::ReLU#forward
    32   (0.1%)          32   (0.1%)     DNN::Losses::SoftmaxCrossEntropy#backward_loss
    23   (0.1%)          22   (0.1%)     DNN::Models::Model#layers
    19   (0.1%)          19   (0.1%)     Cumo::NArray.asarray
   301   (1.3%)          17   (0.1%)     DNN::Losses::Loss#forward
    16   (0.1%)          16   (0.1%)     DNN::Layers::Layer#built?
    16   (0.1%)          16   (0.1%)     DNN::Link#initialize
    15   (0.1%)          15   (0.1%)     DNN::Iterator#reset_indexs
    13   (0.1%)          13   (0.1%)     Cumo::SFloat#mean
 22400  (99.3%)          11   (0.0%)     <top (required)>
    11   (0.0%)          11   (0.0%)     DNN::Layers::InputLayer#forward
  7502  (33.2%)          10   (0.0%)     DNN::Models::Model#backward
 12288  (54.5%)           9   (0.0%)     DNN::Models::Model#train_on_batch
     9   (0.0%)           9   (0.0%)     DNN::Layers::Connection#regularizers
    40   (0.2%)           3   (0.0%)     DNN::Losses::Loss#backward
    23   (0.1%)           3   (0.0%)     DNN::Layers::InputLayer#call
  8667  (38.4%)           3   (0.0%)     DNN::Models::Model#accurate

Maybe Cumo will be a little faster if you tune it: in the Cumo profile, DNN::Models::Model#evaluate (35.5%) and Cumo::NArray#dot (34.0%) dominate.

Thank you for benchmarking Numo and Cumo.

When using Cumo, it is necessary to reduce data transfers between the CPU and GPU, so I modified the evaluate method as follows. (Only the multi-class classification path is changed.)

private def evaluate(y, t)
  if y.shape[1..-1] == [1]
    # Binary classification: unchanged per-sample loop
    correct = 0
    y.shape[0].times do |i|
      if @loss_func.is_a?(Losses::SigmoidCrossEntropy)
        correct += 1 if (y[i, 0] < 0 && t[i, 0] < 0.5) || (y[i, 0] >= 0 && t[i, 0] >= 0.5)
      else
        correct += 1 if (y[i, 0] < 0 && t[i, 0] < 0) || (y[i, 0] >= 0 && t[i, 0] >= 0)
      end
    end
  else
    # Multi-class: compare argmaxes entirely on the device; only the final
    # count is transferred back to the host.
    correct = y.max_index(axis: 1).eq(t.max_index(axis: 1)).count
  end
  correct
end

I think this may make Cumo faster.
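
As a rough illustration of why this helps (a sketch, not the previous implementation; the shapes and random data are made up), element access on a Cumo array copies each value back to the host, while the eq + count pattern stays on the GPU:

require "cumo"

y = Cumo::SFloat.new(10_000, 10).rand  # fake network outputs
t = Cumo::SFloat.new(10_000, 10).rand  # fake targets (random here)

# Device-resident: one argmax per array, an element-wise eq, and a single
# scalar count transferred back to the CPU.
correct = y.max_index(axis: 1).eq(t.max_index(axis: 1)).count

# By contrast, a per-sample Ruby loop such as
#   y.shape[0].times { |i| correct += 1 if y[i, true].max_index == t[i, true].max_index }
# triggers a GPU-to-host copy on every element access, which is exactly the
# transfer cost the rewrite above avoids.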

ruby-dnn & Cumo got faster with version 0.13.0 !
Same benchmark code as above

Numo

real 1m47.222s
user 7m29.562s <- Numo::Linalg!! blazing performance
sys 6m4.522s

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
 12618  (51.0%)       12618  (51.0%)     DNN::Optimizers::RMSProp#update_params
  4510  (18.2%)        4414  (17.8%)     #<Module:0x000056193c6c03a8>.call
  1997   (8.1%)        1997   (8.1%)     DNN::Activations::ReLU#backward
  1279   (5.2%)        1279   (5.2%)     (garbage collection)
   895   (3.6%)         895   (3.6%)     DNN::Activations::ReLU#forward
  5410  (21.9%)         845   (3.4%)     #<Module:0x000056193c6c0718>.dot
  4490  (18.1%)         710   (2.9%)     DNN::Layers::Dense#backward
 13250  (53.6%)         625   (2.5%)     DNN::Optimizers::Optimizer#update
 23446  (94.8%)         316   (1.3%)     DNN::Models::Model#train
   260   (1.1%)         260   (1.1%)     DNN::Losses::SoftmaxCrossEntropy.softmax
   467   (1.9%)         207   (0.8%)     DNN::Losses::SoftmaxCrossEntropy#forward
  1800   (7.3%)         114   (0.5%)     DNN::Layers::Dense#forward
    96   (0.4%)          96   (0.4%)     #<Module:0x000056193c6c0718>.blas_char
  5466  (22.1%)          56   (0.2%)     Numo::NArray#dot
    55   (0.2%)          55   (0.2%)     Numo::NArray.asarray
    24   (0.1%)          24   (0.1%)     DNN::Losses::SoftmaxCrossEntropy#backward
    25   (0.1%)          22   (0.1%)     DNN::Iterator#next_batch
    21   (0.1%)          21   (0.1%)     DNN::Iterator#reset
    21   (0.1%)          21   (0.1%)     DNN::Layers::Layer#built?
    21   (0.1%)          21   (0.1%)     DNN::Link#initialize
    18   (0.1%)          17   (0.1%)     DNN::Models::Model#layers
    19   (0.1%)          17   (0.1%)     DNN::Losses::Loss#regularizers_backward
    15   (0.1%)          15   (0.1%)     DNN::Layers::InputLayer#forward
 22880  (92.5%)          13   (0.1%)     DNN::Models::Model#train_on_batch
    15   (0.1%)          10   (0.0%)     DNN::Losses::Loss#regularizers_forward
 23464  (94.8%)           9   (0.0%)     <top (required)>
     8   (0.0%)           8   (0.0%)     #<Module:0x000056193c890480>.learning_phase=
     7   (0.0%)           7   (0.0%)     DNN::Models::Model#evaluate
     7   (0.0%)           7   (0.0%)     DNN::Layers::Connection#get_params
     7   (0.0%)           7   (0.0%)     DNN::Layers::Connection#regularizers

Cumo

real 1m6.295s <- down from 1m35.018s
user 0m58.364s
sys 0m12.208s

 TOTAL    (pct)     SAMPLES    (pct)     FRAME
  5954  (50.7%)        5940  (50.6%)     Cumo::NArray#dot
  3250  (27.7%)        3250  (27.7%)     DNN::Activations::ReLU#backward
   898   (7.6%)         898   (7.6%)     DNN::Optimizers::RMSProp#update_params
   526   (4.5%)         526   (4.5%)     Cumo::NArray#to_f
   200   (1.7%)         200   (1.7%)     DNN::Iterator#next_batch
   264   (2.2%)         163   (1.4%)     DNN::Losses::SoftmaxCrossEntropy#forward
  3858  (32.9%)         131   (1.1%)     DNN::Layers::Dense#backward
   101   (0.9%)         101   (0.9%)     DNN::Losses::SoftmaxCrossEntropy.softmax
  2321  (19.8%)          94   (0.8%)     DNN::Layers::Dense#forward
    71   (0.6%)          71   (0.6%)     DNN::Activations::ReLU#forward
    69   (0.6%)          69   (0.6%)     (garbage collection)
   966   (8.2%)          67   (0.6%)     DNN::Optimizers::Optimizer#update
 11660  (99.3%)          41   (0.3%)     DNN::Models::Model#train
   309   (2.6%)          32   (0.3%)     DNN::Losses::Loss#loss
    29   (0.2%)          29   (0.2%)     DNN::Losses::SoftmaxCrossEntropy#backward
    21   (0.2%)          21   (0.2%)     DNN::Models::Model#evaluate
    14   (0.1%)          14   (0.1%)     Cumo::NArray.asarray
    11   (0.1%)          11   (0.1%)     DNN::Iterator#reset
    11   (0.1%)          11   (0.1%)     DNN::Link#initialize
 11674  (99.4%)          10   (0.1%)     <top (required)>
    13   (0.1%)           9   (0.1%)     DNN::Losses::Loss#regularizers_forward
    10   (0.1%)           9   (0.1%)     DNN::Models::Model#layers
     9   (0.1%)           9   (0.1%)     DNN::Layers::Layer#built?
     7   (0.1%)           7   (0.1%)     Cumo::SFloat#mean
  2413  (20.5%)           4   (0.0%)     DNN::Layers::Layer#call
 10832  (92.2%)           4   (0.0%)     DNN::Models::Model#train_on_batch
     4   (0.0%)           4   (0.0%)     DNN::Layers::InputLayer#forward
     4   (0.0%)           4   (0.0%)     DNN::Layers::Connection#regularizers
     2   (0.0%)           2   (0.0%)     DNN::Losses::Loss#regularizers_backward
     3   (0.0%)           1   (0.0%)     <top (required)>

With Google Colab, you can benchmark ruby-dnn in your browser.
Please help yourself!
https://colab.research.google.com/drive/1RJ8HTNI6akqBYZgZWzFve9c6GTz_Tava