masatoi/cl-random-forest

implement distributed architecture

incjung opened this issue · 9 comments

Recently, I forked your random-forest repo to implement a distributed architecture.
I planned to use lfarm instead of lparallel.

In your current repo, I found 8 parts that use lparallel:

  • make-refine-vector
  • make-refine-dataset
  • train-refine-learner-multiclass
  • set-activation-matrix!
  • make-forest
  • make-regression-forest
  • forest-feature-importance
  • forest-feature-importance-impurity

First of all, I began by modifying make-refine-vector and train-refine-learner-multiclass.

;; Ask the lfarm workers for the leaf index of the given datum in each tree.
(defun lfarm-leaf-index-list-task (datamatrix datum-index forest)
  (lfarm:pmapcar
    (lambda (dtree)
      (let ((node (find-leaf (dtree-root dtree) datamatrix datum-index)))
        (node-leaf-index node)))
    (forest-dtree-list forest)))

;; Build the sparse refine vector of one datum from its per-tree leaf indices.
(defun make-refine-vector (forest datamatrix datum-index)
  (let ((index-offset (forest-index-offset forest))
        (n-tree (forest-n-tree forest)))
    (declare (optimize (speed 3) (safety 0))
             (type (simple-array double-float) datamatrix)
             (type (simple-array fixnum) index-offset)
             (type fixnum datum-index n-tree))
    (let ((leaf-index-list (lfarm-leaf-index-list-task datamatrix datum-index forest)))
      (let ((sv-index (make-array (forest-n-tree forest) :element-type 'fixnum))
            (sv-val (make-array (forest-n-tree forest)
                                :element-type 'double-float :initial-element 1d0)))
        (declare (type (simple-array fixnum) sv-index)
                 (type (simple-array double-float) sv-val))
        (loop for i fixnum from 0 to (1- n-tree)
              for index fixnum in leaf-index-list
              do (setf (aref sv-index i) (+ index (aref index-offset i))))
        (clol.vector:make-sparse-vector sv-index sv-val)))))

;; Train the one-vs-rest AROW learner of a single class over the whole dataset
;; and return it, so the client can store the updated learner.
(lfarm:deftask pdotimes (class-id len sv-vec refine-dataset learners target)
  (loop for datum-id fixnum from 0 to (1- len) do
      (setf (clol.vector:sparse-vector-index-vector (svref sv-vec class-id))
            (svref refine-dataset datum-id))
      (clol:sparse-arow-update (svref learners class-id)
                               (svref sv-vec class-id)
                               (if (= (aref target datum-id) class-id) 1d0 -1d0)))
  (svref learners class-id))

(defun train-refine-learner-multiclass (refine-learner refine-dataset target)
  (let* ((len (length refine-dataset))
         (n-tree (length (svref refine-dataset 0)))
         (n-class (clol::one-vs-rest-n-class refine-learner))
         (learners (clol::one-vs-rest-learners-vector refine-learner))
         (sv-index (make-array n-tree :element-type 'fixnum :initial-element 0))
         (sv-val (make-array n-tree :element-type 'double-float :initial-element 1d0))
         (sv-vec (make-array n-class)))
    (loop for i fixnum from 0 to (1- n-class) do
      (setf (svref sv-vec i)
            (clol.vector:make-sparse-vector sv-index sv-val)))
    (let ((channel (lfarm:make-channel)))
      (dotimes (class-id n-class)
        (lfarm:submit-task channel #'pdotimes class-id len sv-vec refine-dataset learners target)
        (setf (svref learners class-id) (lfarm:receive-result channel))))))

When I tried this, here is what I noticed:

In cl-random-forest, some reference objects (e.g., the learner and clol.vector instances) are mutated while building the forest model. lparallel can setf or otherwise modify them because its workers are threads within a single process, but lfarm cannot, because its workers run in separate processes. For example (a minimal sketch, assuming both kernels are set up):
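;; lparallel workers are threads in this image, so they can mutate
;; shared structure in place:
(defparameter *shared* (make-array 4 :initial-element 0))
(lparallel:pdotimes (i 4)
  (setf (aref *shared* i) (* i i)))
*shared* ; => #(0 1 4 9)

;; lfarm workers live in separate processes: a task only sees a serialized
;; copy of its arguments, so the same effect has to be written as
;; "compute remotely, return the values, and store them on the client":
(lfarm:deftask square (i) (* i i))
(defparameter *merged*
  (coerce (lfarm:pmapcar 'square '(0 1 2 3)) 'vector))
*merged* ; => #(0 1 4 9)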

Could you give me any comments or help? Anything is welcome.

Thanks.

Inchul

Do you know where the individual trees are built? Perhaps instead of a vector, individual nodes need to compute the trees and return the results? Just a guess here.

Hi.

Individual trees are built in the definition of make-forest:

(push-ntimes n-tree (forest-dtree-list forest)
      (make-dtree n-class datamatrix target
                  :max-depth max-depth
                  :min-region-samples min-region-samples
                  :n-trial n-trial
                  :gain-test gain-test
                  :remove-sample-indices? remove-sample-indices?
                  :save-parent-node? save-parent-node?
                  :sample-indices (bootstrap-sample-indices
                                   (floor (* (array-dimension datamatrix 0) bagging-ratio))
                                   datamatrix)))

The push-ntimes macro expands to lparallel's pdotimes macro, and the results are pushed in parallel in whatever order they finish. lfarm doesn't seem to have pdotimes, but it can simply be replaced with something like pmapcar. For example, a rough sketch with a toy task standing in for the real tree construction (build-tree-task is hypothetical):
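;; lparallel style pushes each result from a worker thread:
;;   (lparallel:pdotimes (i n-tree) (push (make-dtree ...) list))
;; With lfarm, define a task and collect the results instead:
(lfarm:deftask build-tree-task (i)
  (declare (ignore i))
  ;; a real version would build and return one dtree here
  (random 100))

(lfarm:pmapcar 'build-tree-task (alexandria:iota 10))
;; => a list of 10 results, one per task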

I think each individual node should hold a sub-forest and use multithreading within it. Therefore it may be better to distribute make-forest rather than make-dtree.

Hi, Masatoi
Thank you for the feedback.
As you pointed out, I agree that we don't need to reimplement all of the lparallel parts with lfarm, only the important ones.

Hi,

I modified the push-ntimes macro to use lfarm:pmapcar:

(defmacro push-ntimes (n lst &body body)
  "Choose multi-process (lfarm), multi-thread (lparallel) or single-threaded execution."
  (let ((var (gensym)))
    `(if lfarm:*kernel*
         (progn
           (format t "lfarm *kernel* : ~A" lfarm:*kernel*)
           ;; lfarm workers cannot push into a shared list, so collect
           ;; the results with pmapcar and store them all at once.
           (setf ,lst (lfarm:pmapcar (lambda (_) (declare (ignore _)) ,@body)
                                     (alexandria:iota ,n))))
         (if lparallel:*kernel*
             (lparallel:pdotimes (,var ,n)
               (progn
                 ,var ; referenced only to silence unused-variable warnings
                 (push (progn ,@body) ,lst)))
             (dotimes (,var ,n)
               (progn
                 ,var
                 (push (progn ,@body) ,lst)))))))
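A quick sanity check at the REPL (assuming a connected lfarm kernel; note that lfarm serializes the lambda form, so the body should not close over lexical variables):

(defparameter *acc* '())
(push-ntimes 3 *acc* (random 100))
*acc* ; => a list of three numbers computed on the workers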

Eventually, I tested it with the example from your front page and confirmed that it works.

In the near future, I will test it with a larger dataset such as MNIST (any recommendations?) and let you know the results.

Hi, I have tested this lfarm version on my MacBook laptop.

I used mnist.lisp to test three cases: 1) only lparallel, 2) lparallel + lfarm, 3) only lfarm.

 ;; (lfarm-server:start-server "0.0.0.0" 11111 :background t)
 ;; (lfarm-server:start-server "0.0.0.0" 44444 :background t)
 (setf lfarm:*kernel* (lfarm:make-kernel '(("127.0.0.1" 11111)
                                           ("127.0.0.1" 44444))))
(setf lfarm:*kernel* nil)

;;; Enable/disable parallelization
(setf lparallel:*kernel* (lparallel:make-kernel 2))
(setf lparallel:*kernel* nil)

(time
 (defparameter mnist-forest
   (make-forest mnist-n-class mnist-datamatrix mnist-target
                :n-tree 500 :bagging-ratio 0.1 :max-depth 10 :n-trial 10 :min-region-samples 5)))

Test results with the MNIST dataset:

Case 1: only lparallel

(DEFPARAMETER MNIST-FOREST (MAKE-FOREST MNIST-N-CLASS MNIST-DATAMATRIX MNIST-TARGET :N-TREE 500 :BAGGING-RATIO 0.1 :MAX-DEPTH 10 :N-TRIAL 10 :MIN-REGION-SAMPLES 5))
took 50,888,376 microseconds (50.888374 seconds) to run.
     10,878,775 microseconds (10.878775 seconds, 21.38%) of which was spent in GC.
During that period, and with 4 available CPU cores,
     58,696,605 microseconds (58.696606 seconds) were spent in user mode
     16,648,581 microseconds (16.648580 seconds) were spent in system mode
 37,136 bytes of memory allocated.
 17,435 minor page faults, 0 major page faults, 0 swaps.

Case 2: lparallel + lfarm

; No value
lfarm *kernel* : #<KERNEL :WORKER-COUNT 2 #x302052EFA5CD>
(DEFPARAMETER MNIST-FOREST (MAKE-FOREST MNIST-N-CLASS MNIST-DATAMATRIX MNIST-TARGET :N-TREE 500 :BAGGING-RATIO 0.1 :MAX-DEPTH 10 :N-TRIAL 10 :MIN-REGION-SAMPLES 5))
took 303,950,544 microseconds (303.950530 seconds) to run.
      80,855,172 microseconds ( 80.855170 seconds, 26.60%) of which was spent in GC.
During that period, and with 4 available CPU cores,
     415,548,121 microseconds (415.548130 seconds) were spent in user mode
      33,401,655 microseconds ( 33.401653 seconds) were spent in system mode
 66,512 bytes of memory allocated.
 597,599 minor page faults, 0 major page faults, 0 swaps.

Case 3: only lfarm

lfarm *kernel* : #<KERNEL :WORKER-COUNT 2 #x30204CC5C37D>
(DEFPARAMETER MNIST-FOREST (MAKE-FOREST MNIST-N-CLASS MNIST-DATAMATRIX MNIST-TARGET :N-TREE 500 :BAGGING-RATIO 0.1 :MAX-DEPTH 10 :N-TRIAL 10 :MIN-REGION-SAMPLES 5))
took 311,230,258 microseconds (311.230300 seconds) to run.
      85,614,539 microseconds ( 85.614530 seconds, 27.51%) of which was spent in GC.
During that period, and with 4 available CPU cores,
     415,987,408 microseconds (415.987400 seconds) were spent in user mode
      33,889,908 microseconds ( 33.889908 seconds) were spent in system mode
 66,512 bytes of memory allocated.
 603,075 minor page faults, 0 major page faults, 0 swaps.

It looks like lfarm (multi-process) is not helping to improve the performance of cl-random-forest.
I am looking into the reason.

Any feedback is welcome.

Thanks

Perhaps this would improve if using multiple machines? Could there be additional overhead with lfarm vs. lparallel on a single machine?

For any significant speedup, I believe the dataset and the model need to be large.

I suspect overhead in distributing the tasks, perhaps because the dataset is copied to every worker process. One way to get a feel for this copying cost is to time sending a large array as a task argument (a rough sketch):
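;; Each task argument is serialized and sent to a worker, so passing
;; a large array in two tasks sends it over the wire twice.
(defparameter *big*
  (make-array 1000000 :element-type 'double-float :initial-element 0d0))

(lfarm:deftask payload-length (arr)
  (length arr))

(time (lfarm:pmapcar 'payload-length (list *big* *big*)))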

I distributed the dataset to each node beforehand and trained sub-forests independently. Since my compute nodes are not powerful, I experimented with the iris dataset instead of MNIST.

(ql:quickload :lfarm-client)

(defpackage clrf-multinode
  (:use :cl :cl-random-forest))

(in-package :clrf-multinode)

;; Create two servers (Eval at server side)
;; (ql:quickload :lfarm-server)
;; (lfarm-server:start-server "49.212.153.220" 11524 :background t) ; ayame
;; (lfarm-server:start-server "49.212.137.221" 11524 :background t) ; kakitsubata

;; Connect to the servers
(setf lfarm:*kernel* (lfarm:make-kernel '(("49.212.153.220" 11524)
                                          ("49.212.137.221" 11524))))

;; Load library
(lfarm:broadcast-task #'ql:quickload :cl-random-forest)
(lfarm:broadcast-task (lambda () (defpackage :clrf-multinode (:use :cl :cl-random-forest))))

;; Init random-state
(lfarm:broadcast-task
 (lambda ()
   (setq *random-state* (make-random-state t))))

;; Load dataset. It is necessary to distribute the dataset file to each server beforehand
(lfarm:broadcast-task
 (lambda ()
   (defparameter dim 4)
   (defparameter n-class 3)

   (multiple-value-bind (datamat target)
       (clrf.utils:read-data "/home/wiz/datasets/iris.scale" dim)
     (defparameter datamatrix datamat)
     (defparameter target target))))

;; Train
(lfarm:broadcast-task
 (lambda ()
   ;;(setf lparallel:*kernel* (lparallel:make-kernel 2))
   (defparameter forest
     (make-forest n-class datamatrix target
                  :n-tree 100 :bagging-ratio 1.0 :min-region-samples 5 :n-trial 10 :max-depth 5))))

When predicting, compute the class distribution on each server's sub-forest and then average them at the client.

;; Predict
(defun predict-forest-multinode (index)
  (let* ((result (lfarm:broadcast-task
                  `(lambda ()
                     (clrf::class-distribution-forest forest datamatrix ,index))))
         (n-node (length result))
         (n-dim (length (aref result 0)))
         (class-dist (make-array n-dim :element-type 'double-float :initial-element 0d0)))
    (print result)
    (loop for res across result do
      (loop for i from 0 below n-dim
            for elem across res do
              (incf (aref class-dist i) elem)))
    (loop for i from 0 below n-dim do
      (setf (aref class-dist i) (/ (aref class-dist i) n-node)))
    class-dist))

(predict-forest-multinode 0)
;; => #(0.9836666666666667d0 0.014666666666666668d0 0.0016666666666666666d0)
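From here, a rough accuracy check over the whole dataset could look like the sketch below. It assumes the iris file is also available on the client, so that the class labels in target can be read locally with clrf.utils:read-data, and argmax is a small helper (not part of cl-random-forest):

(defun argmax (vec)
  (loop with best = 0
        for i from 1 below (length vec)
        when (> (aref vec i) (aref vec best)) do (setf best i)
        finally (return best)))

(multiple-value-bind (datamat target)
    (clrf.utils:read-data "/home/wiz/datasets/iris.scale" 4) ; 4 = input dimension, as above
  (declare (ignore datamat))
  (loop for i from 0 below (length target)
        count (= (argmax (predict-forest-multinode i)) (aref target i)) into hits
        finally (return (float (/ hits (length target))))))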