sinkingsugar/nimtorch

Question. Import model trained under a Python / Torch library.

UNIcodeX opened this issue · 17 comments

Would it be possible, or even advisable, to import a .pth or .pkl file that was trained using FastAI into NimTorch, for the purpose of exposing it from a backend written in Nim (for efficiency and speed)?

Sure, it was even planned, but sadly never implemented, as the project didn't really manage to attract a community of users.
Also, limitations of Nim (e.g. the GC) would limit the efficiency.

What would be the limitations of the Nim GC? As I understand it, the Nim GC is not "stop the world" like Python's GIL. Also, I'm seeing a good bit of talk about replacing the default Nim GC with ARC (reference counting) and ORC (reference counting with a cycle collector).

In light of all of this, would there still be an issue in supporting import of Python FastAI V1 trained models into NimTorch?

A GC won't be able to properly track GPU memory and, most of all, in this case it interacts very poorly with the C++ side.

In general, GCs are a poor fit for machine learning; the original Torch is a case study on the topic.

ARC/ORC: not sure, really; my experience with them was very unstable and poor in terms of performance. But maybe they are getting better.

Arc can technically support a scope-based custom allocator; developing better support for such an animal would be a huge attraction to both Nim and a GPU-based ML library.

Yes ARC could work well with C++ RAII, but in practice I encountered many issues. Maybe things leak less now.
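Something along these lines is what RAII-style lifetime management would look like under ARC. Note that cAlloc/cFree below are just hypothetical stand-ins for an allocation that crosses into C++ (or CUDA), not actual NimTorch or ATen calls; this is a minimal sketch, not working integration code.

# Minimal sketch: ARC calls =destroy deterministically at scope exit,
# which is how a C++-owned buffer could be released without a GC pass.
# Compile with --gc:arc (or --gc:orc).

type
  TensorHandle = object
    data: pointer
    len: int

proc cAlloc(bytes: int): pointer = alloc0(bytes)  # pretend this crosses into C++
proc cFree(p: pointer) = dealloc(p)               # pretend this is the C++ side

proc `=destroy`(t: var TensorHandle) =
  if t.data != nil:
    cFree(t.data)
    t.data = nil

# In real code you would also define or forbid `=copy`
# so the buffer cannot be double-freed.

proc newTensorHandle(len: int): TensorHandle =
  TensorHandle(data: cAlloc(len), len: len)

proc work() =
  let t = newTensorHandle(1024)
  echo "holding ", t.len, " bytes"
  # =destroy runs here, at scope exit

work()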

I don’t know when you tried, but without filed issues it’s hard to know which rough spots need polish.

We need more demanding tests, but FWIW, these days I expect arc to work and I expect my idiomatic code to be quite a bit faster than when the same code runs under refc.

It’s useful to hear under what circumstances this isn’t the case.

I don't have much context; it was just a regular project with many refs inside seqs and some nesting, and they were all leaking.. but it was a few months ago. I'd have to run it again at some point and see.
Orc was also utterly slow.. like 10x slowdowns.. maybe things changed again.. although honestly, without proper lifetime management I expect ARC/ORC to be slow, especially in the case of cycles.
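To make the cycle point concrete, here is a toy illustration (just a sketch): plain ARC is reference counting and will leak a cycle like this, while ORC adds a cycle collector that can reclaim it, at some runtime cost.

type
  Node = ref object
    next: Node

proc makeCycle() =
  let a = Node()
  let b = Node(next: a)
  a.next = b   # a -> b -> a forms a reference cycle

when isMainModule:
  for _ in 1 .. 1_000_000:
    makeCycle()
  # Compare resident memory of:
  #   nim c -d:danger --gc:arc cycles.nim && ./cycles
  #   nim c -d:danger --gc:orc cycles.nim && ./cycles
  # arc never frees the cycles; orc's collector reclaims them.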

I also think I may not be doing something correctly with regard to arc/orc. When testing Jester with orc on the latest Nim@#devel I was getting around 3600 requests per second, whereas markAndSweep was around 4500.

That said, I'm very excited about where things are headed for Nim and ecosystem, and would be VERY interested in NimTorch supporting import of my Python / FastAI trained models. I'm sure, as disruptek states, it would be quite a draw to Nim from Python developers looking for more interop / speed.

Well, if you can give me a repro, maybe we can find and fix the problem. 3600 versus 4500 is not a surprising difference between arc and m&s, partly because I doubt Jester is particularly sympathetic, and a delta of 20% on an artificial test is rather meaningless, as I’ve mentioned.

When I started measuring slow benchmarks, arc took over 4s to do a base64 bench that was 1.5s under m&s. These are the worst deltas that I’ve seen.

That bench was down to 1.85s under arc a couple weeks ago (before views). To me, given the slope of improvement and the ROI, that’s good enough.

It’s easy to write code that is very fast under arc, and that’s part of what’s attractive about it. I’m not sure what is meant by “proper lifetime management” but again, don’t be shy: let’s see what sucks and fix it. 😉
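For instance, something like this (just a sketch of my own, not from any particular project) is the kind of code that tends to fly under arc: sink parameters let the compiler move ownership instead of copying, so there's no deep copy and no refcount churn on the hot path.

type
  Payload = object
    data: seq[byte]

proc store(dst: var seq[Payload], p: sink Payload) =
  # `sink` means the caller gives up ownership; arc moves the seq
  # instead of copying it.
  dst.add p

var box: seq[Payload]
var p = Payload(data: newSeq[byte](1024))
store(box, p)   # last use of `p`, so the compiler turns this into a move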

Upon running these tests, it became apparent that the slowdown with arc/orc occurs when using threads. When not using threads, orc is quite performant indeed.

Please advise if there is a better way to handle {.threadvar.} usage.

Using Nim 1.3.5 and testing with wrk2 -t2 -c20 -d10s -R200000 http://localhost:8000/ I get the following results:

(Requests per second are an average from 3 consecutive runs per test)

--threads:on

45767 r/s : nim c -r -d:release -d:danger --threads:on --gc:refc main
56876 r/s : nim c -r -d:release -d:danger --threads:on --gc:markAndSweep main
 9014 r/s : nim c -r -d:release -d:danger --threads:on --gc:orc main

--threads:off

23418 r/s : nim c -r -d:release -d:danger --threads:off --gc:refc main
29404 r/s : nim c -r -d:release -d:danger --threads:off --gc:markAndSweep main
28239 r/s : nim c -r -d:release -d:danger --threads:off --gc:orc main

main.nim

import strformat
import strutils
import re
import os

import jester
import json
import asyncnet
import asyncdispatch
import nwt

var
  tvTemplates {.threadvar.}: Nwt

proc getTemplates(): Nwt =
  if tvTemplates.isNil:                         # If thread local `tvTemplates` is Nil,
    tvTemplates = newNwt("templates/*.html")    # initialize for the current thread,
  return tvTemplates                            # and finally return it.

settings:
  port = 8000.Port
  staticDir = "static"
  appName = "" 
  bindAddr = "0.0.0.0"
  reusePort = false

routes:
  get "/":
    let templates = getTemplates()              # Retrieve the thread local `tvTemplates`,
    resp templates.renderTemplate("index.html") # and use it.
  
  get re"^\/(.*)\.(?:html|txt|css|js|min\.js|jpg|png|bmp|svg|gif)$":
    if "templates" notin request.path:
      sendFile(request.path.strip(chars={'/'}, leading=true))
  
  get "/explicitJSON":
    const data = $(%*{"message": "Hello, World!"})
    resp data, "application/json"
  
  get "/explicitJsonFromSeq":
    let test = @[
      %*{
        "message": "Hello, World!"
      },
      %*{
        "message2": @[
          %*{
            "nested": "works",
          }
        ]
      },
    ] 
    resp ($test).strip(chars={'@'}, leading=true), "application/json"
  
  get "/implicitJSON":
    resp %*{
      "string": "string",
      "number": 1,
      "float": 1.33
    }
  
  # get re"^\/(.*)\.txt$":
  #   resp request.matches[0]

This should probably be filed and discussed elsewhere.

@disruptek Sure. Where should I put it? Under Jester or under Nim?

@sinkingsugar Back to the issue of loading a FastAI model under NimTorch, I only need CPU support for now anyway. My use case is to serve predictions from a cloud machine without a nice GPU to use.

It looks like a Nim problem based upon your repro, but we will need to minimize it further.

Meaning (A) remove all the other routes, or (B) write a completely different example that uses threading without Jester?

Ideally, we can remove the jester dependency as well.

I'm trying to repro this, but I'm not sure how... httpbeast does not use parallel, so that could be why the following does not elicit the issue.

import times, threadpool, strformat

{.experimental: "parallel".}

type 
  MyType = ref object
    val: int

var
  tVar {.threadvar.} : MyType

proc newMyType*(val: int = 0): MyType =
  result = MyType(val: val)   # use `val` instead of silently dropping it

proc initVar(): MyType =
  if tVar.isNil:
    echo "initializing"
    tVar = newMyType()
  return tVar

proc process(i: int) =
  let t = initVar()   # thread-local instance for whichever pool thread runs this
  t.val += i
  # echo t.val

when isMainModule:
  let 
    t0 = cpuTime()
    seconds = 10.0
  var count = 0
  while cpuTime() - t0 < seconds:
    parallel:
      for i in 1..20:
        spawn process(i)
        inc count
    sync()
  echo &"{count} changes in {seconds} seconds."
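Since httpbeast (and therefore Jester) serves requests from a few long-lived worker threads created with createThread rather than from a threadpool, as far as I understand it, a repro shaped like that might be closer. Here is a rough sketch (not a verified reproduction of the slowdown) using the same threadvar lazy-init pattern:

import times, strformat

type
  MyType = ref object
    val: int

var tVar {.threadvar.}: MyType

proc initVar(): MyType =
  if tVar.isNil:
    tVar = MyType()   # one instance per worker thread
  result = tVar

proc worker(seconds: float) {.thread.} =
  # A long-lived thread hammering its own thread-local ref, loosely
  # mimicking an HTTP worker reusing per-thread state on every request.
  let t0 = epochTime()
  var count = 0
  while epochTime() - t0 < seconds:
    let v = initVar()
    v.val += 1
    inc count
  echo &"worker finished: {count} iterations"

when isMainModule:
  var threads: array[2, Thread[float]]
  for i in 0 ..< threads.len:
    createThread(threads[i], worker, 2.0)
  joinThreads(threads)
  # Build with e.g.: nim c -d:danger --threads:on --gc:orc repro.nim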

@sinkingsugar I am willing to help in adding support for importing .pth and .pkl files, if you could give me some direction on what needs to be done.