pytorch/examples

Drawbacks of making the C++ API look like Python

dannypike opened this issue · 10 comments

Thank you for creating a C++ version of PyTorch. However, I wonder if you could create an example that looks like C++ rather than Python?

The DCGAN sample project makes extensive use of auto to show how the API can be made to look and feel like Python, avoiding standard C++ things like unique_ptr<>, shared_ptr<>, etc.

However, I am a C++ programmer, not a Python programmer. I am very happy working with standard C++ things like classes with methods and smart pointers. The noble attempt to make it "feel like Python" with auto variables isn't helpful for me. For example, it assumes that I will be able to put my entire program into a single method. That's an unfortunate restriction, as I want to build, store, and pass objects between a number of different methods.

I have tried unwrapping the auto using some decltype() statements, but the PyTorch C++ templating makes this quite laborious. Perhaps that is an unavoidable result of the way the underlying library is built? If so, could you create a C++ example that shows how to unwrap the various templates in one case, splitting the operations across several methods of a class?
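
To give a trivial, hypothetical example of what I mean:

#include <torch/torch.h>
#include <iostream>

int main() {
  auto model = torch::nn::Sequential(torch::nn::Linear(4, 2));
  using ModelType = decltype(model);  // names the type: torch::nn::Sequential
  ModelType stored = model;           // the holder copies like a shared_ptr
  torch::Tensor out = stored->forward(torch::randn({1, 4}));
  std::cout << out.sizes() << '\n';
}

This simple case is manageable, but the types behind the data loaders in the DCGAN sample are much longer.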

Would that be straightforward to do? It would be a great help in getting an idea of how your templating structure works, and I could then build up from that.

I've only just started working with the library (that's why I'm looking at the example), so maybe I've missed something in the tutorial? If that's the case, I apologize, and would ask you to point me at the example that I should be looking at.

Many thanks,

Dan

@dannypike That's interesting feedback; perhaps @lancerts has bandwidth to help out.

Hi @dannypike, thanks for the feedback.

  • Feel free to use the following (from Effective Modern C++) instead of decltype(). I tried decltype() and it doesn't work nicely in this scenario.

#include <iostream>
#include <boost/type_index.hpp>

using boost::typeindex::type_id_with_cvr;

template <typename T> void f(const T& param) {
  using std::cout;
  // show T
  cout << "T = " << type_id_with_cvr<T>().pretty_name() << '\n';
  // show param's type
  cout << "param = " << type_id_with_cvr<decltype(param)>().pretty_name()
       << '\n';
}

int main() {
  auto x = 27;
  f(x);  // prints "T = int" and "param = int const&"
}

  • A helpful example to start from is cpp/custom-dataset, which touches the templating in line.

  • Do you have particular models/functionalities in mind? It would be interesting to build an example with a focus on standard C++; a rough sketch of the direction is below.
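
To give a flavour of that direction, here is a minimal, untested sketch that stores the model and optimizer as concretely-typed class members and splits the work across methods, instead of using auto locals in a single main(). The names (Net, Trainer) and layer sizes are invented for illustration:

#include <memory>
#include <torch/torch.h>

// A small MLP whose layers are concretely-typed members, not auto locals.
struct Net : torch::nn::Module {
  Net(int64_t in, int64_t hidden, int64_t out)
      : fc1(in, hidden), fc2(hidden, out) {
    register_module("fc1", fc1);
    register_module("fc2", fc2);
  }

  torch::Tensor forward(torch::Tensor x) {
    return fc2->forward(torch::relu(fc1->forward(x)));
  }

  torch::nn::Linear fc1, fc2;  // holder types behave like shared_ptr
};

// The model and optimizer live as named members, so construction,
// training and evaluation can sit in separate methods of a class.
class Trainer {
 public:
  Trainer()
      : net_(std::make_shared<Net>(784, 64, 10)),
        opt_(net_->parameters(), torch::optim::SGDOptions(0.01)) {}

  void train_step(const torch::Tensor& data, const torch::Tensor& target) {
    opt_.zero_grad();
    torch::Tensor loss = torch::nll_loss(
        torch::log_softmax(net_->forward(data), /*dim=*/1), target);
    loss.backward();
    opt_.step();
  }

 private:
  std::shared_ptr<Net> net_;
  torch::optim::SGD opt_;
};

Note that torch::Tensor and the torch::nn holder types are cheap reference-counted handles, so storing them as members and passing them between methods is inexpensive.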

Thank you very much, @lancerts.

The custom dataset sample looks like it will work much better as an introduction for me. It separates the four phases of data acquisition, model definition, training, and verification, which makes it a lot easier for me to "absorb" as a tutorial for the library.

At the moment, I do not have particular models that I want to build. I've been periodically reading up on the progress of AI since I first worked on neural networks back in the 1980s (that's why I'm a C++ coder). Way back then, simple multi-threading was a "cool" feature in a PC, no one had conceived of being able to play with a TPU, and we were only experimenting with simple MLPs.

More recently, I've been playing with Llama2 and now Llama3, and although I'm not particularly interested in LLMs as such, it's good to see how modern hardware makes machine learning practical on "home" machines.

That's why I'm now trying to learn PyTorch.

Hello, again. I've just built and run the sample (I'm doing this as a hobby project, in my spare time), but it doesn't seem to train for me "straight out of the repo".

These are the stats that I see:

Train Epoch: 1 400/7281	Loss: 0.0121669	Acc: 0
Train Epoch: 1 600/7281	Loss: 0.0159285	Acc: 0.00333333
Train Epoch: 1 800/7281	Loss: 0.0177225	Acc: 0.01
Train Epoch: 1 1000/7281	Loss: 0.0186938	Acc: 0.016
Train Epoch: 1 1200/7281	Loss: 0.0192844	Acc: 0.0283333
Train Epoch: 1 1400/7281	Loss: 0.0196614	Acc: 0.0371429
Train Epoch: 1 1600/7281	Loss: 0.0199116	Acc: 0.046875
Train Epoch: 1 1800/7281	Loss: 0.0201089	Acc: 0.0527778
Train Epoch: 1 2000/7281	Loss: 0.0202756	Acc: 0.057
Train Epoch: 1 2200/7281	Loss: 0.0203775	Acc: 0.06
Train Epoch: 1 2400/7281	Loss: 0.0204334	Acc: 0.0666667
Train Epoch: 1 2600/7281	Loss: 0.0204973	Acc: 0.0680769
Train Epoch: 1 2800/7281	Loss: 0.020593	Acc: 0.0692857
Train Epoch: 1 3000/7281	Loss: 0.0206406	Acc: 0.072
Train Epoch: 1 3200/7281	Loss: 0.020712	Acc: 0.07125
Train Epoch: 1 3400/7281	Loss: 0.0207482	Acc: 0.0738235
Train Epoch: 1 3600/7281	Loss: 0.0207949	Acc: 0.0738889
Train Epoch: 1 3800/7281	Loss: 0.0208227	Acc: 0.0734211
Train Epoch: 1 4000/7281	Loss: 0.0208353	Acc: 0.075
Train Epoch: 1 4200/7281	Loss: 0.0208277	Acc: 0.077381
Train Epoch: 1 4400/7281	Loss: 0.0208338	Acc: 0.0786364
Train Epoch: 1 4600/7281	Loss: 0.0208781	Acc: 0.0780435
Train Epoch: 1 4800/7281	Loss: 0.0208602	Acc: 0.0804167
Train Epoch: 1 5000/7281	Loss: 0.0208587	Acc: 0.0812
Train Epoch: 1 5200/7281	Loss: 0.0208668	Acc: 0.0817308
Train Epoch: 1 5400/7281	Loss: 0.020885	Acc: 0.0825926
Train Epoch: 1 5600/7281	Loss: 0.020907	Acc: 0.0826786
Train Epoch: 1 5800/7281	Loss: 0.0209292	Acc: 0.0827586
Train Epoch: 1 6000/7281	Loss: 0.0209262	Acc: 0.0843333
Train Epoch: 1 6200/7281	Loss: 0.0209183	Acc: 0.085
Train Epoch: 1 6400/7281	Loss: 0.0209223	Acc: 0.0854687
Train Epoch: 1 6600/7281	Loss: 0.020917	Acc: 0.0857576
Train Epoch: 1 6800/7281	Loss: 0.0208983	Acc: 0.0866176
Train Epoch: 1 7000/7281	Loss: 0.020905	Acc: 0.0874286
Train Epoch: 1 7200/7281	Loss: 0.0209088	Acc: 0.0884722
Train Epoch: 1 7281/7281	Loss: 0.0212617	Acc: 0.0902349
Train Epoch: 1 7281/7281	Loss: 0.0218321	Acc: 0.091883

The accuracy climbs steadily, which is what I would expect, but the loss looks wrong: it hovers near zero and mostly increases by a very small delta.

I don't see anything in the sample code that appears to initialize the model tensors to random values. Is that the problem, or is it happening behind the scenes? If the initialization is fine, do you have any idea why I'm seeing this?
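
If it is happening behind the scenes, I'd guess that something like this would at least make runs reproducible (my assumption, untested):

#include <torch/torch.h>
#include <iostream>

int main() {
  torch::manual_seed(42);       // fix the RNG before building the model
  torch::nn::Linear fc(10, 5);  // parameters seem to be randomly initialized here
  std::cout << fc->weight[0][0].item<float>() << '\n';  // same value on every run
}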

This is on Windows 10 with Visual Studio 2022 and CUDA 12.4, on a (very old) GeForce 750 with only 2 GB of VRAM.

I've just tried building it on WSL (Windows Subsystem for Linux) and I get the following error when I run the final make:

(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset$ mkdir build
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset$ cd build
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build$ cmake -DCMAKE_PREFIX_PATH=/mnt/e/Projects/AI/libtorch ..
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Torch: /mnt/e/Projects/AI/libtorch/lib/libtorch.so
-- Found OpenCV: /usr/local (found version "4.9.0") found components: core imgproc imgcodecs
-- OpenCV include dirs: /usr/local/include/opencv4
-- OpenCV libraries: opencv_core;opencv_imgproc;opencv_imgcodecs
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build$ make
Scanning dependencies of target custom-dataset
[ 50%] Building CXX object CMakeFiles/custom-dataset.dir/custom-dataset.cpp.o
[100%] Linking CXX executable custom-dataset
/usr/bin/ld: CMakeFiles/custom-dataset.dir/custom-dataset.cpp.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
/usr/bin/ld: /lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/custom-dataset.dir/build.make:91: custom-dataset] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/custom-dataset.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build$

I'm very much a noob on Linux, but I think I've done everything it says in your README for custom-dataset.

Do you know why it can't find pthreads?

@dannypike

  1. Can you try libtorch-cxx11-abi-shared-with-deps instead of libtorch-shared-with-deps? I encountered a different (but possibly related) issue here.
  2. The complete build/run code for cpp/custom-dataset is here, and a sample run is here.

@lancerts, thank you.

  1. I believe that I was using that version of libtorch. The zip file that I unpacked is called "libtorch-cxx11-abi-shared-with-deps-2.3.0+cpu.zip".

  2. I did a quick web search for that error, and one of the results says that I need to link the file explicitly (presumably libpthread?). I could try adding extra entries to the CMakeLists.txt, but I didn't want to do that in case I introduced another problem; I know how sensitive these things can be, given how fast the technology is changing.

Do you think it's worth updating the CMakeLists.txt to add libpthread as a link target, and test with that?

Yeah, certainly, it should be -lpthread.
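
If you'd rather patch it locally, the portable CMake spelling is roughly this (a sketch; I haven't tested it against this example's CMakeLists.txt):

# Link the threads library explicitly instead of a raw -lpthread flag.
find_package(Threads REQUIRED)
target_link_libraries(custom-dataset Threads::Threads)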

In addition, for a CPU run, you can refer to the following:

  • environment setup here [argparse is not needed for the custom-dataset example]
  • libtorch setup here
  • build custom-dataset here

Hypothesis: CUDA 12.4 is causing the issue you saw on Windows. Per https://pytorch.org/get-started/locally/, the libtorch C++ distribution only supports CUDA 11.8 and 12.1.
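
A quick sanity check you could run (a sketch) to see whether that libtorch build detects the GPU at all:

#include <torch/torch.h>
#include <iostream>

int main() {
  // If this prints false, this libtorch build cannot use the GPU.
  std::cout << "CUDA available: " << std::boolalpha
            << torch::cuda::is_available() << '\n';
}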

Thank you; that's much better!