Drawbacks of making the C++ API look like Python
dannypike opened this issue · 10 comments
Thank you for creating a C++ version of Pytorch. However, I wonder if you could create an example that looks like C++ and not like Python?
The DCGAN sample project makes extensive use of `auto` so that it can show how the API can be made to look and feel like Python, avoiding standard C++ things like `unique_ptr<>`, `shared_ptr<>` etc.
However, I am a C++ programmer, not a Python programmer. I am very happy working with standard C++ things like classes with methods and smart pointers. The noble attempt to make the code "feel like Python" with `auto` variables isn't helpful for me. For example, it assumes that I will be able to put my entire program into a single method. That's an unfortunate restriction, as I want to build, store and pass objects between a number of different methods.
I have tried unwrapping the `auto` using some `decltype()` statements, but the PyTorch C++ templating makes this quite laborious. Perhaps that is an unavoidable result of the way the underlying library is built? If so, could you create a C++ example that shows how to unwrap the various templates in one case, splitting the operations across several methods of a class?
Would that be straightforward to do? It would be a great help for me to get an idea of how your templating structure works, and I can then build up from that.
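To make it concrete, what I'm hoping to end up with is something along these lines: a rough, untested sketch with the libtorch types spelled out as class members instead of `auto` locals (the `Trainer` class, the layer sizes and the learning rate here are all made up purely for illustration):

```cpp
#include <torch/torch.h>

// Illustrative only: keep the libtorch objects as typed class members so they
// can be built in one method, stored, and used from another.
class Trainer {
 public:
  Trainer()
      : net_(torch::nn::Linear(784, 128),
             torch::nn::ReLU(),
             torch::nn::Linear(128, 10)),
        optimizer_(net_->parameters(), torch::optim::SGDOptions(0.01)) {}

  // One training step on a batch; target is expected to hold class indices.
  float step(const torch::Tensor& input, const torch::Tensor& target) {
    optimizer_.zero_grad();
    torch::Tensor output = net_->forward(input);
    torch::Tensor loss =
        torch::nll_loss(torch::log_softmax(output, /*dim=*/1), target);
    loss.backward();
    optimizer_.step();
    return loss.item<float>();
  }

 private:
  torch::nn::Sequential net_;   // module "holder" (copyable handle to the impl)
  torch::optim::SGD optimizer_; // references the parameters owned by net_
};
```

From what I can tell, the `torch::nn::*` types are "holder" classes wrapping a `shared_ptr` to the actual `Impl` class, so they can be stored and copied like ordinary values, but I'd like to see that spelled out in an official example.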
I've only just started working with the library (that's why I'm looking at the example), so maybe I've missed something in the tutorial? If that's the case, I apologize; could you point me at the example that I should be looking at?
Many thanks,
Dan
@dannypike That's interesting feedback, perhaps @lancerts has bandwidth to help out
Hi @dannypike, thanks for the feedback.
- Feel free to use the following (from Effective C++) instead of `decltype()`. I tried `decltype()` and it doesn't work nicely in this scenario (a usage sketch follows this list).
```cpp
#include <boost/type_index.hpp>
#include <iostream>

using boost::typeindex::type_id_with_cvr;

// Print the deduced template parameter T and the declared type of param.
template <typename T>
void f(const T& param) {
  using std::cout;
  // show T
  cout << "T = " << type_id_with_cvr<T>().pretty_name() << '\n';
  // show param's type
  cout << "param = " << type_id_with_cvr<decltype(param)>().pretty_name() << '\n';
}
```
- A helpful example to start with is cpp/custom-dataset, which touches the template in question (a rough sketch of that template also follows this list).
- Do you have particular models/functionalities in mind? It would be interesting to build an example with a focus on standard C++.
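For reference, here is roughly how the helper above can be applied to one of the sample's `auto` variables; the `torch::nn::Linear` is only a stand-in, and the helper is repeated so the snippet compiles on its own:

```cpp
#include <boost/type_index.hpp>
#include <iostream>
#include <torch/torch.h>

// Same helper as above, repeated so this snippet is self-contained.
template <typename T>
void f(const T& param) {
  std::cout << "T = "
            << boost::typeindex::type_id_with_cvr<T>().pretty_name() << '\n';
}

int main() {
  auto net = torch::nn::Linear(4, 2); // stand-in for one of the sample's auto variables
  f(net);                             // prints the module-holder type that auto deduced
}
```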
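And here is a rough, untested sketch of the template that cpp/custom-dataset is built around: deriving from the CRTP base `torch::data::datasets::Dataset<Self>` and implementing `get()` and `size()`. The tensor shapes and the `DummyDataset` name are made up purely for illustration:

```cpp
#include <torch/torch.h>
#include <iostream>

// Sketch of the pattern used by the custom-dataset example: derive from the
// CRTP base torch::data::datasets::Dataset<Self> and implement get()/size().
class DummyDataset : public torch::data::datasets::Dataset<DummyDataset> {
 public:
  // Return one example: an image-like tensor plus its label.
  torch::data::Example<> get(size_t index) override {
    torch::Tensor data = torch::rand({3, 64, 64});  // dummy "image"
    torch::Tensor label = torch::tensor(static_cast<int64_t>(index % 10));
    return {data, label};
  }

  // Total number of examples.
  torch::optional<size_t> size() const override { return 100; }
};

int main() {
  // Stack individual examples into batched tensors, then build a data loader.
  auto dataset = DummyDataset().map(torch::data::transforms::Stack<>());
  auto loader = torch::data::make_data_loader(std::move(dataset), /*batch_size=*/8);
  for (const auto& batch : *loader) {
    std::cout << batch.data.sizes() << " " << batch.target.sizes() << '\n';
  }
}
```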
Thank you very much, @lancerts.
The custom dataset sample looks like it will work much better as an introduction for me. It separates the four phases of data acquisition, model definition, training and verification, which makes it a lot easier for me to "absorb", as a tutorial for the library.
At the moment, I do not have particular models that I want to build. I've been periodically reading up on the progress of AI since I first worked in the field back in the 1980s (that's why I'm a C++ coder). Way back then, simple multi-threading was a "cool" feature in a PC; no-one had conceived of being able to play with a TPU, and we were only experimenting with simple MLPs.
More recently, I've been playing with Llama2 and now Llama3 and, although I'm not particularly interested in LLMs as such, it's good to see how modern hardware makes machine learning a practical thing on "home" hardware.
That's why I'm now trying to learn Pytorch.
Hello, again. I've just built and run the sample (I'm doing this as a hobby project, in my spare time), but it doesn't seem to train for me "straight out of the repo".
These are the stats that I see:
Train Epoch: 1 400/7281 Loss: 0.0121669 Acc: 0
Train Epoch: 1 600/7281 Loss: 0.0159285 Acc: 0.00333333
Train Epoch: 1 800/7281 Loss: 0.0177225 Acc: 0.01
Train Epoch: 1 1000/7281 Loss: 0.0186938 Acc: 0.016
Train Epoch: 1 1200/7281 Loss: 0.0192844 Acc: 0.0283333
Train Epoch: 1 1400/7281 Loss: 0.0196614 Acc: 0.0371429
Train Epoch: 1 1600/7281 Loss: 0.0199116 Acc: 0.046875
Train Epoch: 1 1800/7281 Loss: 0.0201089 Acc: 0.0527778
Train Epoch: 1 2000/7281 Loss: 0.0202756 Acc: 0.057
Train Epoch: 1 2200/7281 Loss: 0.0203775 Acc: 0.06
Train Epoch: 1 2400/7281 Loss: 0.0204334 Acc: 0.0666667
Train Epoch: 1 2600/7281 Loss: 0.0204973 Acc: 0.0680769
Train Epoch: 1 2800/7281 Loss: 0.020593 Acc: 0.0692857
Train Epoch: 1 3000/7281 Loss: 0.0206406 Acc: 0.072
Train Epoch: 1 3200/7281 Loss: 0.020712 Acc: 0.07125
Train Epoch: 1 3400/7281 Loss: 0.0207482 Acc: 0.0738235
Train Epoch: 1 3600/7281 Loss: 0.0207949 Acc: 0.0738889
Train Epoch: 1 3800/7281 Loss: 0.0208227 Acc: 0.0734211
Train Epoch: 1 4000/7281 Loss: 0.0208353 Acc: 0.075
Train Epoch: 1 4200/7281 Loss: 0.0208277 Acc: 0.077381
Train Epoch: 1 4400/7281 Loss: 0.0208338 Acc: 0.0786364
Train Epoch: 1 4600/7281 Loss: 0.0208781 Acc: 0.0780435
Train Epoch: 1 4800/7281 Loss: 0.0208602 Acc: 0.0804167
Train Epoch: 1 5000/7281 Loss: 0.0208587 Acc: 0.0812
Train Epoch: 1 5200/7281 Loss: 0.0208668 Acc: 0.0817308
Train Epoch: 1 5400/7281 Loss: 0.020885 Acc: 0.0825926
Train Epoch: 1 5600/7281 Loss: 0.020907 Acc: 0.0826786
Train Epoch: 1 5800/7281 Loss: 0.0209292 Acc: 0.0827586
Train Epoch: 1 6000/7281 Loss: 0.0209262 Acc: 0.0843333
Train Epoch: 1 6200/7281 Loss: 0.0209183 Acc: 0.085
Train Epoch: 1 6400/7281 Loss: 0.0209223 Acc: 0.0854687
Train Epoch: 1 6600/7281 Loss: 0.020917 Acc: 0.0857576
Train Epoch: 1 6800/7281 Loss: 0.0208983 Acc: 0.0866176
Train Epoch: 1 7000/7281 Loss: 0.020905 Acc: 0.0874286
Train Epoch: 1 7200/7281 Loss: 0.0209088 Acc: 0.0884722
Train Epoch: 1 7281/7281 Loss: 0.0212617 Acc: 0.0902349
Train Epoch: 1 7281/7281 Loss: 0.0218321 Acc: 0.091883
The accuracy climbs steadily, which is what I would expect, but the loss looks wrong. It's hovering around zero and mostly increases by a very small delta.
I don't see anything in the sample code that appears to initialize the model tensors to random values. Is that the problem, or is it happening behind the scenes? Either way, do you have any idea why I'm seeing this?
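From what I can tell, the modules are supposed to initialize their own weights when they are constructed, so to rule that out I was thinking of a quick check along these lines (the `torch::nn::Conv2d` is just a stand-in for one of the sample's layers, and the seed value is arbitrary):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::manual_seed(42); // make the behind-the-scenes random init repeatable

  // Constructing a module runs its default initialization, so the weights
  // should already be non-zero random values here.
  torch::nn::Conv2d conv(torch::nn::Conv2dOptions(3, 8, /*kernel_size=*/3));
  for (const auto& p : conv->named_parameters()) {
    std::cout << p.key() << ": mean=" << p.value().mean().item<double>()
              << " std=" << p.value().std().item<double>() << '\n';
  }
}
```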
This is on Windows 10 running Visual Studio 2022, with CUDA 12.4 on a (very old) GeForce 750 with only 2GB of VRAM.
I've just tried building it on WSL (Windows Subsystem for Linux), and I get the following error when I run the final `make`:
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset$ mkdir build
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset$ cd build
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build$ cmake -DCMAKE_PREFIX_PATH=/mnt/e/Projects/AI/libtorch ..
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Torch: /mnt/e/Projects/AI/libtorch/lib/libtorch.so
-- Found OpenCV: /usr/local (found version "4.9.0") found components: core imgproc imgcodecs
-- OpenCV include dirs: /usr/local/include/opencv4
-- OpenCV libraries: opencv_core;opencv_imgproc;opencv_imgcodecs
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build$ make
Scanning dependencies of target custom-dataset
[ 50%] Building CXX object CMakeFiles/custom-dataset.dir/custom-dataset.cpp.o
[100%] Linking CXX executable custom-dataset
/usr/bin/ld: CMakeFiles/custom-dataset.dir/custom-dataset.cpp.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
/usr/bin/ld: /lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/custom-dataset.dir/build.make:91: custom-dataset] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/custom-dataset.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
(ai) dan@NUTMEG:/mnt/e/Projects/AI/ubuntu/pytorch/examples/cpp/custom-dataset/build$
I'm very much a noob on Linux, but I think I've done everything it says in your README for custom-dataset.
Do you know why it can't find pthreads?
@lancerts, thank you.
- I believe that I was using that version of libtorch. The zip file that I unpacked is called "libtorch-cxx11-abi-shared-with-deps-2.3.0+cpu.zip".
- I did a quick web search for that error, and one of the results says that I need to link to the file explicitly (presumably libpthread?). I could try adding extra files to the CMakeLists.txt, but I didn't want to do that in case it introduced another problem. I know how sensitive these things can be, given that the technology is changing so fast.
Do you think it's worth updating the CMakeLists.txt to add libpthread as a link target, and test with that?
Yeah, certainly, it should be `-lpthread`.
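Roughly, the change to the example's CMakeLists.txt would be something like the following (untested sketch; `custom-dataset` is the target name the example already uses):

```cmake
# Find the platform thread library (pthread on Linux) and add it to the link line.
find_package(Threads REQUIRED)
target_link_libraries(custom-dataset Threads::Threads)
```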
In addition, for CPU run, you can refer to the
Hypothesis: CUDA 12.4 is causing issues. According to https://pytorch.org/get-started/locally/, libtorch C++ only supports CUDA 11.8 and 12.1.
Thank you; that's much better!