`undefined symbol: _Unwind_GetRegionStart` on parse.py
darkmatter08 opened this issue · 8 comments
Hi Aditya -- Cool tool! I'm attempting to run your LeNet example, but I cannot get it to execute. Specifically, the program throws an error I cannot interpret when running parse.py. I suspect it's some kind of environment issue, but I cannot figure out how to fix it.
I created a fresh virtualenv with Python 3.5.5, cloned your package, and pip-installed it. I then ran through your LeNet example. I've provided as many details about my environment as possible below, including pip/python versions, CUDA version, PATH/LD_LIBRARY_PATH, nvprof --version, etc.
(pyprof2) /data/home/jains/Documents/pyprof2$ pip --version
pip 20.1.1 from /data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/pip (python 3.5)
(pyprof2) /data/home/jains/Documents/pyprof2$ pip3 --version
pip 20.1.1 from /data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/pip (python 3.5)
(pyprof2) :/data/home/jains/Documents/pyprof2/pyprof2$ python --version
Python 3.5.5 :: Anaconda, Inc.
(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2$ nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2015 NVIDIA Corporation
Release version 7.5.18 (21)
(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2$ nvidia-smi
Fri May 29 23:37:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00002587:00:00.0 Off | 0 |
| N/A 31C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
(pyprof2) /data/home/jains/Documents/pyprof2$ pip install . --user
ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.
(pyprof2) /data/home/jains/Documents/pyprof2$ pip install .
Processing /data/home/jains/Documents/pyprof2
Collecting torch>=1.2.0
Downloading torch-1.5.0-cp35-cp35m-manylinux1_x86_64.whl (752.0 MB)
Collecting cxxfilt>=0.2.0
Downloading cxxfilt-0.2.1-py2.py3-none-any.whl (3.9 kB)
Collecting tqdm>=4.35.0
Downloading tqdm-4.46.0-py2.py3-none-any.whl (63 kB)
Collecting numpy>=1.17.2
Downloading numpy-1.18.4-cp35-cp35m-manylinux1_x86_64.whl (20.0 MB)
Processing /data/home/jains/.cache/pip/wheels/a7/c1/ea/cf5bd31012e735dc1dfea3131a2d5eae7978b251083d6247bd/PyYAML-5.3.1-cp35-cp35m-linux_x86_64.whl
Processing /data/home/jains/.cache/pip/wheels/8b/99/a0/81daf51dcd359a9377b110a8a886b3895921802d2fc1b2397e/future-0.18.2-cp35-none-any.whl
Building wheels for collected packages: pyprof2
Building wheel for pyprof2 (setup.py) ... done
Created wheel for pyprof2: filename=pyprof2-1.0-py3-none-any.whl size=4815 sha256=4fd1bc76ef375131174d75c86296905265e2bc6bb970a684b5ad8a4b1f78fe0f
Stored in directory: /tmp/pip-ephem-wheel-cache-6vxwbbkf/wheels/7e/9d/a9/4de7cef177eb736526bf63a3d7b1914962ed6242cef64542b8
Successfully built pyprof2
Installing collected packages: future, numpy, torch, cxxfilt, tqdm, PyYAML, pyprof2
Successfully installed PyYAML-5.3.1 cxxfilt-0.2.1 future-0.18.2 numpy-1.18.4 pyprof2-1.0 torch-1.5.0 tqdm-4.46.0
(pyprof2) /data/home/jains/Documents/pyprof2$ pip freeze
cxxfilt==0.2.1
future==0.18.2
numpy==1.18.4
pyprof2 @ file:///data/home/jains/Documents/pyprof2
PyYAML==5.3.1
torch==1.5.0
tqdm==4.46.0
(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2/examples$ nvprof -f -o net.sql --profile-from-start off -- python lenet.py
Initializing NVTX monkey patches
Done with NVTX monkey patching
==2140== NVPROF is profiling process 2140, command: python lenet.py
==2140== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
==2140== Generated result file: /data/home/jains/Documents/pyprof2/pyprof2/examples/net.sql
(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2/examples$ python ../parse/parse.py net.sql
Traceback (most recent call last):
File "../parse/parse.py", line 12, in <module>
from kernel import Kernel
File "/data/home/jains/Documents/pyprof2/pyprof2/parse/kernel.py", line 1, in <module>
import cxxfilt, struct, binascii
File "/data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/cxxfilt/__init__.py", line 42, in <module>
libcxx = ctypes.CDLL(find_any_library('c++', 'stdc++'))
File "/data/anaconda/envs/py35/lib/python3.5/ctypes/__init__.py", line 351, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libc++.so.1: undefined symbol: _Unwind_GetRegionStart
(pyprof2) /data/home/jains/Documents/pyprof2$ pyprof2/parse/parse.py pyprof2/examples/net.sql
Traceback (most recent call last):
File "pyprof2/parse/parse.py", line 12, in <module>
from kernel import Kernel
File "/data/home/jains/Documents/pyprof2/pyprof2/parse/kernel.py", line 1, in <module>
import cxxfilt, struct, binascii
File "/data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/cxxfilt/__init__.py", line 42, in <module>
libcxx = ctypes.CDLL(find_any_library('c++', 'stdc++'))
File "/data/anaconda/envs/py35/lib/python3.5/ctypes/__init__.py", line 351, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libc++.so.1: undefined symbol: _Unwind_GetRegionStart
(pyprof2) /data/home/jains/Documents/pyprof2$ echo $PATH
/data/home/jains/Documents/env/pyprof2/bin:/home/jains/bin:/home/jains/.local/bin:/home/jains/bin:/home/jains/.local/bin:/data/anaconda/envs/py35/bin:/data/anaconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/caffe/build/install/bin:/usr/local/cuda/bin:/dsvm/tools/cntk/cntk/bin:/dsvm/tools/spark/current/bin:/opt/mssql-tools/bin:/opt/caffe/build/install/bin:/usr/local/cuda/bin:/dsvm/tools/cntk/cntk/bin:/dsvm/tools/spark/current/bin:/opt/mssql-tools/bin
(pyprof2) /data/home/jains/Documents/pyprof2$ echo $LD_LIBRARY_PATH
/opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/
This is definitely a conda / virtualenv / PATH issue. I can make a few suggestions:
a) Make sure you use pip3 and python3 explicitly.
b) Installing with the pip3 ... --user option installs packages in $HOME/.local/lib/python3.x/.... It looks like you got an error when you did that, so your $HOME is probably not set up correctly.
c) From LD_LIBRARY_PATH, it looks like your gcc version is 4.7. That's very old. Please check and fix your gcc, libc, and libstdc++ versions.
d) From the error, it appears the cxxfilt package is looking for a function _unwind_Get... in the C++ library (libc++) and cannot find it, hence the error. Related to (c).
Temporarily, I would suggest not using conda / virtualenv and just installing the package in your regular environment. Let me know if it helps.
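To help pin down suggestion (d), here is a small stdlib-only Python sketch (not part of pyprof2) that mirrors what cxxfilt does at import time per the traceback: locate a C++ runtime with ctypes.util.find_library and try to dlopen it directly. Running it should show which library resolves on your system and whether the dlopen fails with the same undefined-symbol error:

```python
import ctypes
import ctypes.util

def probe(names=('c++', 'stdc++')):
    """Try to locate and dlopen each candidate C++ runtime,
    mirroring cxxfilt's import-time logic."""
    results = {}
    for name in names:
        path = ctypes.util.find_library(name)
        if path is None:
            results[name] = 'not found'
            continue
        try:
            ctypes.CDLL(path)
            results[name] = 'loaded: ' + path
        except OSError as exc:
            # This is where the _Unwind_GetRegionStart error would surface.
            results[name] = 'failed: %s' % exc
    return results

if __name__ == '__main__':
    for name, status in probe().items():
        print(name, '->', status)
```

If 'c++' fails but 'stdc++' loads, the broken system libc++ (rather than cxxfilt itself) is the likely culprit.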
Hm. I tried python3/pip3 explicitly in my system environment (no virtualenv/conda).
I narrowed the problem down to import cxxfilt, which fails in all of my various Python environments with the same error: libc++.so.1: undefined symbol: _Unwind_GetRegionStart.
Looking at my installed libraries:
/usr/lib/x86_64-linux-gnu$ ll | grep libc++.so
lrwxrwxrwx 1 root root 13 Sep 14 2017 libc++.so.1 -> libc++.so.1.0
-rw-r--r-- 1 root root 932184 Sep 14 2017 libc++.so.1.0
So I suspect that libc++ is outdated. Would you point me to a way to properly upgrade libc++, and suggest which version I should install for pyprof2 to run successfully?
(Also, sorry if this question is out of scope. I appreciate all the help!)
Alternatively, could you point me to a docker image that would work with your tool?
@adityaiitb -- So I tried running with this docker image: pytorch/pytorch:1.4-cuda10.1-cudnn7-devel. The code executed (I didn't see the cxxfilt/libc++ error); however, it found zero kernels when following your example for lenet.py. I believe this was because of the "ERR_NVGPUCTRPERM" issue. I attempted to run with sudo within the container, but the container doesn't have sudo (which makes sense, since everything runs as root). I followed the NVIDIA instructions for "Command Line Control" on the host machine, but I still faced the same issue. Any idea how to work around this from within a docker container? (I know this question is a bit out of scope, but again, any help would be really appreciated!!)
root@0526e15ca352:~/code# git clone https://github.com/adityaiitb/pyprof2.git
Cloning into 'pyprof2'...
remote: Enumerating objects: 159, done.
remote: Counting objects: 100% (159/159), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 159 (delta 74), reused 127 (delta 43), pack-reused 0
Receiving objects: 100% (159/159), 57.89 KiB | 9.65 MiB/s, done.
Resolving deltas: 100% (74/74), done.
root@0526e15ca352:~/code# cd pyprof2/
root@0526e15ca352:~/code/pyprof2# ls
LICENSE README.md pyprof2 requirements.txt setup.py
root@0526e15ca352:~/code/pyprof2/pyprof2# pip --version
pip 19.3.1 from /opt/conda/lib/python3.7/site-packages/pip (python 3.7)
root@0526e15ca352:~/code/pyprof2/pyprof2# python --version
Python 3.7.4
root@0526e15ca352:~/code/pyprof2# pip install . --user
Processing /root/code/pyprof2
Collecting cxxfilt>=0.2.0
Downloading https://files.pythonhosted.org/packages/b9/d9/5cb1e86e11adbca3fc521601a3630cee194595a26adb0f961acac493b791/cxxfilt-0.2.1-py2.py3-none-any.whl
Requirement already satisfied: tqdm>=4.35.0 in /opt/conda/lib/python3.7/site-packages (from pyprof2==1.0) (4.36.1)
Requirement already satisfied: numpy>=1.17.2 in /opt/conda/lib/python3.7/site-packages (from pyprof2==1.0) (1.17.4)
Requirement already satisfied: PyYAML>=5.1 in /opt/conda/lib/python3.7/site-packages (from pyprof2==1.0) (5.2)
Building wheels for collected packages: pyprof2
Building wheel for pyprof2 (setup.py) ... done
Created wheel for pyprof2: filename=pyprof2-1.0-cp37-none-any.whl size=4816 sha256=ffc19a4bfaab37792dd9ad3e3ab91c6f8586594c7e7d9ba4fea2aff079259f46
Stored in directory: /tmp/pip-ephem-wheel-cache-7swlzmmr/wheels/cb/55/9f/7e5ab4f6b47f05fec53ab630e0a2e8185182b745894e5ad812
Successfully built pyprof2
Installing collected packages: cxxfilt, pyprof2
Successfully installed cxxfilt-0.2.1 pyprof2-1.0
root@0526e15ca352:~/code/pyprof2# nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.243 (21)
root@0526e15ca352:~/code/pyprof2# cd pyprof2/
root@0526e15ca352:~/code/pyprof2/pyprof2# nvprof -f -o net.sql --profile-from-start off -- python examples/lenet.py
Initializing NVTX monkey patches
Done with NVTX monkey patching
==52== NVPROF is profiling process 52, command: python examples/lenet.py
==52== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
==52== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
==52== Generated result file: /root/code/pyprof2/pyprof2/net.sql
root@0526e15ca352:~/code/pyprof2/pyprof2# parse/parse.py net.sql > net.dict
Found 0 kernels. Exiting.
root@0526e15ca352:~/code/pyprof2/pyprof2# ll -h # note net.dict is 0 bytes!
total 412K
drwxr-xr-x 6 root root 4.0K May 30 17:53 ./
drwxr-xr-x 4 root root 4.0K May 30 17:51 ../
-rw-r--r-- 1 root root 513 May 30 17:51 FAQs.md
-rw-r--r-- 1 root root 79 May 30 17:51 __init__.py
drwxr-xr-x 7 root root 4.0K May 30 17:51 examples/
-rw-r--r-- 1 root root 0 May 30 17:53 net.dict
-rw-r--r-- 1 root root 380K May 30 17:52 net.sql
drwxr-xr-x 2 root root 4.0K May 30 17:51 nvtx/
drwxr-xr-x 3 root root 4.0K May 30 17:53 parse/
drwxr-xr-x 2 root root 4.0K May 30 17:51 prof/
root@0526e15ca352:~/code/pyprof2/pyprof2# rm net.dict
root@0526e15ca352:~/code/pyprof2/pyprof2# parse/parse.py net.sql > net.dict
Found 0 kernels. Exiting.
root@0526e15ca352:~/code/pyprof2/pyprof2# prof/prof.py --csv net.dict > net.csv
root@0526e15ca352:~/code/pyprof2/pyprof2# ll net.csv
-rw-r--r-- 1 root root 66 May 30 17:54 net.csv
root@0526e15ca352:~/code/pyprof2/pyprof2# cat net.csv
"Idx","Direction","Sub","Module","Op","Kernel","Params","Sil(ns)"
Glad that you got past the previous errors. PyProf is independent of any environment: it works on bare metal, in docker containers, and in conda virtualenvs. If you want to use docker (in response to your earlier question), I would suggest using NVIDIA's docker containers. They have containers for almost all frameworks: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch.
The error you are observing right now is also not a PyProf issue. It's the fact that, in your environment, nvprof doesn't have privileges to read the GPU hardware counters (and hence the profile is empty). Please follow the instructions on the README page. It's a known problem (it might no longer be a problem in the latest docker container, but I'm not sure): https://github.com/adityaiitb/pyprof2#hardware-counters.
sudo within a docker container will not help. You can either fix the problem permanently, or temporarily by launching the docker container with the --privileged option.
Running with docker's --privileged option, combined with the hardware counters fix, worked! I was able to run your example code! Great stuff -- thanks!
Now I've run the tool on my own code, but it fails on pyprof2/prof/prof.py -w 150 net.dict. It seems that when parsing out the args for Addmm(), you expect 3 args (reasonable), but somehow there are 5 args. No idea why that is the case. I attached a log showing the issue below, as well as net.dict.
For more context, 64 is the batch size in this example, and I think this is trying to capture info about a Linear (784 -> 500) layer. It's not clear why there are int values mixed in with the tensors -- is this something to do with transposes, perhaps?
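For what it's worth, the two stray ints look like scalar beta/alpha arguments to addmm_ (beta=0, alpha=1 here), which would match an older positional calling convention. An illustrative sketch (this is not pyprof2's actual blas.py code) of a handler that tolerates them by filtering to the tensor args and recovering the GEMM dimensions from their shapes:

```python
# Illustrative only -- not pyprof2's actual Addmm handler. The idea: addmm_
# markers can carry scalar beta/alpha entries (the stray ints seen above),
# so keep only the tensor args before checking shapes.
def gemm_dims(args):
    tensors = [a for a in args if a['type'] == 'tensor']
    assert len(tensors) == 3, "expected out, mat1, mat2"
    out, mat1, mat2 = (t['shape'] for t in tensors)
    m, k = mat1
    k2, n = mat2
    assert k == k2 and tuple(out) == (m, n)
    return m, n, k

# The args list from the pdb session below:
args = [
    {'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'},
    {'name': '', 'type': 'int', 'value': 0},
    {'name': '', 'type': 'int', 'value': 1},
    {'name': '', 'type': 'tensor', 'shape': (64, 784), 'dtype': 'float32'},
    {'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'},
]
print(gemm_dims(args))  # GEMM with M=64, N=500, K=784
```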
root@0db73276e3df:~/code/pyprof2# python -m pdb pyprof2/prof/prof.py -w 150 net.dict
> /root/code/pyprof2/pyprof2/prof/prof.py(14)<module>()
-> """
(Pdb) c
Idx Direc Sub Module Op Kernel Params Sil(ns)
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/pdb.py", line 1701, in main
pdb._runscript(mainpyfile)
File "/opt/conda/lib/python3.7/pdb.py", line 1570, in _runscript
self.run(statement)
File "/opt/conda/lib/python3.7/bdb.py", line 585, in run
exec(cmd, globals, locals)
File "<string>", line 1, in <module>
File "/root/code/pyprof2/pyprof2/prof/prof.py", line 14, in <module>
"""
File "/root/code/pyprof2/pyprof2/prof/prof.py", line 222, in main
xx = foo(mod, op, d)
File "/root/code/pyprof2/pyprof2/prof/prof.py", line 116, in foo
xx = Addmm(d)
File "pyprof2/prof/blas.py", line 41, in __init__
assert (len(args) == 3)
AssertionError
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /root/code/pyprof2/pyprof2/prof/blas.py(41)__init__()
-> assert (len(args) == 3)
(Pdb) !args
[{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 784), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}]
(Pdb)
Post mortem debugger finished. The pyprof2/prof/prof.py will be restarted
> /root/code/pyprof2/pyprof2/prof/prof.py(14)<module>()
-> """
(Pdb)
root@0db73276e3df:~/code/pyprof2# python --version
Python 3.7.4
root@0db73276e3df:~/code/pyprof2# pip --version
pip 19.3.1 from /opt/conda/lib/python3.7/site-packages/pip (python 3.7)
net.dict:
{'kShortName': 'sgemm_32x32x32_NN_vec', 'kDuration': 36062, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:36'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addmm_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 784), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 400', 'addmm_, seq = 401'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [401], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addmm_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NN_vec'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1696, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:37'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'fill_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['linearUnified, seq = 400', 'fill_, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [400], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['fill_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'ger_kernel', 'kDuration': 3776, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:38'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addr_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 400', 'addr_, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [400], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addr_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (16, 2, 1), 'block': (256, 1, 1), 'kLongName': 'void ger_kernel<float, float, 256, 5, false>(cublasGerParams<float, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2464, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': 'inplace', 'type': 'bool', 'value': False}]}", "{'mod': 'torch', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['relu, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional', 'torch'], 'op': ['relu', 'relu'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_NN_vec', 'kDuration': 24031, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:36'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addmm_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 402', 'addmm_, seq = 403'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [403], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addmm_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NN_vec'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1184, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:37'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'fill_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['linearUnified, seq = 402', 'fill_, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [402], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['fill_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'ger_kernel', 'kDuration': 2944, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:38'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addr_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 402', 'addr_, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [402], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addr_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (16, 2, 1), 'block': (256, 1, 1), 'kLongName': 'void ger_kernel<float, float, 256, 5, false>(cublasGerParams<float, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1792, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': 'inplace', 'type': 'bool', 'value': False}]}", "{'mod': 'torch', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['relu, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional', 'torch'], 'op': ['relu', 'relu'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_NN', 'kDuration': 27678, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:36'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addmm_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500, 10), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 404', 'addmm_, seq = 405'], 'seqId': [404], 'subSeqId': 0, 'altSeqId': [405], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addmm_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (2, 1, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NN'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1184, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:37'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'fill_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['linearUnified, seq = 404', 'fill_, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [404], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['fill_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'ger_kernel', 'kDuration': 2816, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:38'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addr_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (10,), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 404', 'addr_, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [404], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addr_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 2, 1), 'block': (256, 1, 1), 'kLongName': 'void ger_kernel<float, float, 256, 5, false>(cublasGerParams<float, float>)'}
{'kShortName': 'softmax_warp_forward', 'kDuration': 3743, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'log_softmax', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}]}", "{'mod': 'Tensor', 'op': 'log_softmax', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['_log_softmax, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional', 'Tensor'], 'op': ['log_softmax', 'log_softmax'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (4, 1, 1), 'block': (16, 8, 1), 'kLongName': 'void (anonymous namespace)::softmax_warp_forward<float, float, float, 4, true>(float*, float const*, int, int, int)'}
{'kShortName': 'cunn_ClassNLLCriterion_updateOutput_kernel', 'kDuration': 7776, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:135'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'nll_loss', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'int64'}]}"], 'seqMarker': ['nll_loss, seq = 406'], 'seqId': [406], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional'], 'op': ['nll_loss'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 1, 1), 'kLongName': 'void cunn_ClassNLLCriterion_updateOutput_kernel<float, float>(float*, float*, float*, long*, float*, int, int, int, int, long)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 3040, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:141', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'backward', 'args': [{'name': '', 'type': 'float', 'value': 0.35390961170196533}]}", "{'mod': 'torch', 'op': 'ones_like', 'args': [{'name': '', 'type': 'float', 'value': 0.35390961170196533}]}"], 'seqMarker': [], 'seqId': [], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['Tensor', 'torch'], 'op': ['backward', 'ones_like'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<128, 4, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#4}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#4})'}
{'kShortName': 'cunn_ClassNLLCriterion_updateGradInput_kernel', 'kDuration': 4928, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['nll_loss_backward, seq = 0', 'NllLossBackward, seq = 406'], 'seqId': [406], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na', 'na'], 'op': ['nll_loss', 'NllLoss'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 1, 1), 'kLongName': 'void cunn_ClassNLLCriterion_updateGradInput_kernel<float>(float*, float*, long*, float*, float*, int, int, int, int, long)'}
{'kShortName': 'softmax_warp_backward', 'kDuration': 2784, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['_log_softmax_backward_data, seq = 0', 'LogSoftmaxBackward, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['LogSoftmax'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (4, 1, 1), 'block': (16, 8, 1), 'kLongName': 'void (anonymous namespace)::softmax_warp_backward<float, float, float, 4, true>(float*, float const*, float const*, int, int, int)'}
{'kShortName': 'sgemm_32x32x32_NT', 'kDuration': 13311, 'layer': [], 'trace': ['/root/code/functions.py:68'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (10, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 404'], 'seqId': [404], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NT'}
{'kShortName': 'sgemm_32x32x32_TN', 'kDuration': 12959, 'layer': [], 'trace': ['/root/code/functions.py:70'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 404'], 'seqId': [404], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (4, 1, 4), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_TN'}
{'kShortName': 'gemv2N_kernel', 'kDuration': 6656, 'layer': [], 'trace': ['/root/code/functions.py:72'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mv', 'args': [{'name': '', 'type': 'tensor', 'shape': (10, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}]}"], 'seqMarker': ['mv, seq = 0', 'linearUnifiedLegacyBackward, seq = 404'], 'seqId': [404], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mv'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (3, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void gemv2N_kernel<int, int, float, float, float, 128, 32, 4, 4, 1, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2112, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 0, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (10, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1600, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 1, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1920, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['threshold_backward, seq = 0', 'ReluBackward0, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['threshold'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2048, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'abs', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['abs, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['abs'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'reduce_kernel', 'kDuration': 13567, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'sum', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}]}"], 'seqMarker': ['sum, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['sum'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 16, 1), 'kLongName': 'void at::native::reduce_kernel<512, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4>)'}
{'kShortName': 'gatherTopK', 'kDuration': 25663, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void gatherTopK<float, unsigned int, 1, true>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, unsigned int, TensorInfo<float, unsigned int>, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int)'}
{'kShortName': 'bitonicSortKVInPlace', 'kDuration': 19487, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (64, 1, 1), 'kLongName': 'void bitonicSortKVInPlace<float, long, -2, -1, GTComp<float, true>, unsigned int, 128>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int, GTComp<float, true>)'}
{'kShortName': 'indexSelectLargeIndex', 'kDuration': 3488, 'layer': [], 'trace': ['/root/code/functions.py:54'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_select', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}]}"], 'seqMarker': ['index_select, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_select'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (40, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexSelectLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'indexSelectLargeIndex', 'kDuration': 3648, 'layer': [], 'trace': ['/root/code/functions.py:59'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_select', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}]}"], 'seqMarker': ['index_select, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_select'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (313, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexSelectLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'sgemm_32x32x32_NT_vec', 'kDuration': 13311, 'layer': [], 'trace': ['/root/code/functions.py:59'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 80), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (80, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NT_vec'}
{'kShortName': 'elementwise_kernel', 'kDuration': 3520, 'layer': [], 'trace': ['/root/code/functions.py:61'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'zero_', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['zero_, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['zero_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (489, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_TN_vec', 'kDuration': 12128, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (12, 1, 4), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_TN_vec'}
{'kShortName': 'indexCopyLargeIndex', 'kDuration': 6400, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_copy_', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}, {'name': '', 'type': 'tensor', 'shape': (500, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['index_copy_, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_copy_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (313, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexCopyLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'gemv2N_kernel', 'kDuration': 4320, 'layer': [], 'trace': ['/root/code/functions.py:65'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mv', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}]}"], 'seqMarker': ['mv, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mv'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (125, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void gemv2N_kernel<int, int, float, float, float, 128, 32, 4, 4, 1, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 5344, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 0, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (489, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1696, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 1, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2304, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['threshold_backward, seq = 0', 'ReluBackward0, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['threshold'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1696, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'abs', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['abs, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['abs'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'reduce_kernel', 'kDuration': 9919, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'sum', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}]}"], 'seqMarker': ['sum, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['sum'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 16, 1), 'kLongName': 'void at::native::reduce_kernel<512, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4>)'}
{'kShortName': 'gatherTopK', 'kDuration': 15423, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void gatherTopK<float, unsigned int, 1, true>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, unsigned int, TensorInfo<float, unsigned int>, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int)'}
{'kShortName': 'bitonicSortKVInPlace', 'kDuration': 12224, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (64, 1, 1), 'kLongName': 'void bitonicSortKVInPlace<float, long, -2, -1, GTComp<float, true>, unsigned int, 128>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int, GTComp<float, true>)'}
{'kShortName': 'indexSelectLargeIndex', 'kDuration': 2752, 'layer': [], 'trace': ['/root/code/functions.py:54'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_select', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}]}"], 'seqMarker': ['index_select, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_select'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (40, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexSelectLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 4639, 'layer': [], 'trace': ['/root/code/functions.py:61'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'zero_', 'args': [{'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['zero_, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['zero_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (766, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_TN_vec', 'kDuration': 6976, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (784, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (15, 1, 5), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_TN_vec'}
{'kShortName': 'indexCopyLargeIndex', 'kDuration': 6560, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_copy_', 'args': [{'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}, {'name': '', 'type': 'tensor', 'shape': (784, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['index_copy_, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_copy_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (448, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexCopyLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'gemv2N_kernel', 'kDuration': 3968, 'layer': [], 'trace': ['/root/code/functions.py:65'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mv', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}]}"], 'seqMarker': ['mv, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mv'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (125, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void gemv2N_kernel<int, int, float, float, float, 128, 32, 4, 4, 1, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1471, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['zero_, seq = 0', 'zeros, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['linearUnifiedLegacy'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (98, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 8767, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 0, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (766, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1728, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 1, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
Great to know you can run hello world! Thanks for trying out the tool and pointing out the bug.
- I just fixed the bug and now you should be able to run `prof.py`.
- `addmm` can receive 5 arguments: 3 tensors and 2 scalars. The two scalars are `alpha` and `beta`. The tool captures all arguments.
- The bug arose because PyTorch is a fast-moving target, and APIs can change slightly between versions, causing bugs / assertions. E.g., the function signature of `addmm` changed slightly from 1.2 to 1.5. I was unaware until you pointed it out.
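To make the 5-argument shape concrete: `torch.addmm` computes `beta * input + alpha * (mat1 @ mat2)`. A pure-Python sketch of that semantics (no torch dependency, plain nested lists standing in for tensors), just to illustrate why the op takes 3 tensors and 2 scalars:

```python
def addmm(inp, mat1, mat2, beta=1.0, alpha=1.0):
    """Compute beta * inp + alpha * (mat1 @ mat2) over nested-list matrices."""
    rows, inner, cols = len(mat1), len(mat2), len(mat2[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(mat1[i][k] * mat2[k][j] for k in range(inner))
            out[i][j] = beta * inp[i][j] + alpha * acc
    return out

# 2 * [[1, 1]] + 3 * ([[1, 2]] @ identity) -> [[5.0, 8.0]]
print(addmm([[1.0, 1.0]], [[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]], beta=2.0, alpha=3.0))
```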
v1.2: https://pytorch.org/docs/1.2.0/torch.html#torch.addmm
v1.5: https://pytorch.org/docs/stable/torch.html#torch.addmm
Once you confirm, I can close the bug.
@adityaiitb -- The fix seems to work, I'm able to generate csv files of the profiling results. Thanks!