adityaiitb/pyprof2

`undefined symbol: _Unwind_GetRegionStart` on parse.py

darkmatter08 opened this issue · 8 comments

Hi Aditya -- Cool tool! I'm attempting to run your LeNet example, but I cannot get it to execute. Specifically, the program throws an error I cannot interpret when running parse.py. I suspect it's some kind of environment issue, but I cannot figure out how to fix it.

I created a fresh virtualenv with python 3.5.5. I cloned your package and pip install'ed it. I ran through your LeNet example. I've provided as many details about my environment as possible, including pip/python versions, cuda version, PATH/LD_LIBRARY_PATH, nvprof --version, etc.

(pyprof2) /data/home/jains/Documents/pyprof2$ pip --version
pip 20.1.1 from /data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/pip (python 3.5)
(pyprof2) /data/home/jains/Documents/pyprof2$ pip3 --version
pip 20.1.1 from /data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/pip (python 3.5)
(pyprof2) :/data/home/jains/Documents/pyprof2/pyprof2$ python --version
Python 3.5.5 :: Anaconda, Inc.
(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2$ nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2015 NVIDIA Corporation
Release version 7.5.18 (21)

(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2$ nvidia-smi
Fri May 29 23:37:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00002587:00:00.0 Off |                    0 |
| N/A   31C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

(pyprof2) /data/home/jains/Documents/pyprof2$ pip install . --user
ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.
(pyprof2) /data/home/jains/Documents/pyprof2$ pip install .
Processing /data/home/jains/Documents/pyprof2
Collecting torch>=1.2.0
  Downloading torch-1.5.0-cp35-cp35m-manylinux1_x86_64.whl (752.0 MB)
     |████████████████████████████████| 752.0 MB 1.8 kB/s
Collecting cxxfilt>=0.2.0
  Downloading cxxfilt-0.2.1-py2.py3-none-any.whl (3.9 kB)
Collecting tqdm>=4.35.0
  Downloading tqdm-4.46.0-py2.py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 754 kB/s
Collecting numpy>=1.17.2
  Downloading numpy-1.18.4-cp35-cp35m-manylinux1_x86_64.whl (20.0 MB)
     |████████████████████████████████| 20.0 MB 69.2 MB/s
Processing /data/home/jains/.cache/pip/wheels/a7/c1/ea/cf5bd31012e735dc1dfea3131a2d5eae7978b251083d6247bd/PyYAML-5.3.1-cp35-cp35m-linux_x86_64.whl
Processing /data/home/jains/.cache/pip/wheels/8b/99/a0/81daf51dcd359a9377b110a8a886b3895921802d2fc1b2397e/future-0.18.2-cp35-none-any.whl
Building wheels for collected packages: pyprof2
  Building wheel for pyprof2 (setup.py) ... done
  Created wheel for pyprof2: filename=pyprof2-1.0-py3-none-any.whl size=4815 sha256=4fd1bc76ef375131174d75c86296905265e2bc6bb970a684b5ad8a4b1f78fe0f
  Stored in directory: /tmp/pip-ephem-wheel-cache-6vxwbbkf/wheels/7e/9d/a9/4de7cef177eb736526bf63a3d7b1914962ed6242cef64542b8
Successfully built pyprof2
Installing collected packages: future, numpy, torch, cxxfilt, tqdm, PyYAML, pyprof2
Successfully installed PyYAML-5.3.1 cxxfilt-0.2.1 future-0.18.2 numpy-1.18.4 pyprof2-1.0 torch-1.5.0 tqdm-4.46.0

(pyprof2) /data/home/jains/Documents/pyprof2$ pip freeze
cxxfilt==0.2.1
future==0.18.2
numpy==1.18.4
pyprof2 @ file:///data/home/jains/Documents/pyprof2
PyYAML==5.3.1
torch==1.5.0
tqdm==4.46.0

(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2/examples$ nvprof -f -o net.sql --profile-from-start off -- python lenet.py
Initializing NVTX monkey patches
Done with NVTX monkey patching
==2140== NVPROF is profiling process 2140, command: python lenet.py
==2140== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
==2140== Generated result file: /data/home/jains/Documents/pyprof2/pyprof2/examples/net.sql

(pyprof2) /data/home/jains/Documents/pyprof2/pyprof2/examples$ python ../parse/parse.py net.sql
Traceback (most recent call last):
  File "../parse/parse.py", line 12, in <module>
    from kernel import Kernel
  File "/data/home/jains/Documents/pyprof2/pyprof2/parse/kernel.py", line 1, in <module>
    import cxxfilt, struct, binascii
  File "/data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/cxxfilt/__init__.py", line 42, in <module>
    libcxx = ctypes.CDLL(find_any_library('c++', 'stdc++'))
  File "/data/anaconda/envs/py35/lib/python3.5/ctypes/__init__.py", line 351, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libc++.so.1: undefined symbol: _Unwind_GetRegionStart

(pyprof2) /data/home/jains/Documents/pyprof2$ pyprof2/parse/parse.py pyprof2/examples/net.sql
Traceback (most recent call last):
  File "pyprof2/parse/parse.py", line 12, in <module>
    from kernel import Kernel
  File "/data/home/jains/Documents/pyprof2/pyprof2/parse/kernel.py", line 1, in <module>
    import cxxfilt, struct, binascii
  File "/data/home/jains/Documents/env/pyprof2/lib/python3.5/site-packages/cxxfilt/__init__.py", line 42, in <module>
    libcxx = ctypes.CDLL(find_any_library('c++', 'stdc++'))
  File "/data/anaconda/envs/py35/lib/python3.5/ctypes/__init__.py", line 351, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libc++.so.1: undefined symbol: _Unwind_GetRegionStart

(pyprof2) /data/home/jains/Documents/pyprof2$ echo $PATH
/data/home/jains/Documents/env/pyprof2/bin:/home/jains/bin:/home/jains/.local/bin:/home/jains/bin:/home/jains/.local/bin:/data/anaconda/envs/py35/bin:/data/anaconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/opt/caffe/build/install/bin:/usr/local/cuda/bin:/dsvm/tools/cntk/cntk/bin:/dsvm/tools/spark/current/bin:/opt/mssql-tools/bin:/opt/caffe/build/install/bin:/usr/local/cuda/bin:/dsvm/tools/cntk/cntk/bin:/dsvm/tools/spark/current/bin:/opt/mssql-tools/bin

(pyprof2) /data/home/jains/Documents/pyprof2$ echo $LD_LIBRARY_PATH
/opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/

This is definitely a conda / virtualenv / PATH issue. I can make a few suggestions:

a) Make sure you use pip3 and python3 explicitly.
b) Installing with the pip3 ... --user option installs packages into $HOME/.local/lib/python3.x/.... It looks like you got an error when you did that, which suggests your $HOME is not set up correctly.
c) From your LD_LIBRARY_PATH, it looks like your gcc version is 4.7, which dates back to 2012. Please check and fix your gcc, libc, and libstdc++ versions.
d) From the error, it appears the cxxfilt package is looking for the function _Unwind_GetRegionStart in the C++ library (libc++) and cannot find it, hence the error. Likely related to (c).

For now, I would suggest not using conda / virtualenv and just installing the package in your regular environment. Let me know if that helps.
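To confirm it's the environment (and not PyProf), you can reproduce the failure with ctypes alone. A minimal sketch, mirroring what your traceback shows cxxfilt/__init__.py doing -- it loads whichever of libc++/libstdc++ `find_library` locates first, so a broken libc++.so.1 fails the import even when libstdc++ is healthy:

```python
import ctypes
import ctypes.util

def try_load(name):
    """Attempt to dlopen a library by its short name.

    Returns the resolved path on success, None if the library is
    missing or fails to load (e.g. with an undefined-symbol error).
    """
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        ctypes.CDLL(path)
        return path
    except OSError as exc:  # e.g. "undefined symbol: _Unwind_GetRegionStart"
        print(f"{path}: {exc}")
        return None

# cxxfilt only needs one working demangler library; if 'c++' prints an
# undefined-symbol error here while 'stdc++' loads fine, the problem is
# your libc++ installation, not Python.
for lib in ("c++", "stdc++"):
    print(lib, "->", try_load(lib))
```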

Hm. I tried python3/pip3 explicitly, in my system environment (no virtualenv/conda).
I narrowed the problem down to `import cxxfilt`, which fails in all of my Python environments with the same error: `libc++.so.1: undefined symbol: _Unwind_GetRegionStart`.
Looking at my installed libraries:

/usr/lib/x86_64-linux-gnu$ ll | grep libc++.so
lrwxrwxrwx   1 root root        13 Sep 14  2017 libc++.so.1 -> libc++.so.1.0
-rw-r--r--   1 root root    932184 Sep 14  2017 libc++.so.1.0

So I suspect that libc++ is outdated. Would you point me to a way to properly upgrade libc++, and suggest which version I should install for pyprof2 to run successfully?

(Also, sorry if this question is out of scope. I appreciate all the help!)

Alternatively, could you point me to a docker image that would work with your tool?

@adityaiitb -- So I tried running with this Docker image: pytorch/pytorch:1.4-cuda10.1-cudnn7-devel. The code executed (I didn't see the cxxfilt/libc++ error); however, it found zero kernels when following your lenet.py example. I believe this is because of the "ERR_NVGPUCTRPERM" issue. I tried running with sudo inside the container, but the container doesn't have sudo (which makes sense, since everything runs as root). I also followed NVIDIA's instructions for "Command Line Control" on the host machine, but I still hit the same issue. Any idea how to work around this from within a Docker container? (I know this question is a bit out of scope, but again, any help would be really appreciated!)

root@0526e15ca352:~/code# git clone https://github.com/adityaiitb/pyprof2.git
Cloning into 'pyprof2'...
remote: Enumerating objects: 159, done.
remote: Counting objects: 100% (159/159), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 159 (delta 74), reused 127 (delta 43), pack-reused 0
Receiving objects: 100% (159/159), 57.89 KiB | 9.65 MiB/s, done.
Resolving deltas: 100% (74/74), done.

root@0526e15ca352:~/code# cd pyprof2/

root@0526e15ca352:~/code/pyprof2# ls
LICENSE  README.md  pyprof2  requirements.txt  setup.py

root@0526e15ca352:~/code/pyprof2/pyprof2# pip --version
pip 19.3.1 from /opt/conda/lib/python3.7/site-packages/pip (python 3.7)

root@0526e15ca352:~/code/pyprof2/pyprof2# python --version
Python 3.7.4

root@0526e15ca352:~/code/pyprof2# pip install . --user
Processing /root/code/pyprof2
Collecting cxxfilt>=0.2.0
  Downloading https://files.pythonhosted.org/packages/b9/d9/5cb1e86e11adbca3fc521601a3630cee194595a26adb0f961acac493b791/cxxfilt-0.2.1-py2.py3-none-any.whl
Requirement already satisfied: tqdm>=4.35.0 in /opt/conda/lib/python3.7/site-packages (from pyprof2==1.0) (4.36.1)
Requirement already satisfied: numpy>=1.17.2 in /opt/conda/lib/python3.7/site-packages (from pyprof2==1.0) (1.17.4)
Requirement already satisfied: PyYAML>=5.1 in /opt/conda/lib/python3.7/site-packages (from pyprof2==1.0) (5.2)
Building wheels for collected packages: pyprof2
  Building wheel for pyprof2 (setup.py) ... done
  Created wheel for pyprof2: filename=pyprof2-1.0-cp37-none-any.whl size=4816 sha256=ffc19a4bfaab37792dd9ad3e3ab91c6f8586594c7e7d9ba4fea2aff079259f46
  Stored in directory: /tmp/pip-ephem-wheel-cache-7swlzmmr/wheels/cb/55/9f/7e5ab4f6b47f05fec53ab630e0a2e8185182b745894e5ad812
Successfully built pyprof2
Installing collected packages: cxxfilt, pyprof2
Successfully installed cxxfilt-0.2.1 pyprof2-1.0

root@0526e15ca352:~/code/pyprof2# nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.243 (21)

root@0526e15ca352:~/code/pyprof2# cd pyprof2/

root@0526e15ca352:~/code/pyprof2/pyprof2# nvprof -f -o net.sql --profile-from-start off -- python examples/lenet.py
Initializing NVTX monkey patches
Done with NVTX monkey patching
==52== NVPROF is profiling process 52, command: python examples/lenet.py
==52== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
==52== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
==52== Generated result file: /root/code/pyprof2/pyprof2/net.sql

root@0526e15ca352:~/code/pyprof2/pyprof2# parse/parse.py net.sql > net.dict
Found 0 kernels. Exiting.

root@0526e15ca352:~/code/pyprof2/pyprof2# ll -h  # note net.dict is 0 bytes!
total 412K
drwxr-xr-x 6 root root 4.0K May 30 17:53 ./
drwxr-xr-x 4 root root 4.0K May 30 17:51 ../
-rw-r--r-- 1 root root  513 May 30 17:51 FAQs.md
-rw-r--r-- 1 root root   79 May 30 17:51 __init__.py
drwxr-xr-x 7 root root 4.0K May 30 17:51 examples/
-rw-r--r-- 1 root root    0 May 30 17:53 net.dict
-rw-r--r-- 1 root root 380K May 30 17:52 net.sql
drwxr-xr-x 2 root root 4.0K May 30 17:51 nvtx/
drwxr-xr-x 3 root root 4.0K May 30 17:53 parse/
drwxr-xr-x 2 root root 4.0K May 30 17:51 prof/

root@0526e15ca352:~/code/pyprof2/pyprof2# rm net.dict

root@0526e15ca352:~/code/pyprof2/pyprof2# parse/parse.py net.sql > net.dict
Found 0 kernels. Exiting.

root@0526e15ca352:~/code/pyprof2/pyprof2# prof/prof.py --csv net.dict > net.csv

root@0526e15ca352:~/code/pyprof2/pyprof2# ll net.csv
-rw-r--r-- 1 root root 66 May 30 17:54 net.csv

root@0526e15ca352:~/code/pyprof2/pyprof2# cat net.csv
"Idx","Direction","Sub","Module","Op","Kernel","Params","Sil(ns)"

Glad you got past the previous errors. PyProf is independent of the environment: it works on bare metal, in Docker containers, and in conda virtualenvs. If you want to use Docker (in response to your earlier question), I would suggest NVIDIA's containers; they cover almost every framework. https://ngc.nvidia.com/catalog/containers/nvidia:pytorch.

The error you are observing now is also not a PyProf issue. It's that, in your environment, nvprof doesn't have the privileges to read the GPU hardware counters, so the profile comes out empty. Please follow the instructions on the README page. It's a known problem (it might no longer be one in the latest Docker container, but I'm not sure). https://github.com/adityaiitb/pyprof2#hardware-counters.

sudo inside a Docker container will not help. You can either fix the problem permanently on the host, or work around it temporarily by launching the container with the --privileged option.
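For example, a launch along these lines (illustrative only: the NGC image tag is just an example, `--gpus all` assumes Docker 19.03+ with the NVIDIA container toolkit, and the modprobe file name is arbitrary):

```shell
# Temporary workaround: --privileged lets nvprof read the GPU hardware
# counters from inside the container (avoids ERR_NVGPUCTRPERM).
docker run --gpus all --privileged -it --rm \
    -v "$PWD":/workspace/code \
    nvcr.io/nvidia/pytorch:20.03-py3

# Permanent fix (on the host): allow non-admin users to profile,
# then reload the nvidia kernel modules or reboot.
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | \
    sudo tee /etc/modprobe.d/nvidia-profiling.conf
```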

Running with Docker's --privileged option, combined with the hardware-counters fix, worked! I was able to run your example code. Great stuff -- thanks!

Now I've run the tool on my own code, but it fails on `pyprof2/prof/prof.py -w 150 net.dict`. When parsing the args for Addmm(), you expect 3 args (reasonable), but somehow there are 5. I have no idea why; I've attached a log showing the issue below, along with net.dict.
For more context, 64 is the batch size in this example, and I think this call corresponds to a Linear (784 -> 500) layer. It's not clear why there are int values mixed in with the tensors -- is this something to do with transposes, perhaps?
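If I'm reading the marker right, the two ints could be the scalar beta/alpha of the older in-place overload `Tensor.addmm_(beta, alpha, mat1, mat2)` (with self recorded as the first arg), rather than transposes -- though that's a guess on my part. A quick sketch pulling the recorded args apart (`split_addmm_args` is just a name I made up, not part of pyprof2):

```python
# Recorded args for the failing Addmm call (copied from the pdb session below).
args = [
    {'name': '', 'type': 'tensor', 'shape': (64, 500),  'dtype': 'float32'},
    {'name': '', 'type': 'int',    'value': 0},
    {'name': '', 'type': 'int',    'value': 1},
    {'name': '', 'type': 'tensor', 'shape': (64, 784),  'dtype': 'float32'},
    {'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'},
]

def split_addmm_args(args):
    """Hypothetical helper: separate scalar (beta/alpha) args from tensor args."""
    scalars = [a['value'] for a in args if a['type'] in ('int', 'float')]
    tensors = [a['shape'] for a in args if a['type'] == 'tensor']
    return scalars, tensors

scalars, tensors = split_addmm_args(args)
print(scalars)  # [0, 1] -> consistent with beta=0, alpha=1
print(tensors)  # [(64, 500), (64, 784), (784, 500)] -> out, input, weight
```

With the scalars stripped out, the three remaining tensors match the 3-arg shape that the assertion in blas.py expects.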

root@0db73276e3df:~/code/pyprof2# python -m pdb pyprof2/prof/prof.py -w 150 net.dict
> /root/code/pyprof2/pyprof2/prof/prof.py(14)<module>()
-> """
(Pdb) c
Idx     Direc Sub Module          Op              Kernel                                          Params                                          Sil(ns)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/pdb.py", line 1701, in main
    pdb._runscript(mainpyfile)
  File "/opt/conda/lib/python3.7/pdb.py", line 1570, in _runscript
    self.run(statement)
  File "/opt/conda/lib/python3.7/bdb.py", line 585, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/root/code/pyprof2/pyprof2/prof/prof.py", line 14, in <module>
    """
  File "/root/code/pyprof2/pyprof2/prof/prof.py", line 222, in main
    xx = foo(mod, op, d)
  File "/root/code/pyprof2/pyprof2/prof/prof.py", line 116, in foo
    xx = Addmm(d)
  File "pyprof2/prof/blas.py", line 41, in __init__
    assert (len(args) == 3)
AssertionError
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /root/code/pyprof2/pyprof2/prof/blas.py(41)__init__()
-> assert (len(args) == 3)
(Pdb) !args
[{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 784), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}]
(Pdb)
Post mortem debugger finished. The pyprof2/prof/prof.py will be restarted
> /root/code/pyprof2/pyprof2/prof/prof.py(14)<module>()
-> """
(Pdb)
root@0db73276e3df:~/code/pyprof2# python --version
Python 3.7.4
root@0db73276e3df:~/code/pyprof2# pip --version
pip 19.3.1 from /opt/conda/lib/python3.7/site-packages/pip (python 3.7)

net.dict:

{'kShortName': 'sgemm_32x32x32_NN_vec', 'kDuration': 36062, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:36'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addmm_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 784), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 400', 'addmm_, seq = 401'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [401], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addmm_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NN_vec'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1696, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:37'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'fill_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['linearUnified, seq = 400', 'fill_, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [400], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['fill_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'ger_kernel', 'kDuration': 3776, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:38'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addr_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 400', 'addr_, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [400], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addr_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (16, 2, 1), 'block': (256, 1, 1), 'kLongName': 'void ger_kernel<float, float, 256, 5, false>(cublasGerParams<float, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2464, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': 'inplace', 'type': 'bool', 'value': False}]}", "{'mod': 'torch', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['relu, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional', 'torch'], 'op': ['relu', 'relu'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_NN_vec', 'kDuration': 24031, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:36'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addmm_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 402', 'addmm_, seq = 403'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [403], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addmm_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NN_vec'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1184, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:37'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'fill_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['linearUnified, seq = 402', 'fill_, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [402], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['fill_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'ger_kernel', 'kDuration': 2944, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:38'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addr_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 402', 'addr_, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [402], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addr_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (16, 2, 1), 'block': (256, 1, 1), 'kLongName': 'void ger_kernel<float, float, 256, 5, false>(cublasGerParams<float, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1792, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': 'inplace', 'type': 'bool', 'value': False}]}", "{'mod': 'torch', 'op': 'relu', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['relu, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional', 'torch'], 'op': ['relu', 'relu'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_NN', 'kDuration': 27678, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:36'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addmm_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}, {'name': '', 'type': 'int', 'value': 1}, {'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (500, 10), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 404', 'addmm_, seq = 405'], 'seqId': [404], 'subSeqId': 0, 'altSeqId': [405], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addmm_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (2, 1, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NN'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1184, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:37'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'fill_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['linearUnified, seq = 404', 'fill_, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [404], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['fill_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'ger_kernel', 'kDuration': 2816, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/code/modules.py:43', '/root/code/functions.py:38'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'addr_', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (10,), 'dtype': 'float32'}]}"], 'seqMarker': ['linearUnified, seq = 404', 'addr_, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [404], 'dir': 'fprop', 'mod': ['Tensor'], 'op': ['addr_'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 2, 1), 'block': (256, 1, 1), 'kLongName': 'void ger_kernel<float, float, 256, 5, false>(cublasGerParams<float, float>)'}
{'kShortName': 'softmax_warp_forward', 'kDuration': 3743, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:134', '/root/code/model.py:76', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'log_softmax', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}]}", "{'mod': 'Tensor', 'op': 'log_softmax', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 1}]}"], 'seqMarker': ['_log_softmax, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional', 'Tensor'], 'op': ['log_softmax', 'log_softmax'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (4, 1, 1), 'block': (16, 8, 1), 'kLongName': 'void (anonymous namespace)::softmax_warp_forward<float, float, float, 4, true>(float*, float const*, int, int, int)'}
{'kShortName': 'cunn_ClassNLLCriterion_updateOutput_kernel', 'kDuration': 7776, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:135'], 'reprMarkers': [], 'marker': ["{'mod': 'torch.nn.functional', 'op': 'nll_loss', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'int64'}]}"], 'seqMarker': ['nll_loss, seq = 406'], 'seqId': [406], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['torch.nn.functional'], 'op': ['nll_loss'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 1, 1), 'kLongName': 'void cunn_ClassNLLCriterion_updateOutput_kernel<float, float>(float*, float*, float*, long*, float*, int, int, int, int, long)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 3040, 'layer': [], 'trace': ['main.py:412', 'main.py:322', '/root/code/util.py:230', '/root/code/util.py:141', '/root/.local/lib/python3.7/site-packages/pyprof2/nvtx/nvmarker.py:95'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'backward', 'args': [{'name': '', 'type': 'float', 'value': 0.35390961170196533}]}", "{'mod': 'torch', 'op': 'ones_like', 'args': [{'name': '', 'type': 'float', 'value': 0.35390961170196533}]}"], 'seqMarker': [], 'seqId': [], 'subSeqId': 0, 'altSeqId': [], 'dir': 'fprop', 'mod': ['Tensor', 'torch'], 'op': ['backward', 'ones_like'], 'tid': 1640458048, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<128, 4, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#4}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#4})'}
{'kShortName': 'cunn_ClassNLLCriterion_updateGradInput_kernel', 'kDuration': 4928, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['nll_loss_backward, seq = 0', 'NllLossBackward, seq = 406'], 'seqId': [406], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na', 'na'], 'op': ['nll_loss', 'NllLoss'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 1, 1), 'kLongName': 'void cunn_ClassNLLCriterion_updateGradInput_kernel<float>(float*, float*, long*, float*, float*, int, int, int, int, long)'}
{'kShortName': 'softmax_warp_backward', 'kDuration': 2784, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['_log_softmax_backward_data, seq = 0', 'LogSoftmaxBackward, seq = 405'], 'seqId': [405], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['LogSoftmax'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (4, 1, 1), 'block': (16, 8, 1), 'kLongName': 'void (anonymous namespace)::softmax_warp_backward<float, float, float, 4, true>(float*, float const*, float const*, int, int, int)'}
{'kShortName': 'sgemm_32x32x32_NT', 'kDuration': 13311, 'layer': [], 'trace': ['/root/code/functions.py:68'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (10, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 404'], 'seqId': [404], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NT'}
{'kShortName': 'sgemm_32x32x32_TN', 'kDuration': 12959, 'layer': [], 'trace': ['/root/code/functions.py:70'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64, 10), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 404'], 'seqId': [404], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (4, 1, 4), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_TN'}
{'kShortName': 'gemv2N_kernel', 'kDuration': 6656, 'layer': [], 'trace': ['/root/code/functions.py:72'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mv', 'args': [{'name': '', 'type': 'tensor', 'shape': (10, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}]}"], 'seqMarker': ['mv, seq = 0', 'linearUnifiedLegacyBackward, seq = 404'], 'seqId': [404], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mv'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (3, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void gemv2N_kernel<int, int, float, float, float, 128, 32, 4, 4, 1, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2112, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 0, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (10, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1600, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 1, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1920, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['threshold_backward, seq = 0', 'ReluBackward0, seq = 403'], 'seqId': [403], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['threshold'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2048, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'abs', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['abs, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['abs'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'reduce_kernel', 'kDuration': 13567, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'sum', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}]}"], 'seqMarker': ['sum, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['sum'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 16, 1), 'kLongName': 'void at::native::reduce_kernel<512, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4>)'}
{'kShortName': 'gatherTopK', 'kDuration': 25663, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void gatherTopK<float, unsigned int, 1, true>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, unsigned int, TensorInfo<float, unsigned int>, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int)'}
{'kShortName': 'bitonicSortKVInPlace', 'kDuration': 19487, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (64, 1, 1), 'kLongName': 'void bitonicSortKVInPlace<float, long, -2, -1, GTComp<float, true>, unsigned int, 128>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int, GTComp<float, true>)'}
{'kShortName': 'indexSelectLargeIndex', 'kDuration': 3488, 'layer': [], 'trace': ['/root/code/functions.py:54'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_select', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}]}"], 'seqMarker': ['index_select, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_select'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (40, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexSelectLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'indexSelectLargeIndex', 'kDuration': 3648, 'layer': [], 'trace': ['/root/code/functions.py:59'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_select', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}]}"], 'seqMarker': ['index_select, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_select'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (313, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexSelectLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'sgemm_32x32x32_NT_vec', 'kDuration': 13311, 'layer': [], 'trace': ['/root/code/functions.py:59'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 80), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (80, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (8, 4, 1), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_NT_vec'}
{'kShortName': 'elementwise_kernel', 'kDuration': 3520, 'layer': [], 'trace': ['/root/code/functions.py:61'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'zero_', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['zero_, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['zero_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (489, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_TN_vec', 'kDuration': 12128, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (12, 1, 4), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_TN_vec'}
{'kShortName': 'indexCopyLargeIndex', 'kDuration': 6400, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_copy_', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}, {'name': '', 'type': 'tensor', 'shape': (500, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['index_copy_, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_copy_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (313, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexCopyLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'gemv2N_kernel', 'kDuration': 4320, 'layer': [], 'trace': ['/root/code/functions.py:65'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mv', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}]}"], 'seqMarker': ['mv, seq = 0', 'linearUnifiedLegacyBackward, seq = 402'], 'seqId': [402], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mv'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (125, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void gemv2N_kernel<int, int, float, float, float, 128, 32, 4, 4, 1, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 5344, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 0, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (489, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1696, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 1, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 2304, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['threshold_backward, seq = 0', 'ReluBackward0, seq = 401'], 'seqId': [401], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['threshold'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1}>(at::TensorIterator&, at::native::threshold_kernel_impl<float>(at::TensorIterator&, float, float)::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1696, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'abs', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['abs, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['abs'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (63, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1}>(at::TensorIterator&, at::native::abs_kernel_cuda(at::TensorIterator&)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'reduce_kernel', 'kDuration': 9919, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'sum', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 0}]}"], 'seqMarker': ['sum, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['sum'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (32, 16, 1), 'kLongName': 'void at::native::reduce_kernel<512, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_kernel_impl<float, float, float>(at::TensorIterator&)::{lambda(float, float)#1}>, unsigned int, float, 4>)'}
{'kShortName': 'gatherTopK', 'kDuration': 15423, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void gatherTopK<float, unsigned int, 1, true>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, unsigned int, TensorInfo<float, unsigned int>, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int)'}
{'kShortName': 'bitonicSortKVInPlace', 'kDuration': 12224, 'layer': [], 'trace': ['/root/code/functions.py:51'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'topk', 'args': [{'name': '', 'type': 'tensor', 'shape': (500,), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': 80}]}"], 'seqMarker': ['topk, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 1, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['topk'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (64, 1, 1), 'kLongName': 'void bitonicSortKVInPlace<float, long, -2, -1, GTComp<float, true>, unsigned int, 128>(TensorInfo<float, unsigned int>, unsigned int, unsigned int, unsigned int, TensorInfo<long, unsigned int>, unsigned int, GTComp<float, true>)'}
{'kShortName': 'indexSelectLargeIndex', 'kDuration': 2752, 'layer': [], 'trace': ['/root/code/functions.py:54'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_select', 'args': [{'name': '', 'type': 'tensor', 'shape': (64, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}]}"], 'seqMarker': ['index_select, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_select'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (40, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexSelectLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 4639, 'layer': [], 'trace': ['/root/code/functions.py:61'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'zero_', 'args': [{'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}]}"], 'seqMarker': ['zero_, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['zero_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (766, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'sgemm_32x32x32_TN_vec', 'kDuration': 6976, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mm', 'args': [{'name': '', 'type': 'tensor', 'shape': (784, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['mm, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mm'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (15, 1, 5), 'block': (128, 1, 1), 'kLongName': 'sgemm_32x32x32_TN_vec'}
{'kShortName': 'indexCopyLargeIndex', 'kDuration': 6560, 'layer': [], 'trace': ['/root/code/functions.py:62'], 'reprMarkers': [], 'marker': ["{'mod': 'Tensor', 'op': 'index_copy_', 'args': [{'name': '', 'type': 'tensor', 'shape': (784, 500), 'dtype': 'float32'}, {'name': '', 'type': 'int', 'value': -1}, {'name': '', 'type': 'tensor', 'shape': (80,), 'dtype': 'int64'}, {'name': '', 'type': 'tensor', 'shape': (784, 80), 'dtype': 'float32'}]}"], 'seqMarker': ['index_copy_, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['Tensor'], 'op': ['index_copy_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (448, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void indexCopyLargeIndex<float, unsigned int, 2, 2, -2, false>(TensorInfo<float, unsigned int>, TensorInfo<float, unsigned int>, TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long)'}
{'kShortName': 'gemv2N_kernel', 'kDuration': 3968, 'layer': [], 'trace': ['/root/code/functions.py:65'], 'reprMarkers': [], 'marker': ["{'mod': 'torch', 'op': 'mv', 'args': [{'name': '', 'type': 'tensor', 'shape': (500, 64), 'dtype': 'float32'}, {'name': '', 'type': 'tensor', 'shape': (64,), 'dtype': 'float32'}]}"], 'seqMarker': ['mv, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['torch'], 'op': ['mv'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (125, 1, 1), 'block': (128, 1, 1), 'kLongName': 'void gemv2N_kernel<int, int, float, float, float, 128, 32, 4, 4, 1, cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float> >(cublasGemvParams<cublasGemvTensor<float const>, cublasGemvTensor<float>, float>)'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1471, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['zero_, seq = 0', 'zeros, seq = 0', 'linearUnifiedLegacyBackward, seq = 400'], 'seqId': [400], 'subSeqId': 0, 'altSeqId': [], 'dir': 'bprop', 'mod': ['na'], 'op': ['linearUnifiedLegacy'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (98, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1}>(at::TensorIterator&, at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 8767, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 0, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (766, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}
{'kShortName': 'elementwise_kernel', 'kDuration': 1728, 'layer': [], 'trace': [], 'reprMarkers': [], 'marker': [], 'seqMarker': ['add_, seq = 0'], 'seqId': [], 'subSeqId': 1, 'altSeqId': [0], 'dir': 'fprop', 'mod': ['na'], 'op': ['add_'], 'tid': 3120559872, 'device': 0, 'stream': 7, 'grid': (1, 1, 1), 'block': (512, 1, 1), 'kLongName': 'void at::native::elementwise_kernel<512, 1, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2}>(int, at::native::gpu_kernel_impl<at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}>(at::TensorIterator&, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1} const&)::{lambda(int)#2})'}

Great to know you can run the hello world example! Thanks for trying out the tool and pointing out the bug.

  • I just fixed the bug and now you should be able to run prof.py.
  • `addmm` can receive five arguments: three tensors and two scalars. The two scalars are `alpha` and `beta`. The tool captures all arguments.
  • The bug arose because PyTorch is a fast-moving target, and APIs can change slightly between versions, which can cause bugs or assertion failures. For example, the function signature of `addmm` changed slightly between 1.2 and 1.5; I was unaware of this until you pointed it out.
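For reference, a minimal sketch of the `addmm` call the parser has to handle (the shapes here are made up for illustration; the computed formula is `out = beta * input + alpha * (mat1 @ mat2)`):

```python
import torch

# addmm takes three tensors (input, mat1, mat2) and two scalars (beta, alpha):
#   out = beta * input + alpha * (mat1 @ mat2)
inp = torch.zeros(2, 3)
mat1 = torch.ones(2, 4)
mat2 = torch.ones(4, 3)

# Each entry of mat1 @ mat2 is 4.0; scaled by alpha=0.5 this gives 2.0,
# and beta * inp contributes nothing since inp is all zeros.
out = torch.addmm(inp, mat1, mat2, beta=1.0, alpha=0.5)
print(out)  # every entry is 2.0
```

In 1.5 the scalars are keyword-only, which is the kind of small signature change that tripped up the argument capture.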

v1.2: https://pytorch.org/docs/1.2.0/torch.html#torch.addmm
v1.5: https://pytorch.org/docs/stable/torch.html#torch.addmm

Once you confirm the fix works, I can close the bug.

@adityaiitb -- The fix seems to work; I'm able to generate CSV files of the profiling results. Thanks!