Training and evaluation work well when I change MAX_POINTS_PER_VOXEL from 32 to 100, but after converting the model to ONNX, TensorRT inference gives wrong predictions
Allamrahul opened this issue · 5 comments
Dataset: I am using a custom dataset with .npy files and annotations. I followed all the steps required for custom dataset preparation, and I get great results with PyTorch: 90% mAP on my eval set.
With MAX_POINTS_PER_VOXEL at the default value of 32, I get good results during evaluation, and more or less the same predictions during TensorRT inference as well.
After this, I increased MAX_POINTS_PER_VOXEL to 100 for better performance and saw better results during the evaluation phase. However, when I convert the model to ONNX and run TensorRT inference, I get wrong predictions.
Export script evolution:
Regarding the export process, exporter.py and simplifier_onnx.py are used. However, both scripts are hardcoded for the 3 classes of the KITTI dataset, while I have just one class to detect. I therefore referred to the following commit to make the ONNX export work: https://github.com/NVIDIA-AI-IOT/CUDA-PointPillars/pull/77/commits.
After this, I was able to export, but I then ran into the following issue: #82. I resolved it by tinkering with the export script, as described in the following comment: #77 (comment).
After that, I also changed the hard-coded MAX_VOXELS from 10000 to read the value from the config file (40000).
I believe there are still bugs lurking in the export script. Please look into this.
I am pasting my export script for reference:
exporter.py file
# SPDX-FileCopyrightText: Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import onnx
import torch
import argparse
import numpy as np
from pathlib import Path
from onnxsim import simplify
from pcdet.utils import common_utils
from pcdet.models import build_network
from pcdet.datasets import DatasetTemplate
from pcdet.config import cfg, cfg_from_yaml_file
from exporter_paramters import export_paramters as export_paramters
from simplifier_onnx import simplify_preprocess, simplify_postprocess
class DemoDataset(DatasetTemplate):
    def __init__(self, dataset_cfg, class_names, training=True, root_path=None, logger=None, ext='.bin'):
        """
        Args:
            root_path:
            dataset_cfg:
            class_names:
            training:
            logger:
        """
        super().__init__(
            dataset_cfg=dataset_cfg, class_names=class_names, training=training, root_path=root_path, logger=logger
        )
        self.root_path = root_path
        self.ext = ext
        data_file_list = glob.glob(str(root_path / f'*{self.ext}')) if self.root_path.is_dir() else [self.root_path]
        data_file_list.sort()
        self.sample_file_list = data_file_list

    def __len__(self):
        return len(self.sample_file_list)

    def __getitem__(self, index):
        if self.ext == '.bin':
            points = np.fromfile(self.sample_file_list[index], dtype=np.float32).reshape(-1, 4)
        elif self.ext == '.npy':
            points = np.load(self.sample_file_list[index])
        else:
            raise NotImplementedError

        input_dict = {
            'points': points,
            'frame_id': index,
        }

        data_dict = self.prepare_data(data_dict=input_dict)
        return data_dict
def parse_config():
    parser = argparse.ArgumentParser(description='arg parser')
    parser.add_argument('--cfg_file', type=str, default='cfgs/kitti_models/pointpillar.yaml',
                        help='specify the config for demo')
    parser.add_argument('--data_path', type=str, default='demo_data',
                        help='specify the point cloud data file or directory')
    parser.add_argument('--ckpt', type=str, default=None, help='specify the pretrained model')
    parser.add_argument('--ext', type=str, default='.bin', help='specify the extension of your point cloud data file')
    args = parser.parse_args()

    cfg_from_yaml_file(args.cfg_file, cfg)
    return args, cfg
def main():
    args, cfg = parse_config()
    export_paramters(cfg)
    logger = common_utils.create_logger()
    logger.info('------ Convert OpenPCDet model for TensorRT ------')
    demo_dataset = DemoDataset(
        dataset_cfg=cfg.DATA_CONFIG, class_names=cfg.CLASS_NAMES, training=False,
        root_path=Path(args.data_path), ext=args.ext, logger=logger
    )
    model = build_network(model_cfg=cfg.MODEL, num_class=len(cfg.CLASS_NAMES), dataset=demo_dataset)
    model.load_params_from_file(filename=args.ckpt, logger=logger, to_cpu=True)
    model.cuda()
    model.eval()
    np.set_printoptions(threshold=np.inf)
    with torch.no_grad():
        # MAX_VOXELS = 10000
        NUMBER_OF_CLASSES = len(cfg.CLASS_NAMES)
        MAX_POINTS_PER_VOXEL = None
        DATA_PROCESSOR = cfg.DATA_CONFIG.DATA_PROCESSOR
        POINT_CLOUD_RANGE = cfg.DATA_CONFIG.POINT_CLOUD_RANGE

        # Pull the voxelization parameters from the config instead of hard-coding them.
        for i in DATA_PROCESSOR:
            if i['NAME'] == "transform_points_to_voxels":
                MAX_POINTS_PER_VOXEL = i['MAX_POINTS_PER_VOXEL']
                VOXEL_SIZES = i['VOXEL_SIZE']
                MAX_VOXELS = i['MAX_NUMBER_OF_VOXELS']['test']
                break
        print("ra35 DEBUG MAX_POINTS_PER_VOXEL, VOXEL_SIZES, MAX_VOXELS ", MAX_POINTS_PER_VOXEL, VOXEL_SIZES, MAX_VOXELS)

        if MAX_POINTS_PER_VOXEL is None:
            logger.info('Could Not Parse Config... Exiting')
            import sys
            sys.exit()

        # NOTE: despite their names, VOXEL_SIZE_X/Y here are the BEV grid dimensions
        # (number of pillars along x/y), not voxel sizes. FEATURE_SIZE_X/Y is the
        # resolution of the detection-head outputs, i.e. the grid divided by 2,
        # assuming the default PointPillars output stride.
        VOXEL_SIZE_X = abs(POINT_CLOUD_RANGE[0] - POINT_CLOUD_RANGE[3]) / VOXEL_SIZES[0]
        VOXEL_SIZE_Y = abs(POINT_CLOUD_RANGE[1] - POINT_CLOUD_RANGE[4]) / VOXEL_SIZES[1]
        FEATURE_SIZE_X = VOXEL_SIZE_X / 2  # Is this number of bins?
        FEATURE_SIZE_Y = VOXEL_SIZE_Y / 2
        print("ra35 DEBUG FEATURE_SIZE_X FEATURE_SIZE_Y ", FEATURE_SIZE_X, FEATURE_SIZE_Y)

        dummy_voxels = torch.zeros(
            (MAX_VOXELS, MAX_POINTS_PER_VOXEL, 4),
            dtype=torch.float32,
            device='cuda:0')
        dummy_voxel_idxs = torch.zeros(
            (MAX_VOXELS, 4),
            dtype=torch.int32,
            device='cuda:0')
        dummy_voxel_num = torch.zeros(
            (1,),
            dtype=torch.int32,
            device='cuda:0')
        print("ra35 DEBUG MAX_VOXELS MAX_POINTS_PER_VOXEL", MAX_VOXELS, MAX_POINTS_PER_VOXEL)

        dummy_input = dict()
        dummy_input['voxels'] = dummy_voxels
        dummy_input['voxel_num_points'] = dummy_voxel_num
        dummy_input['voxel_coords'] = dummy_voxel_idxs
        dummy_input['batch_size'] = torch.tensor(1)

        torch.onnx.export(model,                     # model being run
                          dummy_input,               # model input (or a tuple for multiple inputs)
                          "./pointpillar_raw.onnx",  # where to save the model (can be a file or file-like object)
                          export_params=True,        # store the trained parameter weights inside the model file
                          opset_version=11,          # the ONNX version to export the model to
                          do_constant_folding=True,  # whether to execute constant folding for optimization
                          keep_initializers_as_inputs=True,
                          input_names=['voxels', 'voxel_num', 'voxel_idxs'],          # the model's input names
                          output_names=['cls_preds', 'box_preds', 'dir_cls_preds'],   # the model's output names
                          )

        onnx_raw = onnx.load("./pointpillar_raw.onnx")  # load onnx model
        onnx_trim_post = simplify_postprocess(onnx_raw, FEATURE_SIZE_X, FEATURE_SIZE_Y, NUMBER_OF_CLASSES)

        onnx_simp, check = simplify(onnx_trim_post)
        assert check, "Simplified ONNX model could not be validated"

        onnx_final = simplify_preprocess(onnx_simp, VOXEL_SIZE_Y, VOXEL_SIZE_X, MAX_POINTS_PER_VOXEL)
        onnx.save(onnx_final, "pointpillar.onnx")
        print('finished exporting onnx')
        logger.info('[PASS] ONNX EXPORTED.')


if __name__ == '__main__':
    main()
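As a sanity check (not part of the original scripts), a snippet like the following can confirm that the exported graphs actually picked up the new MAX_VOXELS / MAX_POINTS_PER_VOXEL values; it only assumes the file names written by exporter.py above:

import onnx

# Minimal sketch: print the static input shapes of both exported files so the
# MAX_VOXELS / MAX_POINTS_PER_VOXEL values baked into the graphs can be checked
# against the config. With keep_initializers_as_inputs=True the raw export also
# lists the weights as graph inputs, so those are filtered out here.
for path in ("pointpillar_raw.onnx", "pointpillar.onnx"):
    model = onnx.load(path)
    init_names = {init.name for init in model.graph.initializer}
    print(path)
    for inp in model.graph.input:
        if inp.name in init_names:
            continue
        dims = [d.dim_value for d in inp.type.tensor_type.shape.dim]
        print(" ", inp.name, dims)
# With MAX_VOXELS=40000 and MAX_POINTS_PER_VOXEL=100, 'voxels' should come out as
# [40000, 100, 4] in the raw model and [40000, 100, 10] in the simplified one.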
simplifier_onnx.py
# SPDX-FileCopyrightText: Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import onnx
import numpy as np
import onnx_graphsurgeon as gs
@gs.Graph.register()
def replace_with_clip(self, inputs, outputs, voxel_array):
    for inp in inputs:
        inp.outputs.clear()

    for out in outputs:
        out.inputs.clear()

    op_attrs = dict()
    op_attrs["dense_shape"] = voxel_array

    return self.layer(name="PPScatter_0", op="PPScatterPlugin", inputs=inputs, outputs=outputs, attrs=op_attrs)


def loop_node(graph, current_node, loop_time=0):
    for i in range(loop_time):
        next_node = [node for node in graph.nodes if len(node.inputs) != 0 and len(current_node.outputs) != 0 and node.inputs[0] == current_node.outputs[0]][0]
        current_node = next_node
    return next_node
def simplify_postprocess(onnx_model, FEATURE_SIZE_X, FEATURE_SIZE_Y, NUMBER_OF_CLASSES):
    print("Use onnx_graphsurgeon to adjust postprocessing part in the onnx...")
    graph = gs.import_onnx(onnx_model)

    cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, int(FEATURE_SIZE_Y), int(FEATURE_SIZE_X), 2 * NUMBER_OF_CLASSES * NUMBER_OF_CLASSES))
    box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, int(FEATURE_SIZE_Y), int(FEATURE_SIZE_X), 14 * NUMBER_OF_CLASSES))
    dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, int(FEATURE_SIZE_Y), int(FEATURE_SIZE_X), 4 * NUMBER_OF_CLASSES))

    tmap = graph.tensors()
    new_inputs = [tmap["voxels"], tmap["voxel_idxs"], tmap["voxel_num"]]
    new_outputs = [cls_preds, box_preds, dir_cls_preds]

    for inp in graph.inputs:
        if inp not in new_inputs:
            inp.outputs.clear()

    for out in graph.outputs:
        out.inputs.clear()

    first_ConvTranspose_node = [node for node in graph.nodes if node.op == "ConvTranspose"][0]
    concat_node = loop_node(graph, first_ConvTranspose_node, 3)
    assert concat_node.op == "Concat"

    first_node_after_concat = [node for node in graph.nodes if len(node.inputs) != 0 and len(concat_node.outputs) != 0 and node.inputs[0] == concat_node.outputs[0]]

    for i in range(3):
        transpose_node = loop_node(graph, first_node_after_concat[i], 1)
        assert transpose_node.op == "Transpose"
        transpose_node.outputs = [new_outputs[i]]

    graph.inputs = new_inputs
    graph.outputs = new_outputs
    graph.cleanup().toposort()

    return gs.export_onnx(graph)
def simplify_preprocess(onnx_model, VOXEL_SIZE_Y, VOXEL_SIZE_X, MAX_POINTS_PER_VOXEL):
    print("Use onnx_graphsurgeon to modify onnx...")
    graph = gs.import_onnx(onnx_model)

    tmap = graph.tensors()
    MAX_VOXELS = tmap["voxels"].shape[0]
    print("ra35 DEBUG VOXEL_SIZE_Y, VOXEL_SIZE_X ", VOXEL_SIZE_Y, VOXEL_SIZE_X)
    VOXEL_ARRAY = np.array([int(VOXEL_SIZE_Y), int(VOXEL_SIZE_X)])

    # voxels: [V, P, C']
    # V is the maximum number of voxels per frame
    # P is the maximum number of points per voxel
    # C' is the number of channels(features) per point in voxels.
    input_new = gs.Variable(name="voxels", dtype=np.float32, shape=(MAX_VOXELS, MAX_POINTS_PER_VOXEL, 10))

    # voxel_idxs: [V, 4]
    # V is the maximum number of voxels per frame
    # 4 is just the length of indexs encoded as (frame_id, z, y, x).
    X = gs.Variable(name="voxel_idxs", dtype=np.int32, shape=(MAX_VOXELS, 4))

    # voxel_num: [1]
    # Gives valid voxels number for each frame
    Y = gs.Variable(name="voxel_num", dtype=np.int32, shape=(1,))

    first_node_after_pillarscatter = [node for node in graph.nodes if node.op == "Conv"][0]

    first_node_pillarvfe = [node for node in graph.nodes if node.op == "MatMul"][0]

    next_node = current_node = first_node_pillarvfe
    for i in range(6):
        next_node = [node for node in graph.nodes if node.inputs[0] == current_node.outputs[0]][0]
        if i == 5:  # ReduceMax
            current_node.attrs['keepdims'] = [0]
            break
        current_node = next_node

    last_node_pillarvfe = current_node

    # merge some layers into one layer between inputs and outputs as below
    graph.inputs.append(Y)
    inputs = [last_node_pillarvfe.outputs[0], X, Y]
    outputs = [first_node_after_pillarscatter.inputs[0]]
    graph.replace_with_clip(inputs, outputs, VOXEL_ARRAY)

    # Remove the now-dangling subgraph.
    graph.cleanup().toposort()

    # just keep some layers between inputs and outputs as below
    graph.inputs = [first_node_pillarvfe.inputs[0], X, Y]
    graph.outputs = [tmap["cls_preds"], tmap["box_preds"], tmap["dir_cls_preds"]]
    graph.cleanup()

    # Rename the first tensor for the first layer
    graph.inputs = [input_new, X, Y]
    first_add = [node for node in graph.nodes if node.op == "MatMul"][0]
    first_add.inputs[0] = input_new
    graph.cleanup().toposort()

    return gs.export_onnx(graph)


if __name__ == '__main__':
    # Note: this standalone entry point still uses the old single-argument call,
    # so running this file directly would now fail; the export flow only calls
    # simplify_preprocess() from exporter.py with all four arguments.
    mode_file = "pointpillar-native-sim.onnx"
    simplify_preprocess(onnx.load(mode_file))
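Similarly, a rough check (again, not part of the original scripts) that simplify_preprocess() inserted the PPScatterPlugin node with the expected dense_shape:

import onnx

# Minimal sketch: locate the PPScatterPlugin node inserted by simplify_preprocess()
# and print its attributes, so dense_shape can be compared against the BEV grid
# (VOXEL_SIZE_Y, VOXEL_SIZE_X) computed in exporter.py; for the default KITTI range
# and 0.16 m voxels that would be [496, 432], a custom config will differ.
model = onnx.load("pointpillar.onnx")
for node in model.graph.node:
    if node.op_type == "PPScatterPlugin":
        print(node.name, list(node.input), list(node.output))
        for attr in node.attribute:
            print(attr)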
Hello, I am facing the same problem.
Since my point clouds are very dense, I increased MAX_POINTS_PER_VOXEL to 200, which leads to much better results in PyTorch. However, after ONNX conversion followed by TensorRT inference, all predicted bounding boxes are wrong.
Hi,
Is there any update on this issue?
@Allamrahul I am now facing the same issue. Did you find a solution? If I come across one, I will share it here.
Here is a comparison of an ONNX model with 64 point features versus 32. The 32 version works in TensorRT inference but the 64 version does not. Looking at the topology of these models, it looks to me that the ONNX export is successful and the issue lies somewhere in the CUDA code. I have tried adjusting the hard-coded values in kernel.h and got the same results. I am now training a model with 64 point features and 128 BEV features to see if it assumes the BEV features are double the point features.
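Roughly, the topology comparison mentioned above can be done like this (a minimal sketch; the two file names are placeholders for the 32- and 64-point exports):

import onnx
from collections import Counter

# Minimal sketch of the topology comparison described above: load the two exported
# models (placeholder file names), compare their op-type histograms and input shapes.
# If only the 'voxels' input dimension differs, the export itself looks structurally fine.
def summarize(path):
    model = onnx.load(path)
    ops = Counter(node.op_type for node in model.graph.node)
    inputs = {inp.name: [d.dim_value for d in inp.type.tensor_type.shape.dim]
              for inp in model.graph.input}
    return ops, inputs

ops_a, inputs_a = summarize("pointpillar_32.onnx")  # placeholder file name
ops_b, inputs_b = summarize("pointpillar_64.onnx")  # placeholder file name
print("op-type differences:", (ops_a - ops_b) | (ops_b - ops_a))
print("inputs (32):", inputs_a)
print("inputs (64):", inputs_b)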
Update: I can get valid results by adjusting the WARP_SIZE in CUDA-PointPillars/include/kernel.h (line 29 at commit 1218564) and the value of 32 in CUDA-PointPillars/src/preprocess_kernels.cu (line 197 at commit 1218564).
I guess the points per voxel is somehow tied to the number of threads that are used. By changing both of these values to the points-per-voxel value, I get correct results.
I am using 64 point features, and exceeding this value raises CUDA errors, but it gives a good place to start debugging further.