Deep learning shines brightly in computer vision, and many problems that traditional methods cannot solve are being overcome one by one. However, the high computational cost also dramatically limits the use of deep learning, and its computationally intensive characteristics are particularly prominent on platforms with limited computing resources such as mobile devices and embedded devices. High performance is the pursuit of modern deep learning training frameworks. Researchers use many operators when implementing algorithms, and many of these operators are non-computational, such as transpose, zip, conditional slice, irregular split/concatenate, and top-k. The fine-grained operator takes more and more time in the model training process. There are few operators in the classification model, and the operators in the detection and other models take up even more than 50% of the time. Therefore, the high performance of the training framework relies heavily on the overall optimization of these continuous operators. In this article, the team members will specify the compilation platform, architecture, and instruction set during the compilation process. It can make the compilation as close as possible to the characteristics of the Arm architecture. For the optimized compilation of Arm processors and their instruction sets, the experimental results will be compared with the performance indicators of the algorithms compiled from general instructions.
CedricHwong/Practical-research-on-a-high-performance-convolution-operator-on-ARM-Processors-
Deep learning shines brightly in computer vision, and many problems that traditional methods cannot solve are being overcome one by one. However, the high computational cost also dramatically limits the use of deep learning, and its computationally intensive characteristics are particularly prominent on platforms with limited computing resources such as mobile devices and embedded devices. High performance is the pursuit of modern deep learning training frameworks. Researchers use many operators when implementing algorithms, and many of these operators are non-computational, such as transpose, zip, conditional slice, irregular split/concatenate, and top-k. The fine-grained operator takes more and more time in the model training process. There are few operators in the classification model, and the operators in the detection and other models take up even more than 50% of the time. Therefore, the high performance of the training framework relies heavily on the overall optimization of these continuous operators. In this article, the team members will specify the compilation platform, architecture, and instruction set during the compilation process. It can make the compilation as close as possible to the characteristics of the Arm architecture. For the optimized compilation of Arm processors and their instruction sets, the experimental results will be compared with the performance indicators of the algorithms compiled from general instructions.
HTML