Papers

layout

Page segmentation using convolutional neural network and graphical model -Lixiaohui, DAS2020
Printed/Handwritten Texts and Graphics Separation in Complex Documents Using Conditional Random Fields -LiXiaohui, DAS2018

asr

Conformer: Convolution-augmented Transformer for Speech Recognition -google, Interspeech2020, code1, code2
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context -google, Interspeech2020, code
Investigation of modeling units for mandarin speech recognition using DFSMN-CTC-sMBR -alibaba, ICASSP2019
Sequence discriminative distributed training of long short-term memory recurrent neural networks -google
Sequence-discriminative training of deep neural networks -INTERSPEECH2013

Contextual Biasing

Improved recognition of contact names in voice commands -google, ICASSP2015
Bringing contextual information to google speech recognition -google, INTERSPEECH2015
Shallow-Fusion End-to-End Contextual Biasing -google, INTERSPEECH2019
Streaming End-to-end Speech Recognition for Mobile Devices -google, ICASSP2019

table detection & recognition

Robust Table Detection and Structure Recognition from Heterogeneous Document Images -huoqiang, arxiv2022
Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images -arxiv2021
Guided Table Structure Recognition through Anchor Optimization -arxiv2021
TabAug: Data Driven Augmentation for Enhanced Table Structure Recognition -ICDAR2021
PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Table Image Recognition to Latex -pingan, arxiv2021
LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment -ICDAR2021
ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX -arxiv2021
TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition -JD, ICCV2021, code
Parsing Table Structures in the Wild -alibaba, ICCV2021, dataset
TNCR: Table Net Detection and Classification Dataset -arxiv2021, dataset
Form2Seq : A Framework for Higher-Order Form Structure Extraction -EMNLP2020
Table Structure Recognition using Top-Down and Bottom-Up Cues -ECCV2020
Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images -ICDAR2019
Image-based table recognition: data, model, and evaluation -arxiv2019
Deep Splitting and Merging for Table Structure Decomposition -ICDAR2019
Deepdesrt: Deep learning for detection and structure recognition of tables in document images -ICDAR2017

mathematical expression recognition

When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition -baixiang, ECCV2022, code
Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual Learning -tencent, AAAI2022, code
Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer -huawei, ICDAR2021
Graph-to-graph: towards accurate and interpretable online handwritten mathematical expression recognition -wujinwen, AAAI2021
ICFHR 2020 Competition on Offline Recognition and Spotting of Handwritten Mathematical Expressions - OffRaSHME -Wangdahan, ICFHR2020
Improvement of End-to-End Offline Handwritten Mathematical Expression Recognition by Weakly Supervised Learning -ICFHR2020
Improving Attention-Based Handwritten Mathematical Expression Recognition with Scale Augmentation and Drop Attention -Jinlianwen, ICFHR2020
EDSL: An Encoder-Decoder Architecture with Symbol-Level Features for Printed Mathematical Expression Recognition -arxiv2020, code
Handwritten mathematical expression recognition via paired adversarial learning -WuJinwen, IJCV2020
Stroke Constrained Attention Network for Online Handwritten Mathematical Expression Recognition -DuJun, arxiv2020, code
SRD: A Tree Structure Based Decoder for Online Handwritten Mathematical Expression Recognition -DuJun, TMM2020, code
A Tree-Structured Decoder for Image-to-Markup Generation -Dujun, ICML2020, code
Multi-modal Attention Network for Handwritten Mathematical Expression Recognition -DuJun, ICDAR2019
Robust Encoder-Decoder Learning Framework towards Offline Handwritten Mathematical Expression Recognition Based on Multi-Scale Deep Neural Network -arxiv2019
Track, attend, and parse (tap): An end-to-end framework for online handwritten mathematical expression recognition -DuJun, TMM2018, code
Multi-scale attention with dense encoder for handwritten mathematical expression recognition -DuJun, ICPR2018, code
Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition -J Zhang, J Du, S Zhang, D Liu, Y Hu, J Hu, S Wei, PR2017, code, code2
A GRU-Based Encoder-Decoder Approach with Attention for Online Handwritten Mathematical Expression Recognition -Dujun, ICDAR2017, code
Image-to-markup generation with coarse-to-fine attention -Dengyuntian, ICML2017, code
What you get is what you see: A visual markup decompiler -Dengyuntian, arxiv2016, code
Context-aware mathematical expression recognition: An end-to-end framework and a benchmark -Hewenhao, ICPR2016
ICFHR2016 CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions -ICFHR2016
An integrated grammar-based approach for mathematical expression recognition -PR2016

word_vector

Enriching Word Vectors with Subword Information -FAIR, arxiv2016
fastText: Bag of Tricks for Efficient Text Classification -FAIR, arxiv2016
An empirical evaluation of doc2vec with practical insights into document embedding generation -Jey Han Lau, Timothy Baldwin, arxiv2016
TagSpace: Semantic Embeddings from Hashtags -FAIR, EMNLP2014
doc2vec: Distributed Representations of Sentences and Documents -google, ICML2014
word2vec: Efficient estimation of word representations in vector space -google, ICLR2013

Seq2Seq

Convolutional Sequence to Sequence Learning -FAIR, arxiv2017
A Convolutional Encoder Model for Neural Machine Translation -FAIR, arxiv2016
Sequence level training with recurrent neural networks -FAIR, ICLR2016

ReID

Alignedreid: Surpassing human-level performance in person re-identification -Face++, arxiv2017

PoseEstimation

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields -CMU, CVPR2017
AlphaPose

EdgeDetection

Pixel Difference Networks for Efficient Edge Detection -ICCV2021, code
Richer Convolutional Features for Edge Detection -YunLiu, ..., Baixiang et al., PAMI2019
Deepedge: A multi-scale bifurcated deep network for top-down contour detection -Gedas, CVPR2015

line segmentation

M-LSD: Towards Light-weight and Real-time Line Segment Detection -NAVER, AAAI2022, code
Deep Hough Transform for Semantic Line Detection -PAMI2021, code
Holistically-Attracted Wireframe Parsing -CVPR2020, code
EDlines

video_classification

Learnable pooling with Context Gating for video classification -A. Miech, et al, TPAMI2018, Youtube8M-Competition-Top1
Truly Multi-modal YouTube-8M Video Classification with Video, Audio, and Text -Zhe Wang, et al, arxiv2017

dnn_base

Group Normalization -Kaiming He, et al, arxiv2018
Graph Convolutional Network -Xiaolong Wang, Yufei Ye, Abhinav Gupta, CVPR2018
DetNAS: Backbone Search for Object Detection
Mixup

light network

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning -ICLR2022, code
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications -arxiv2022, code
EfficientFormer: Vision Transformers at MobileNet Speed -apple, arxiv2022, code
UNeXt: MLP-based Rapid Medical Image Segmentation Network -arxiv2022, code
TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation -tencent, CVPR2022, code
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer -apple, ICLR2022, code
TinyNet-Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets -huawei, NeurIPS2020
GhostNet: More Features from Cheap Operations -huawei, CVPR2020
EfficientNet
SqueezeNet
Mobilenets -google, arxiv2017
MobileNet-V2 -google, CVPR2018, caffe-code
MobileNetV3
NasNet-A-Learning transferable architectures for scalable image recognition -google brain, CoRR2017
ShuffleNet -megvii, CoRR2017
ShuffleNetV2
ThunderNet
DarkNet/Tiny YOLOv3/Tiny YOLOv2/Yolo-Nano/SlimYOLO/YOLO-LITE/Gaussian YOLOv3
LightweightNet: Toward fast and lightweight convolutional neural networks via architecture distillation -XuTingbin, PR2019
Mobilefacenets
EXTD: Extremely Tiny Face Detector via Iterative Filter Reuse
Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution
HetConv: Heterogeneous Kernel-Based Convolutions for Deep CNNs
Joint Architecture and Knowledge Distillation in Convolutional Neural Network for Offline Handwritten Chinese Text Recognition -dujun, arxiv2019
Compressing CNN-DBLSTM models for OCR with teacher-student learning and Tucker decomposition -huoqiang, PR2019
vovnet
http://openaccess.thecvf.com/content_CVPRW_2019/papers/CEFRL/Lee_An_Energy_and_GPU-Computation_Efficient_Backbone_Network_for_Real-Time_Object_CVPRW_2019_paper.pdf

network

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios -bytedance, arxiv2022
TRT-ViT: TensorRT-oriented Vision Transformer -bytedance, arxiv2022

model compression

Distillation: teacher-student/mutual-learning/Self-Distillation
Tensor decomposition: low-rank/SVD-decomposition/Tucker-decomposition/CP-decomposition
Pruning
Quantization
Encoding

InformationExtraction

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking -microsoft, arxiv2022, code
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding -jinlianwen, ACL2022, code
XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding -ant group, CVPR2022
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents -naver, AAAI2022, code
StructuralLM: Structural Pre-training for Form Understanding -alibaba, ACL2021
UniDoc: Unified Pretraining Framework for Document Understanding -adobe, NeurIPS2021
DocFormer: End-to-End Transformer for Document Understanding -amazon, ICCV2021
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis -ICDAR2021
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer -ICDAR2021
Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution -jinlianwen, AAAI2021
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding -microsoft, arxiv2021
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding -microsoft, ACL2021, code
SelfDoc: Self-Supervised Document Representation Learning -adobe, CVPR2021
LayoutLM: Pre-training of Text and Layout for Document Image Understanding -microsoft, KDD2020, code
TRIE: End-to-End Text Reading and Information Extraction for Document Understanding -hikvision, arxiv2020

knowledge distillation

Decoupled Knowledge Distillation -megvii, CVPR2022, code
Efficient knowledge distillation for rnn-transducer models -google/facebook, ICASSP2021
Investigation of Sequence-level Knowledge Distillation Methods for CTC Acoustic Models -NICT japan, ICASSP2019
Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation -IBM, Interspeech2019
Explaining sequence-level knowledge distillation as data-augmentation for neural machine translation -arxiv2019
Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion -microsoft, Interspeech2019
Knowledge Distillation for Sequence Model -AISpeech, Interspeech2018
Improved knowledge distillation from bi-directional to uni-directional LSTM CTC for end-to-end speech recognition -IBM, SLT2018
An Investigation of a Knowledge Distillation Method for CTC Acoustic Models -NICT japan, ICASSP2018
Sequence-Level Knowledge Distillation -Yoon Kim, EMNLP2016

Document Rectification

Fourier Document Restoration for Robust Document Dewarping and Recognition -CVPR2022, bai song database
Document Dewarping with Control Points -ICDAR2021, code&dataset
Document Rectification and Illumination Correction using a Patch-based CNN -SIGGRAPH2019, code

Graph

Joint stroke classification and text line grouping in online handwritten documents with edge pooling attention networks -PR2021
A Comprehensive Survey on Graph Neural Networks -TNN2020
Contextual Stroke Classification in Online Handwritten Documents with Edge Graph Attention Networks -SNCS2020
Deepgcns: Can gcns go as deep as cnns? -ICCV2019
Heterogeneous graph attention network -WWW2019
Contextual Stroke Classification in Online Handwritten Documents with Graph Attention Networks -ICDAR2019
Graph Convolutional Networks for Text Classification -AAAI2019
Graph Attention Networks -ICLR2018
Semi-Supervised Classification with Graph Convolutional Networks -ICLR2017

super resolution

Real-esrgan: Training real-world blind super-resolution with pure synthetic data -tencent, ICCV2021, code
Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices -alibaba, ACMMM2021, code
SplitSR: An End-to-End Approach to Super-Resolution on Mobile Devices -arxiv2021, code
Extremely Lightweight Quantization Robust Real-Time Single-Image Super Resolution for Mobile Devices -CVPR2021, code

deblur

Towards Efficient and Scale-Robust Ultra-High-Definition Image Demoireing -TCL, ECCV2022, database/code
Global-Local Stepwise Generative Network for Ultra High-Resolution Image Restoration -arxiv2022
A Survey on Deep learning based Document Image Enhancement -arxiv2021
NTIRE 2021 challenge for defocus deblurring using dual-pixel images: Methods and results -CVPR2021, code
Multi-Stage Progressive Image Restoration -google, CVPR2021, code
Learning frequency domain priors for image demoireing -PAMI2021, code
Moiré Attack (MA): A New Potential Risk of Screen Photos -NeurIPS2021, code
Image demoireing with learnable bandpass filters -CVPR2020, code
WDNet: Watermark-Decomposition Network for Visible Watermark Removal -baixiang, WACV2021, database/code
High Resolution Demoire Network -ICIP2020, code
BEDSR-Net: A Deep Shadow Removal Network From a Single Document Image -CVPR2020, code