MachineLearningDOC: A repository from mowayao

A curated overview of algorithms for images, faces, OCR, and speech

1. General Object Detection/Recognition

2. Specific Object Detection, Recognition, and Retrieval (Specific Object Detection/CBIR)

3. Object Tracking

4. Object Segmentation

5. Face Detection

6. Face Alignment (Facial Landmarks)

7. Face Recognition

8. Face Reconstruction

9. OCR (Optical Character Recognition)

10. Speech Recognition (Automatic Speech Recognition/Speech to Text)

11. Speaker Recognition (Speaker Recognition/Identification/Verification)

12. Speaker Diarization

13. Speech Synthesis (Text To Speech)

14. Voice Conversion

15. Facial Biometrics (Age Gender)

1. General Object Detection/Recognition

Traditional methods:

  1. Bag-of-Words models: SIFT/SURF + KMeans + SVM
  2. Sparse coding: LLC
  3. Aggregated features: Fisher Vector/VLAD
  4. Deformable part models: DPM, which uses HOG features and a Latent SVM
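
The first pipeline in the list above (local descriptors, a k-means visual vocabulary, then a histogram per image) can be sketched as follows. This is only a minimal illustration in pure Python: short toy vectors stand in for real SIFT/SURF descriptors, the helper names `kmeans` and `bow_histogram` are made up for this example, and the final SVM stage is left out.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, d):
    """Index of the visual word (cluster center) closest to descriptor d."""
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k], d))


def kmeans(descriptors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers (the vocabulary)."""
    rng = random.Random(seed)
    centers = rng.sample(descriptors, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for d in descriptors:
            buckets[nearest(centers, d)].append(d)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centers


def bow_histogram(descriptors, centers):
    """Normalized count of how often each visual word occurs in one image."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest(centers, d)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

The resulting fixed-length histogram is what a classifier such as an SVM would be trained on, one histogram per image.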

Related papers:

  1. Visual Object Recognition, Kristen Grauman
  2. Locality-constrained Linear Coding for Image Classification 
  3. Fisher Kernels on Visual Vocabularies for Image Categorization
  4. Improving the Fisher Kernel for Large-Scale Image Classification 
  5. Aggregating local descriptors into a compact image representation
  6. Object Detection with Discriminatively Trained Part Based Models

Related open-source repositories:

Deep learning:

RCNN/SPPNet/Faster RCNN, the YOLO series, SSD, R-FCN, RetinaNet, CFENet

Related papers:

1. Rich feature hierarchies for accurate object detection and semantic segmentation
2. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
3. Fast R-CNN
4. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
5. You Only Look Once: Unified, Real-Time Object Detection
6. YOLO9000: Better, Faster, Stronger
7. YOLOv3: An Incremental Improvement
8. SSD: Single Shot MultiBox Detector
9. R-FCN: Object Detection via Region-based Fully Convolutional Networks
10. Focal Loss for Dense Object Detection
11. CFENet: An Accurate and Efficient Single-Shot Object Detector for Autonomous Driving

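Related open-source repositories:

2. Specific Object Detection, Recognition, and Retrieval (Specific Object Detection/CBIR)

Specific-object recognition targets a single reference image, so there is no large sample set to train on; in practice no training or learning step is involved. Most approaches rely on hand-crafted features such as keypoints. For rigid objects, the keypoint matches can be passed through SVD and RANSAC to estimate an affine transformation matrix, from which the orientation of the object's outline is determined. There are also neural-network approaches such as R-MAC and NetVLAD, but they all use pretrained models and are not rotation invariant.
Keypoint matching: descriptors compared by Euclidean distance include SIFT/SURF; descriptors compared by Hamming distance include AKAZE/FREAK. Euclidean-distance search can use a KD-Tree or other algorithms such as hnsw and Falconn; Hamming-distance search uses LSH.
Fisher Vector/VLAD based: the vectors are converted to binary codes with random hyperplanes so that retrieval can run on Hamming distance.
Retrieval: open-source libraries for Euclidean-distance search include hnsw, Falconn, and Faiss.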
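
The random-hyperplane conversion mentioned above can be illustrated with a small sketch. Each bit of the code records which side of a random hyperplane the vector falls on, so vectors separated by a small angle tend to agree on most bits. The function names and the brute-force `search` are assumptions made for this example; a real system would use a library such as Faiss rather than a linear scan.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(vec, planes):
    """Pack the sign of each projection into one integer, one bit per plane."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def search(query, database, planes):
    """Index of the database vector whose code is closest to the query's."""
    q = binary_code(query, planes)
    return min(range(len(database)),
               key=lambda i: hamming(q, binary_code(database[i], planes)))
```

Because only the sign of each projection is kept, the codes reflect the angle between vectors rather than their length, which is why this scheme is normally applied to normalized descriptors.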

Related papers:

Aggregating Deep Convolutional Features for Image Retrieval
PARTICULAR OBJECT RETRIEVAL WITH INTEGRAL MAX-POOLING OF CNN ACTIVATIONS

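Related open-source repositories:

3. Object Tracking

Optical flow, Kalman filtering, and mean-shift tracking are all implemented in OpenCV. Most of these target rigid objects and are poorly suited to objects such as faces.
Deep learning methods:
CFNet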
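
To make the Kalman filter listed above more concrete, here is a minimal sketch of a one-dimensional constant-velocity tracker in pure Python. It is not OpenCV's implementation: the state layout `[position, velocity]`, the noise values `q` and `r`, and the function names are assumptions chosen for illustration.

```python
def kalman_step(x, P, z, dt=1.0, q=1e-3, r=1.0):
    """One predict + update cycle for state x = [position, velocity]."""
    pos, vel = x
    (p00, p01), (p10, p11) = P

    # Predict: x' = F x, P' = F P F^T + Q, with F = [[1, dt], [0, 1]].
    pos_p = pos + dt * vel
    vel_p = vel
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11 + q

    # Update with a position-only measurement, H = [1, 0].
    innovation = z - pos_p
    s = a00 + r                  # innovation variance
    k0, k1 = a00 / s, a10 / s    # Kalman gain
    x_new = [pos_p + k0 * innovation, vel_p + k1 * innovation]
    P_new = [[(1 - k0) * a00, (1 - k0) * a01],
             [a10 - k1 * a00, a11 - k1 * a01]]
    return x_new, P_new


def track(measurements, dt=1.0):
    """Run the filter over a sequence of measured positions."""
    x, P = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
    history = []
    for z in measurements:
        x, P = kalman_step(x, P, z, dt)
        history.append(tuple(x))
    return history
```

Although only positions are measured, the filter also recovers the velocity, which is what lets a tracker predict where the object will be in the next frame.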

Related papers:

End-to-end representation learning for Correlation Filter based tracking

Related open-source repositories:
- https://github.com/bertinetto/cfnet

4. Object Segmentation

The current mainstream methods are all based on neural networks.
FCN, SegNet, PSPNet, Mask R-CNN, the DeepLab series, RefineNet, DeeperLab

Related papers:

1. Fully Convolutional Networks for Semantic Segmentation
2. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
3. Pyramid Scene Parsing Network
4. Mask R-CNN
5. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
6. Rethinking Atrous Convolution for Semantic Image Segmentation
7. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
8. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation
9. DeeperLab: Single-Shot Image Parser
10. MobileNetV2: Inverted Residuals and Linear Bottlenecks

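Related open-source repositories:

5. Face Detection

Traditional methods: feature extraction plus a classifier

Common features include HOG and Haar; common classifiers include AdaBoost, SVM, and cascades.
Commonly used open-source libraries include OpenCV and Dlib.

Deep learning:

MTCNN, PyramidBox, HR, Face R-CNN, SSH, RSA, S3FD, FaceBoxes

Related papers: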

1. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
2. PyramidBox: A Context-assisted Single Shot Face Detector.
3. Finding Tiny Faces
4. Face R-CNN
5. SSH: Single Stage Headless Face Detector
6. Recurrent Scale Approximation for Object Detection in CNN
7. S3FD: Single Shot Scale-invariant Face Detector
8. FaceBoxes: A CPU Real-time Face Detector with High Accuracy

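Related open-source repositories:

6. Face Alignment (Facial Landmarks)

Some face detection algorithms integrate facial landmark alignment; during training, the loss functions of the two tasks are combined as a weighted sum. Alignment comes in 2D and 3D variants: 2D uses only two-dimensional information, whereas 3D requires a three-dimensional model and can also predict the pose of the face.
2D landmark alignment: DCNN, MTCNN, TCDCN, LAB
3D landmark alignment: 3DDFA, DenseReg, FAN, PRNet, PIPA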
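
The weighted sum of the detection and alignment losses described above might look like the following sketch. The choice of binary cross-entropy for face/non-face and mean squared error for landmarks, as well as the weights `w_det` and `w_lmk`, are assumptions for illustration; each paper defines its own terms and weights.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for a predicted face probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def landmark_mse(pred, target):
    """Mean squared error over flattened landmark coordinates."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)


def joint_loss(face_prob, is_face, lmk_pred, lmk_true, w_det=1.0, w_lmk=0.5):
    """Weighted sum of the detection loss and the alignment loss."""
    loss = w_det * bce(face_prob, is_face)
    if is_face:  # landmarks are only defined for samples that contain a face
        loss += w_lmk * landmark_mse(lmk_pred, lmk_true)
    return loss
```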

Related papers:

1. Facial Landmark Detection by Deep Multi-task Learning
2. Deep Convolutional Network Cascade for Facial Point Detection
3. Look at Boundary: A Boundary-Aware Face Alignment Algorithm
4. Face Alignment Across Large Poses: A 3D Solution
5. Pose-Invariant Face Alignment via CNN-Based Dense 3D Model Fitting
6. Dense Face Alignment
7. DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild
8. How far are we from solving the 2D & 3D Face Alignment problem
9. Learning Dense Facial Correspondences in Unconstrained Images
10. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network

Related open-source repositories:

7. Face Recognition

Non-neural-network: GaussianFace
Deep learning: most of the work centers on loss function design
DeepFace, the DeepID series, VGGFace, FaceNet, CenterLoss, MarginalLoss, SphereFace, ArcFace, AMSoftmax
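
As an example of the loss-function design mentioned above, the sketch below computes ArcFace-style logits: the embedding and class weights are L2-normalized, an angular margin `m` is added to the angle of the true class, and the cosines are scaled by `s`. It is a simplified illustration in pure Python; the default values `s=64.0` and `m=0.5` are assumptions here, and the handling of angles that exceed pi after adding the margin is omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def arcface_logits(embedding, class_weights, label, s=64.0, m=0.5):
    """Scaled cosine logits with an additive angular margin on the true class."""
    e = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos_t = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos_t = max(-1.0, min(1.0, cos_t))  # guard acos against rounding
        if j == label:
            cos_t = math.cos(math.acos(cos_t) + m)
        logits.append(s * cos_t)
    return logits


def cross_entropy(logits, label):
    """Softmax cross-entropy, computed with the log-sum-exp trick."""
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[label]
```

The margin lowers the true-class logit, so the network has to pull embeddings of the same identity closer together in angle to reach the same loss.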

Related papers:

1. Surpassing Human-Level Face Verification Performance on LFW with GaussianFace
2. DeepFace: Closing the Gap to Human-Level Performance in Face Verification
3. Deep Learning Face Representation from Predicting 10,000 Classes
4. Deep Learning Face Representation by Joint Identification-Verification
5. DeepID3: Face Recognition with Very Deep Neural Networks
6. Deep Face Recognition
7. FaceNet: A Unified Embedding for Face Recognition and Clustering
8. A Discriminative Feature Learning Approach for Deep Face Recognition
9. Marginal Loss for Deep Face Recognition
10. SphereFace: Deep Hypersphere Embedding for Face Recognition
11. ArcFace: Additive Angular Margin Loss for Deep Face Recognition
12. Additive Margin Softmax for Face Verification

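Related open-source repositories:

8. Face Reconstruction

Almost all methods are 3D-based. Once a face has been reconstructed, it can be used for pose estimation and for face swapping. Some face-swapping algorithms need multiple face images to train a GAN.
PRNet, VRN, Face2Face

Related papers: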

1. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications
2. 3D Face Reconstruction with Geometry Details from a Single Image
3. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network
4. CNN-based Real-time Dense Face Reconstruction with Inverse-rendered Photo-realistic Face Images
5. Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression
6. Deep Video Portraits
7. VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track
8. paGAN: Real-time Avatars Using Dynamic Textures
9. On Face Segmentation, Face Swapping, and Face Perception
10. Extreme 3D Face Reconstruction: Looking Past Occlusions

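Related open-source repositories:

9. OCR (Optical Character Recognition)

OCR involves locating and segmenting text in a scene, followed by character recognition. The traditional approach segments characters with a vertical projection histogram and then feeds each character to a classifier one at a time. Since the introduction of the CTC dynamic-programming algorithm, the mainstream model has been LSTM+CTC, which aligns outputs to the input automatically without explicit segmentation, much as in speech recognition. Detection boxes are generally horizontal; correcting skewed text additionally requires a Hough transform to estimate and fix the text orientation.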
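
The traditional projection-histogram segmentation described above can be sketched in a few lines. The example assumes a binarized text line given as rows of 0/1 pixels (1 meaning ink); the function names and the `min_ink` threshold are illustrative choices.

```python
def column_profile(image):
    """Amount of ink in each pixel column of a binary text-line image."""
    return [sum(col) for col in zip(*image)]


def segment_characters(image, min_ink=1):
    """Return (start, end) column spans separated by empty columns."""
    spans, start = [], None
    for x, ink in enumerate(column_profile(image)):
        if ink >= min_ink and start is None:
            start = x                    # a character begins
        elif ink < min_ink and start is not None:
            spans.append((start, x))     # a character ends at a gap
            start = None
    if start is not None:                # character touching the right edge
        spans.append((start, len(image[0])))
    return spans
```

Each span would then be cropped and passed to a per-character classifier. The approach breaks down when characters touch or overlap, which is one reason segmentation-free models took over.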
Text region detection: CTPN, TextBoxes++, AdvancedEast

Related papers:

1. Detecting Text in Natural Image with Connectionist Text Proposal Network
2. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
3. Single Shot Scene Text Retrieval
4. EAST: An Efficient and Accurate Scene Text Detector
5. DeepTextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework
6. Recursive Recurrent Nets with Attention Modeling for OCR in the Wild
7. Multi-Oriented Text Detection with Fully Convolutional Networks
8. Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network
9. Overview: https://github.com/whitelok/image-text-localization-recognition

Character recognition: CRNN, GRCNN

Related papers:

1. Gated Recurrent Convolution Neural Network for OCR
2. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

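Related open-source repositories:

10. Speech Recognition (Automatic Speech Recognition/Speech to Text)

Traditional approaches are based on the GMM-HMM model and the Viterbi algorithm.
Deep learning: MFCC short-time spectral features are extracted from the WAV audio and then modeled with a CNN followed by an LSTM recurrent network, trained with the CTC loss. Examples: GRU-CTC, DFCNN, DFSMN, DeepSpeech, CLDNN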
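
A CTC-trained network emits one label distribution per frame, including a special blank label. The sketch below shows the simplest way to turn those frames into text at inference time, greedy (best-path) decoding: take the most likely label per frame, merge consecutive repeats, then drop blanks. It illustrates decoding only, not the CTC loss itself, and placing the blank at index 0 is an assumption of this example.

```python
BLANK = 0  # index reserved for the CTC blank label (an assumption here)


def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path decoding: argmax per frame, merge repeats, remove blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for k in best:
        if k != prev and k != BLANK:
            out.append(alphabet[k])
        prev = k
    return "".join(out)
```

A blank between two identical labels is what allows genuinely doubled characters to survive the merge step.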

Related papers:

1. FULLY SUPERVISED SPEAKER DIARIZATION
2. SPEAKER DIARIZATION WITH LSTM
3. S4D: Speaker Diarization Toolkit in Python

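Related open-source repositories:

13. Speech Synthesis (Text To Speech)

For text to speech, the traditional method is concatenative synthesis of speech units, which yields stiff speech without natural intonation. Baidu, Google, Facebook, and others have since published many deep-learning methods. The usual pipeline is an encoder followed by a decoder, with a final stage that uses the Griffin-Lim algorithm or an autoregressive WaveNet model to turn the predicted acoustic features (MFCC, or mel spectrograms as in Tacotron 2) into a waveform. Examples: the WaveNet series (MFCC --> WAVE), the DeepVoice series, the Tacotron series, VoiceLoop, ClariNet

Related papers: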

1. VOICELOOP: VOICE FITTING AND SYNTHESIS VIA A PHONOLOGICAL LOOP
2. TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
3. NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS
4. Deep Voice: Real-time Neural Text-to-Speech
5. Deep Voice 2: Multi-Speaker Neural Text-to-Speech
6. DEEP VOICE 3: 2000-SPEAKER NEURAL TEXT-TO-SPEECH
7. WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
8. Parallel WaveNet: Fast High-Fidelity Speech Synthesis
9. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
10. SAMPLE EFFICIENT ADAPTIVE TEXT-TO-SPEECH

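Related open-source repositories:

14. Voice Conversion

Voice conversion is essentially multi-speaker TTS: the same text is rendered as a different waveform depending on the speaker. Most systems add a speaker embedding vector to the network architecture, as in DeepVoice2/DeepVoice3 and Tacotron2; some even add it to the vocoder, for example WaveNet.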

Related open-source repositories:

15. Facial Biometrics (Age and Gender Estimation)

The classic DEX model, and the compact SSR-Net model
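
The idea behind DEX (Deep EXpectation) can be shown with a short sketch: age estimation is treated as classification over discrete age bins, and the final prediction is the expected value of the softmax distribution instead of the single most likely bin. The example assumes one bin per year starting at 0, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(logits, ages=None):
    """Expected value of the age distribution predicted by the classifier."""
    ages = list(range(len(logits))) if ages is None else ages
    return sum(p * a for p, a in zip(softmax(logits), ages))
```

Taking the expectation yields a continuous estimate, so a model that is torn between two neighboring bins predicts an age between them.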

Related papers:

1. DEX: Deep EXpectation of apparent age from a single image
2. Age Progression/Regression by Conditional Adversarial Autoencoder
3. SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation
4. Deep Regression Forests for Age Estimation

Related open-source repositories:
