In this project we implement a system that uses a CNN to detect objects in a picture with the SSD (Single-Shot MultiBox Detector) algorithm.
SSD offers both fast detection and good accuracy.
Design and implement a system that uses the SSD algorithm to detect objects in a picture or video in real time.
- Object localization (we will see how to combine classification and regression).
- Broaden our scope from object localization to object detection.
- Efficient implementation of sliding windows.
- Problem of scale: objects look larger or smaller depending on their distance from the point where the picture was taken, and we will see how to deal with this.
- SSD architecture for industrial usage
- Modify the SSD algorithm to work on videos.
- Jaccard index (IoU) and non-max suppression.
In object localization, we don't just want to know what is in the image, we also want to know where it is.
To achieve this, we put five logistic regression outputs on top of the ResNet: one for the class and one each for the x center, y center, height, and width of the bounding box.
Our loss function has three parts:
- Binary cross-entropy on p(object | image): tells us whether there is an object in the image at all.
- Categorical cross-entropy on p(class 1 | image), p(class 2 | image), ..., p(class k | image): tells us which class the object belongs to.
- MSE on the four regression outputs for the bounding box (cx, cy, height, width). This term should not contribute to the loss when there is no object in the image.
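As a rough illustration of how the three parts could be combined for a single image, here is a minimal NumPy sketch. The function name `localization_loss`, the equal weighting of the three terms, and the choice to also mask the class term when there is no object are assumptions made for this example, not something the project specifies.

```python
import numpy as np

def localization_loss(p_obj, class_probs, box_pred, has_obj, class_id, box_true, eps=1e-7):
    """Three-part loss for one image (hypothetical helper, equal weights assumed).

    p_obj:       predicted probability that an object is present
    class_probs: predicted class distribution, length k
    box_pred:    predicted (cx, cy, height, width)
    has_obj:     1 if the image contains an object, else 0
    class_id:    index of the true class (ignored when has_obj == 0)
    box_true:    true (cx, cy, height, width) (ignored when has_obj == 0)
    """
    p_obj = np.clip(p_obj, eps, 1 - eps)
    # Binary cross-entropy: is there an object at all?
    bce = -(has_obj * np.log(p_obj) + (1 - has_obj) * np.log(1 - p_obj))
    # Categorical cross-entropy: which class is it?
    cce = -np.log(np.clip(class_probs[class_id], eps, 1.0))
    # Mean squared error over the four box coordinates.
    mse = np.mean((np.asarray(box_pred, dtype=float) - np.asarray(box_true, dtype=float)) ** 2)
    # The class and box terms are switched off when there is no object.
    return float(bce + has_obj * (cce + mse))
```

With `has_obj = 0` only the binary cross-entropy term remains, so a wrong box prediction on an empty image is not penalized.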
Object detection is a generalized version of object localization: an image may contain zero or several objects. The goal is to detect all of them and draw a bounding box around each one.
- Worth thinking about: what kind of data structures do we need?
- A CNN must output a fixed set of numbers
- But an image may have 0 objects, or it may have 50. How can it output the right numbers in all cases?
- Naive strategy: in a loop,
  - Look for the object with the highest class confidence
  - Output its p(class | image), cx, cy, height, width
  - Erase that object from the image
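The naive loop has the same greedy shape as non-max suppression, which the outline lists together with the Jaccard index (IoU): repeatedly take the most confident detection, then discard whatever overlaps it too much. The sketch below works on predicted boxes rather than on the image itself. The function names, the `(cx, cy, height, width)` tuple layout, and the 0.5 threshold are choices made for this example.

```python
def to_corners(box):
    """Convert (cx, cy, height, width) to (x1, y1, x2, y2)."""
    cx, cy, h, w = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(box_a, box_b):
    """Jaccard index: intersection area divided by union area."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(50, 50, 40, 40), (52, 52, 40, 40), (150, 150, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap anything and is kept.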
Sliding window technique: take a window of some size and, for each position in the original image, pass the corresponding sub-image to the CNN. A major problem with this method is its low speed, O(N^2), because every window position needs its own pass through the CNN. To solve this problem we use the convolution operation.
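To make the cost concrete, here is a small sketch of the plain sliding window, before any convolutional speed-up. The image size, window size, and stride are arbitrary values chosen for the example.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (top, left, crop) for every window position in the image."""
    height, width = image.shape[:2]
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left, image[top:top + window, left:left + window]

image = np.zeros((300, 300, 3), dtype=np.uint8)   # placeholder image
crops = list(sliding_windows(image, window=64, stride=16))
print(len(crops))  # 225 crops, each of which would need its own CNN pass
```

Even with a fairly coarse stride, one window size on a 300x300 image already means 225 separate passes through the network.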
SSD: the main idea is that a single pass of the image through the CNN gives the same result as the sliding window, which is why it is called single-shot. A further advantage of this algorithm is that there is no need to tell the CNN which regions may contain objects.
Some objects appear very small because of their distance from the camera. How can we solve this problem?
The general pattern of a CNN is that the image shrinks as it passes through each layer, so the features being found go from small to big. The idea is to attach a mini neural network to intermediate layers of a pre-trained network. Object detection is then done separately for each of these outputs.
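The shrinking can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that each stage halves the spatial size (a stride of 2 with "same" padding) and reports how many input pixels one feature-map cell spans at each stage. The helper name and the numbers are not taken from any specific SSD configuration.

```python
def feature_map_sizes(input_size, num_stages):
    """Return (map_size, input_pixels_per_cell) after each stride-2 stage."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size = (size + 1) // 2            # halve, rounding up
        sizes.append((size, input_size / size))
    return sizes

for map_size, pixels in feature_map_sizes(256, 5):
    print(f"{map_size:>3} x {map_size:<3} cells, each spanning about {pixels:.0f} input pixels")
```

Early, large feature maps have cells that span only a few pixels and suit small objects, while later, small feature maps have cells that span large regions and suit big objects. That is the motivation for attaching a detector to several intermediate layers.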
- Window size: a picture contains objects of different sizes and shapes, for example people are tall and cars are wide, so what size should the window be?
- Two objects might appear in the same window, with one occluding the other.
- Different orientations of an object: for example, a person may be lying down.
The solution: instead of one window, use several default boxes at each position. For each box we try to detect an object by passing it through our CNN.
We not only look at the image at multiple scales, but also apply each default box to each window at each scale.
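A minimal sketch of how such default boxes could be laid out is shown below. It places one box per aspect ratio at the center of every cell of a square feature map, in coordinates normalized to [0, 1]. The feature-map sizes, scales, and aspect ratios are illustrative assumptions, not the values of a trained SSD model.

```python
import itertools
import math

def default_boxes(map_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, height, width) boxes, one per aspect ratio per cell."""
    boxes = []
    for row, col in itertools.product(range(map_size), repeat=2):
        cx = (col + 0.5) / map_size       # cell center, normalized
        cy = (row + 0.5) / map_size
        for ratio in aspect_ratios:       # ratio = width / height
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            boxes.append((cx, cy, height, width))
    return boxes

# Hypothetical (feature-map size, box scale) pairs: fine maps get small boxes.
all_boxes = []
for map_size, scale in [(8, 0.2), (4, 0.5), (2, 0.8)]:
    all_boxes.extend(default_boxes(map_size, scale))
print(len(all_boxes))  # (64 + 16 + 4) cells * 3 ratios = 252
```

A ratio of 2.0 gives a wide box (for example, a car) and 0.5 gives a tall one (for example, a standing person), which addresses the window-size question raised above.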
- Download the tensorflow/models repository:
git clone https://github.com/tensorflow/models.git
- Start a notebook inside the research/object_detection folder
- Install Protocol Buffers (Windows):
conda install -c anaconda protobuf
To verify the installation, run:
protoc --version
- Run this from the "research" folder:
protoc object_detection/protos/*.proto --python_out=.
- Example command for an image:
python main.py --content image --path "./sea.jpg"
- Example command for a video:
python main.py --content video --path "./traffic.mp4"