Two of the big contributions include:
- Angular Softmax (A-Softmax) loss: The paper introduced a new loss function specifically tailored to the task of face recognition called the Angular Softmax (A-Softmax) loss. The innovation with this loss function was that it allowed the network to learn angularly distributed feature embeddings, making the representations more discrimative for face recognition tasks.
- Hypersphere Manifold: In relation to the above, the paper also introduced the notion of embedding the features into a hypersphere manifold, as opposed to a traditional Euclidean space, helping improve the separability and compactness of the embeddings.
The researchers note that for face recognition, the features learned by a softmax loss have an intrinsic angular distribution, thus Euclidean margin constraints were not the best choice, hence the motivation for the A-Softmax loss.
The A-Softmax loss is designed to incorporate an angular margin into the learning process, which introduces a geometric constraint. The angular margin is controlled by a parameter
In binary classification,
In multi-class classification,
For multi-class classification against the LFW dataset, the paper trained a 64-layer CNN using A-Softmax loss with an
Beside reporting accuracy, the paper also proposed and used the "Angular Fisher score" for evaluating the feature discriminativeness. It is defined as:
Where the within-class scatter value
And the between-class scatter value
Data loading and augmentation | 10% | ✅ |
Design of neural networks | 15% | ✅ |
Loss function | 10% | ✅ |
Training | 15% | ✅ |
Testing and results | 10% | ✅ |
Code format | 10% | ✅ |
For augmentation, the following transformations were used:
transforms.Compose(
[
transforms.Resize(96),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
]
)
The nural follows the 4-layer CNN architecture given in the paper:
class SphereCNN(nn.Module):
def __init__(self, class_num: int, feature=False):
super(SphereCNN, self).__init__()
self.class_num = class_num
self.feature = feature
self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=2)
self.conv2 = nn.Conv2d(
in_channels=32, out_channels=128, kernel_size=3, stride=2
)
self.conv3 = nn.Conv2d(
in_channels=128, out_channels=256, kernel_size=3, stride=2
)
self.conv4 = nn.Conv2d(
in_channels=256, out_channels=512, kernel_size=3, stride=2
)
self.fc5 = nn.Linear(512 * 5 * 5, 512)
self.angular = AngularPenaltySMLoss(512, self.class_num)
def forward(self, x, y):
x = F.relu(self.conv1(x))
x = F.relu(self.conv2(x))
x = F.relu(self.conv3(x))
x = F.relu(
self.conv4(x)
) # batch_size (0) * out_channels (1) * height (2) * width (3)
x = x.view(x.size(0), -1) # batch_size (0) * (out_channels * height * width)
x = self.fc5(x)
if self.feature:
return x
else:
x_angle = self.angular(x, y)
return x, x_angle
The loss function used is the A-Softmax loss:
class AngularPenaltySMLoss(nn.Module):
def __init__(self, in_features, out_features, eps=1e-7, m=None):
super(AngularPenaltySMLoss, self).__init__()
self.m = 4.0 if not m else m
self.in_features = in_features
self.out_features = out_features
self.fc = nn.Linear(in_features, out_features, bias=False)
self.eps = eps
def forward(self, x, labels):
"""
input shape (N, in_features)
"""
assert len(x) == len(labels)
assert torch.min(labels) >= 0
assert torch.max(labels) < self.out_features
for W in self.fc.parameters():
W = F.normalize(W, p=2, dim=1)
x = F.normalize(x, p=2, dim=1)
wf = self.fc(x)
numerator = torch.cos(
self.m
* torch.acos(
torch.clamp(
torch.diagonal(wf.transpose(0, 1)[labels]),
-1.0 + self.eps,
1 - self.eps,
)
)
)
excl = torch.cat(
[
torch.cat((wf[i, :y], wf[i, y + 1 :])).unsqueeze(0)
for i, y in enumerate(labels)
],
dim=0,
)
denominator = torch.exp(numerator) + torch.sum(torch.exp(excl), dim=1)
L = numerator - torch.log(denominator)
return -torch.mean(L)
The results of my training and testing can be found in the following table documenting a training run of 100 epochs, calculating the accuracy on the LFW test dataset, every 20 epochs:
Epoch | Loss | Test Acc. |
---|---|---|
04 | 6.836 | |
09 | 6.675 | |
14 | 6.516 | |
19 | 6.334 | 0.5840 |
24 | 6.154 | |
29 | 5.973 | |
34 | 5.798 | |
39 | 5.632 | 0.6230 |
44 | 5.457 | |
49 | 5.282 | |
54 | 5.123 | |
59 | 4.940 | 0.6520 |
64 | 4.766 | |
69 | 4.612 | |
74 | 4.449 | |
79 | 4.278 | 0.6510 |
84 | 4.123 | |
89 | 3.961 | |
94 | 3.782 | |
99 | 3.657 | 0.6490 |
For training, an
As you can see, overfitting appears to occur during the latter epochs.