deepinsight/insightface

MobileFaceNet training pipeline

nttstar opened this issue · 44 comments

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Can you share the MobileNet v2 training pipeline?

What is the accuracy on LFW and AgeDB after training with softmax? Can you share the training log?

Hi, can I ask in this thread?
Which weight initialization did you use during network creation? Xavier or something else?
I'm a newbie to MXNet, trying to reproduce your result in Torch7.

I used MXNet to compute the cosine distance between fc1 outputs, but the result is wrong. The model was downloaded from the Baidu cloud link above, and the two test pictures are of two different people (a man and a woman), already aligned with MTCNN in the same way as the LFW pictures.

```python
#coding=utf-8
import mxnet as mx
import numpy as np
import cv2
from collections import namedtuple

Batch = namedtuple('Batch', ['data'])

image_size = (112, 112)
batch_size = 2

def load_model(model_prefix):
    # Load the checkpoint and keep only the graph up to the fc1 embedding output.
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_prefix, 0)
    all_layers = sym.get_internals()
    sym = all_layers['fc1_output']
    model = mx.mod.Module(symbol=sym, label_names=None)
    model.bind(data_shapes=[('data', (batch_size, 3, image_size[0], image_size[1]))])
    model.set_params(arg_params, aux_params)
    return model

def dis(x, y):
    # Cosine similarity between two embeddings.
    return np.dot(x, y) / np.linalg.norm(x) / np.linalg.norm(y)

def test(model_prefix):
    img_path_1 = "./img_test/41.jpg"
    img_path_2 = "./img_test/31.jpg"
    model = load_model(model_prefix)
    # Read, convert BGR -> RGB, and resize to the network input size.
    img1 = cv2.cvtColor(cv2.imread(img_path_1), cv2.COLOR_BGR2RGB)
    img1 = cv2.resize(img1, (112, 112), interpolation=cv2.INTER_CUBIC)
    img2 = cv2.cvtColor(cv2.imread(img_path_2), cv2.COLOR_BGR2RGB)
    img2 = cv2.resize(img2, (112, 112), interpolation=cv2.INTER_CUBIC)
    # HWC -> CHW
    img1 = np.transpose(img1, axes=(2, 0, 1))
    img2 = np.transpose(img2, axes=(2, 0, 1))
    data_batch = np.array([img1, img2])
    print(data_batch.shape)
    print(img2.shape)
    model.forward(Batch([mx.nd.array(data_batch)]))
    prob = model.get_outputs()[0].asnumpy()
    print(dis(prob[0], prob[1]))

model_prefix = "../../models/model"
test(model_prefix)
```

Here is the output:

[00:19:53] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v1.0.0. Attempting to upgrade... [00:19:53] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
(2, 3, 112, 112)
(3, 112, 112)
-0.9996472

Could you tell me what I have missed? @nttstar

Why did you think the result was wrong?

If the images were already aligned, why did you resize them again in your code?

I have just cropped the images by the bounding boxes, so I need to resize them to the input shape. I have found your code in the deploy dir, and I am analyzing my mistakes by comparing my code with yours. Thank you a lot!

**The model I got is too big**
I used the command:
CUDA_VISIBLE_DEVICES='0' python -u train_softmax.py --network y1 --ckpt 2 --loss-type 0 --lr-steps 120000,140000 --wd 0.00004 --fc7-wd-mult 10 --per-batch-size 512 --emb-size 128 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../models/MobileFaceNet/model-y1-softmax
to get my model, but the model file is almost 40 MB. I have no idea why my model is so much bigger than yours. Please help me.

@BUAA-21Li
Your model is too big because of the last FC layer (before the softmax layer).

@BUAA-21Li use deploy/model_slim.py to delete the last layer.
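For what it's worth, here is a minimal sketch of what such slimming amounts to in MXNet (this is not the deploy/model_slim.py source; the checkpoint prefix and epoch are hypothetical): load the checkpoint, keep only the graph up to the fc1 embedding output, and drop the class-count-sized fc7 weights that the softmax loss needed during training.

```python
import mxnet as mx

# Hypothetical checkpoint prefix and epoch; point these at your own model files.
prefix, epoch = '../models/MobileFaceNet/model-y1-softmax', 0

sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
# Truncate the graph at the 128-d embedding output.
sym = sym.get_internals()['fc1_output']
# Keep only the parameters the truncated graph still references,
# which discards the large fc7 classification weight.
arg_params = {k: v for k, v in arg_params.items() if k in sym.list_arguments()}
aux_params = {k: v for k, v in aux_params.items() if k in sym.list_auxiliary_states()}
mx.model.save_checkpoint(prefix + '-slim', 0, sym, arg_params, aux_params)
```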

Why did you pre-train a model with softmax loss before training MobileFaceNet with ArcFace loss, while training the other networks from scratch?

@wayen820 Thanks! I have solved it!

**Now we get higher accuracy using my modified MobileNet network:**

[lfw][12000]Accuracy-Flip: 0.99617+-0.00358
[agedb_30][12000]Accuracy-Flip: 0.96017+-0.00893

@youyicloud Is your problem solved? My code is similar to yours, and the cosine distances between samples are all around -0.99, no matter whether they are positive or negative pairs.

@BUAA-21Li You can use deploy/test.py and load the MobileFaceNet model; then you can use the cosine distance or the Euclidean distance. It outputs the right answer~
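As a side note, the two distances mentioned above are interchangeable once the embeddings are L2-normalized; a minimal sketch, with hypothetical random vectors standing in for two fc1 outputs:

```python
import numpy as np

# Hypothetical 128-d embeddings standing in for two fc1 outputs.
a = np.random.randn(128).astype(np.float32)
b = np.random.randn(128).astype(np.float32)

# L2-normalize before comparing.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cos_sim = float(np.dot(a_n, b_n))
eucl_sq = float(np.sum((a_n - b_n) ** 2))

# On unit-norm vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(cos_sim, eucl_sq, 2 - 2 * cos_sim)
```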

@youyicloud Thank you for your reply. Have you analyzed why your code failed to get the correct result?

In the article, you have reported results for LResNet100E-IR (for m=0.5):
LFW: 99.83, CFP-FP: 94.04, AgeDB-30: 98.08

With the Mobilenet (m=?) you report the accuracies:
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

What is the expected accuracy drop of this model on MegaFace Challenge 1 (Table 9 from the article)?

Thanks for your code. Recently I was trying to reproduce the MobileFaceNet model following your instructions, but I encountered some problems as follows; would you please give me some hints? (P.S. the training dataset combined faces_ms1m_112x112 with my private dataset, prepared using scripts like "im2rec.py", "face2rec2.py" and "dataset_merge.py".)


root@656688c713aa:/proj/insightface/src# CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir ../datasets/xl_marked --network y1 --loss-type 0 --prefix ../mobile_facenet --per-batch-size 128 --lr-steps "240000,360000,440000" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 381
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/xl_marked', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=0, lr=0.1, lr_steps='240000,360000,440000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.1, margin_s=32.0, max_steps=140002, mom=0.9, network='y1', num_classes=381, num_layers=1, per_batch_size=128, power=1.0, prefix='../mobile_facenet', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='E', version_se=0, version_unit=3, wd=4e-05)
init mobilefacenet 1
('version_output:', 'E')
Traceback (most recent call last):
  File "train_softmax.py", line 488, in <module>
    main()
  File "train_softmax.py", line 485, in main
    train_net(args)
  File "train_softmax.py", line 334, in train_net
    sym, arg_params, aux_params = get_symbol(args, arg_params, aux_params)
  File "train_softmax.py", line 170, in get_symbol
    embedding = fmobilefacenet.get_symbol(args.emb_size, bn_mom = args.bn_mom, version_output=args.version_output)
  File "symbols/fmobilefacenet.py", line 51, in get_symbol
    assert version_output=='GDC' or version_output=='GNAP'
AssertionError


@EdwardChou add "--version-output GNAP" to the arguments

@shangleyi Thanks for the reply. After appending "--version-output GNAP" to the arguments and running again, another error popped up, even though I am using the correct input size, namely 112*112 input images. This is pretty weird.

expected [3,160,160], got [3,112,112]

The complete log is as follows:

root@656688c713aa:/proj/insightface/src# CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir ../datasets/marked_face_crop --network y1 --loss-type 0 --prefix ../mobile_facenet --per-batch-size 128 --lr-steps "240000,360000,440000" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002 --version-output GNAP
gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 381
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/marked_face_crop', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=0, lr=0.1, lr_steps='240000,360000,440000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.1, margin_s=32.0, max_steps=140002, mom=0.9, network='y1', num_classes=381, num_layers=1, per_batch_size=128, power=1.0, prefix='../mobile_facenet', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='GNAP', version_se=0, version_unit=3, wd=4e-05)
init mobilefacenet 1
('version_output:', 'GNAP')
INFO:root:loading recordio ../datasets/marked_face_crop/train.rec...
header0 label [  9369.  18696.]
id2range 9327
9368
rand_mirror 1
lr_steps [240000, 360000, 440000]
call reset()
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mxnet/python/mxnet/io.py", line 396, in prefetch_func
    self.next_batch[i] = self.iters[i].next()
  File "/proj/insightface/src/image_iter.py", line 215, in next
    batch_data[i][:] = self.postprocess_data(datum)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 437, in __setitem__
    self._set_nd_basic_indexing(key, value)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 691, in _set_nd_basic_indexing
    value.copyto(self)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 1876, in copyto
    return _internal._copyto(self, out=other)
  File "<string>", line 25, in _copyto
  File "/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/mxnet/python/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [13:43:04] src/operator/nn/./../tensor/../elemwise_op_common.h:123: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 0-th output: expected [3,160,160], got [3,112,112]

Stack trace returned 10 entries:
[bt] (0) /mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f5416c1559a]
[bt] (1) /mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f5416c16138]
[bt] (2) /mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseAttr<nnvm::TShape, &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, &mxnet::op::shape_string[abi:cxx11], -1, -1>(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, nnvm::TShape const&)::{lambda(std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, unsigned long, char const*)#1}::operator()(std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, unsigned long, char const*) const+0xbf1) [0x7f5416e6da61]
[bt] (3) /mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<1, 1>(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*)+0x24a) [0x7f5416e6ff7a]
[bt] (4) /mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0xb4d) [0x7f54191c0e1d]
[bt] (5) /mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x35f) [0x7f5419198d8f]
[bt] (6) /mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0xe7b) [0x7f541968d4eb]
[bt] (7) /mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x3ff) [0x7f541968ecaf]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5494337e40]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f54943378ab]



[13:43:06] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/mxnet/python/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125). Is this intended?
  optimizer_params=optimizer_params)
Killed

@EdwardChou How did you prepare train.rec?

Hi, @shangleyi
This is my way to generate train.rec.

cd PROJ_DIR/src/data

Download im2rec.py and modify the script following #265.

# 160*160*3 -> 112*112*3
python im2rec.py --list --resize 112 --recursive ./my_data IMG_DIR

echo "100,112,112" > property

Modify the line to "with open('IMG_DIR' + fullpath, 'rb') as fin:"

python face2rec2.py  . 

# Move generated dataset to PROJ_DIR/datasets/MY_DATASET
python dataset_merge.py --include "../../datasets/faces_ms1m_112x112/,../../datasets/MY_DATASET/" --output "../../datasets/MY_MERGE_DATASET/"

@EdwardChou I used face2rec2.py directly without using im2rec.py and it worked. Maybe you should write a script that resizes the images (see the sketch below) and then use face2rec2.py directly. I'm not so sure about im2rec.py.
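For example, a minimal resize sketch with OpenCV (the directory names are hypothetical, and the 160x160 -> 112x112 sizes are taken from the shape error above):

```python
import os
import cv2

# Hypothetical directories: 160x160 crops in, 112x112 crops out.
src_dir, dst_dir = './my_data_160', './my_data_112'

for root, _, files in os.walk(src_dir):
    for name in files:
        if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        img = cv2.imread(os.path.join(root, name))
        if img is None:
            continue
        img = cv2.resize(img, (112, 112), interpolation=cv2.INTER_CUBIC)
        # Mirror the source folder structure (one folder per identity).
        out_dir = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        cv2.imwrite(os.path.join(out_dir, name), img)
```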

training dataset: ms1m, ms1m-v2, private dataset
lfw: 99.583, cfp_fp: 95.357, agedb_30: 96.533
training process: https://github.com/shangleyi/insightface-training-note/blob/master/README.md

@shangleyi Thank you so much. My problem was exactly that the resize function in im2rec.py doesn't work, so I resized the images with another script. The training procedure following the instructions above currently looks good. You saved my day!

Is there any training file for Caffe? I want to train with Caffe.

dataset: emore
network backbone: mobilefacenet + GNAP block
loss function: arcface(m=0.5)
training pipeline: finetune (lr drop at 100K, 140K, 160K), batch-size:512
epoch 52: LFW-99.60%, CFP-FP-93.46%, AgeDB-95.45%

Hi, @nttstar I encountered something strange when fine-tuning the MobileFaceNet model (2nd step of the 2-step pipeline) and would like to ask for your help. My training acc got stuck at 0.51~0.53 while the accuracy on lfw and agedb-30 reaches 95%. Similar to #187

My fine-tuning parameters are:

Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/x', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=4, lr=0.1, lr_steps='100000,140000,160000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.5, margin_s=64.0, max_steps=0, mom=0.9, network='y1', num_classes=94491, num_layers=1, per_batch_size=128, power=1.0, prefix='../xz/xz_mobile_facenet', pretrained='../xz_mobile_facenet,70', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='GNAP', version_se=0, version_unit=3, wd=4e-05)

and the result is like:

 INFO:root:Epoch[145] Batch [1780]   Speed: 851.07 samples/sec   acc=0.529687
 INFO:root:Epoch[145] Batch [1800]   Speed: 866.48 samples/sec   acc=0.529980
 INFO:root:Epoch[145] Batch [1820]   Speed: 725.38 samples/sec   acc=0.519043
 INFO:root:Epoch[145] Batch [1840]   Speed: 919.19 samples/sec   acc=0.527051
 INFO:root:Epoch[145] Batch [1860]   Speed: 996.87 samples/sec   acc=0.525586
 INFO:root:Epoch[145] Batch [1880]   Speed: 1021.45 samples/sec  acc=0.521094
 lr-batch-epoch: 0.0001 1894 145
 testing verification..
 (12000, 128)
 infer time 39.693939
 [lfw][1082000]XNorm: 11.132285
 [lfw][1082000]Accuracy-Flip: 0.99517+-0.00398
 testing verification..
 (14000, 128)
 infer time 42.053231
 [cfp_fp][1082000]XNorm: 9.771846
 [cfp_fp][1082000]Accuracy-Flip: 0.88900+-0.02205
 testing verification..
 (12000, 128)
 infer time 34.666512
 [agedb_30][1082000]XNorm: 11.260081
 [agedb_30][1082000]Accuracy-Flip: 0.95383+-0.00796
 saving 541

I have seen your training log attached on baiduyun; it shows the acc of your model reaching 0.5 after 15 epochs, which matches my experiment. Yet your log stops at epoch 24, when the highest acc reaches 0.55. Did you run further experiments to reach higher accuracy? Or is there something wrong with the calculation of the training acc? Looking forward to your help, thanks.

Hi guys,
for the first step in the training pipeline, usually how many epochs do you need to get a reasonable accuracy on LFW, such as 99%?
I trained for a long time, but the accuracy is always around 91%.

clhne commented

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

@nttstar
With your configuration, how long did the training take to reach an accuracy like LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91?

clhne commented

Hi guys,
for the first step in the training pipeline, usually how many epochs do you need to get a reasonable accuracy on LFW, such as 99%?
I trained for a long time, but the accuracy is always around 91%.

@karlTUM
CPU: E5-2650 v4
GPU: 2x RTX 2080 Ti
Epoch 15, batch_size 32, lr 0.001

INFO:root:Epoch[15] Batch [32040-32060] Speed: 274.12 samples/sec acc=0.865625
INFO:root:Epoch[15] Batch [32060-32080] Speed: 272.38 samples/sec acc=0.839063
INFO:root:Epoch[15] Batch [32080-32100] Speed: 272.94 samples/sec acc=0.855469
INFO:root:Epoch[15] Batch [32100-32120] Speed: 272.41 samples/sec acc=0.839063
INFO:root:Epoch[15] Batch [32120-32140] Speed: 272.01 samples/sec acc=0.852344
INFO:root:Epoch[15] Batch [32140-32160] Speed: 267.44 samples/sec acc=0.855469
INFO:root:Epoch[15] Batch [32160-32180] Speed: 273.78 samples/sec acc=0.853125
INFO:root:Epoch[15] Batch [32180-32200] Speed: 274.96 samples/sec acc=0.851562
INFO:root:Epoch[15] Batch [32200-32220] Speed: 273.08 samples/sec acc=0.842187
INFO:root:Epoch[15] Batch [32220-32240] Speed: 273.76 samples/sec acc=0.849219
lr-batch-epoch: 0.0001 32249 15
testing verification..
(12000, 512)
infer time 25.010638999999994
[lfw][924000]XNorm: 23.051082
[lfw][924000]Accuracy-Flip: 0.99700+-0.00296
testing verification..
(14000, 512)
infer time 29.09600100000001
[cfp_fp][924000]XNorm: 23.878208
[cfp_fp][924000]Accuracy-Flip: 0.92786+-0.01553
testing verification..
(12000, 512)
infer time 24.954134000000025
[agedb_30][924000]XNorm: 23.627240
[agedb_30][924000]Accuracy-Flip: 0.97650+-0.01031
saving 462

clhne commented

512

Similar issue.
On my side, at epoch 17 the acc has already reached 0.9, but the improvement after that is very slow.
May I ask:

  1. Which CPU model, and how many?
  2. Which GPU model, and how many cards?

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

@nttstar
How is the log file generated automatically? Thanks~

Hi, @shangleyi
This is my way to generate train.rec.

cd PROJ_DIR/src/data

Download im2rec.py and modify the script following #265.

# 160*160*3 -> 112*112*3
python im2rec.py --list --resize 112 --recursive ./my_data IMG_DIR

echo "100,112,112" > property

Modify the line to "with open('IMG_DIR' + fullpath, 'rb') as fin:"

python face2rec2.py  . 

# Move generated dataset to PROJ_DIR/datasets/MY_DATASET
python dataset_merge.py --include "../../datasets/faces_ms1m_112x112/,../../datasets/MY_DATASET/" --output "../../datasets/MY_MERGE_DATASET/"

Hi, have you managed to get a correct merged dataset?
We also tried to merge two datasets, faces_emore and faces_glint, with dataset_merge.py using the following command:
python dataset_merge.py --include /home/ti/Downloads/DATASETS/faces_emore,/home/ti/Downloads/DATASETS/faces_glint --output /home/ti/Downloads/DATASETS/merge --model /home/ti/Downloads/insightface/models/model-r100-ii/model,0
But after the merge completed, the resulting dataset had the same property file and the same .rec and .idx sizes as the faces_emore dataset.
What is wrong with our parameters?

Thank you!

It has been a year and I can hardly remember what I did, but did you try adding the quotation marks?

Trained MobileFaceNet on emore; here is the result:

Called with argument: Namespace(batch_size=224, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ce_loss=False, ckpt=1, color=0, ctx_num=1, cutoff=0, data_dir='../datasets/faces_emore', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_size='112,112', image_w=112, images_filter=0, loss_type=4, lr=0.1, lr_steps='200000,280000,320000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.5, margin_s=64.0, max_steps=0, mom=0.9, network='y1', num_classes=85742, num_layers=1, per_batch_size=224, power=1.0, prefix='../models/y1-arcface-emore/model', pretrained='../models/y1-softmax-emore/model,234', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_multiplier=1.0, version_output='E', version_se=0, version_unit=3, wd=4e-05)

testing verification..
(12000, 128)
infer time 5.607243
[lfw][346000]XNorm: 11.406996
[lfw][346000]Accuracy-Flip: 0.99600+-0.00442
testing verification..
(14000, 128)
infer time 6.47071
[cfp_fp][346000]XNorm: 9.418514
[cfp_fp][346000]Accuracy-Flip: 0.94729+-0.01445
testing verification..
(12000, 128)
infer time 5.542683
[agedb_30][346000]XNorm: 11.237676
[agedb_30][346000]Accuracy-Flip: 0.96300+-0.00942

What does Accuracy-Flip mean? Does it have to do with using features of flipped images during training (as described in one of the MobileFace papers)?
Or with flipping during post-processing while calculating the embedding distance?
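For context, flip-based test-time augmentation is usually computed roughly as sketched below; treating the Accuracy-Flip numbers above as this kind of flip-augmented verification is my assumption, not something confirmed in this thread. embed_fn is a hypothetical callable returning the network's feature vector for an image.

```python
import numpy as np

def flip_augmented_embedding(embed_fn, img):
    # embed_fn: hypothetical callable mapping an HWC image to a 1-D feature vector.
    # Combine the features of the image and its horizontal mirror, then L2-normalize.
    emb = embed_fn(img) + embed_fn(img[:, ::-1, :])
    return emb / np.linalg.norm(emb)
```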

@nttstar Hello, how can I fine-tune on my own data? Could you provide a pretrained model that keeps the fc7 layer?

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Your max-steps is 140002 (140K), but you said 120K; and lr-steps is 240000 (240K), 360000 (360K), ... Which is right?

dataset: emore
network backbone: mobilefacenet + GNAP block
loss function: arcface(m=0.5)
training pipeline: finetune (lr drop at 100K, 140K, 160K), batch-size:512
epoch 52: LFW-99.60%, CFP-FP-93.46%, AgeDB-95.45%
@erichouyi

What's your acc on the training data?

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Which version of ms1m did you use? I trained MobileFaceNet with the ms1m-refine-v1 dataset and the same config (except that I used 2 GPUs with per_batch_size=256), but the maximum accuracy on LFW in 180K iterations was 0.99400.

@bahar3474 Hello, excuse me, where is the train_softmax file? There is no such file in the new version's branch.

@CasonTsai
Hi. I used this version of code:
https://github.com/deepinsight/insightface/blob/08265c749a7af6f1d7e9057df55a3eb2b171ddcb/src/train_softmax.py
Two months ago they refined the repo structure, and I don't know where you can find it in the new version.