deepinsight/insightface

MobileFaceNet training pipeline

nttstar opened this issue · 44 comments

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Can you share the MobileNet v2 training pipeline?

What is the accuracy on LFW and AgeDB after training with softmax? Can you share the training log?

Hi, can I ask in this thread?
Which weight initialization did you use during network creation? Xavier or something else?
I'm a newbie to MXNet, trying to reproduce your result in Torch7.

I used MXNet to compute the cosine distance between fc1 outputs, but the result is wrong. The model was downloaded from the Baidu cloud link above, and the two test pictures are of two different people (a man and a woman), already aligned with MTCNN in the same way as the LFW pictures.

```python
#coding=utf-8
import mxnet as mx
import numpy as np
import cv2
from collections import namedtuple

Batch = namedtuple('Batch', ['data'])

image_size = (112, 112)
batch_size = 2

def load_model(model_prefix):
    # Load the checkpoint and keep only the graph up to the fc1 embedding output.
    sym, arg_params, aux_params = mx.model.load_checkpoint(model_prefix, 0)
    all_layers = sym.get_internals()
    sym = all_layers['fc1_output']
    model = mx.mod.Module(symbol=sym, label_names=None)
    model.bind(data_shapes=[('data', (batch_size, 3, image_size[0], image_size[1]))])
    model.set_params(arg_params, aux_params)
    return model

def dis(x, y):
    # Cosine similarity between two embeddings.
    return np.dot(x, y) / np.linalg.norm(x) / np.linalg.norm(y)

def test(model_prefix):
    img_path_1 = "./img_test/41.jpg"
    img_path_2 = "./img_test/31.jpg"
    model = load_model(model_prefix)
    # Read, convert BGR -> RGB, and resize to the network input size.
    img1 = cv2.cvtColor(cv2.imread(img_path_1), cv2.COLOR_BGR2RGB)
    img1 = cv2.resize(img1, (112, 112), interpolation=cv2.INTER_CUBIC)
    img2 = cv2.cvtColor(cv2.imread(img_path_2), cv2.COLOR_BGR2RGB)
    img2 = cv2.resize(img2, (112, 112), interpolation=cv2.INTER_CUBIC)
    # HWC -> CHW
    img1 = np.transpose(img1, axes=(2, 0, 1))
    img2 = np.transpose(img2, axes=(2, 0, 1))
    data_batch = np.array([img1, img2])
    print(data_batch.shape)
    print(img2.shape)
    model.forward(Batch([mx.nd.array(data_batch)]))
    prob = model.get_outputs()[0].asnumpy()
    print(dis(prob[0], prob[1]))

model_prefix = "../../models/model"
test(model_prefix)
```

Here is the output:

[00:19:53] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v1.0.0. Attempting to upgrade... [00:19:53] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
(2, 3, 112, 112)
(3, 112, 112)
-0.9996472

Could you tell me what I have missed? @nttstar

Why did you think the result was wrong?

If the images were already aligned, why did you resize them again in your code?

I have just cropped the images by the bounding boxes, so I need to resize them to the input shape. I have found your code in the deploy dir, and I am analyzing my mistakes by comparing my code with yours. Thank you a lot!

**The model I got is too big**
I used the command:
CUDA_VISIBLE_DEVICES='0' python -u train_softmax.py --network y1 --ckpt 2 --loss-type 0 --lr-steps 120000,140000 --wd 0.00004 --fc7-wd-mult 10 --per-batch-size 512 --emb-size 128 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../models/MobileFaceNet/model-y1-softmax
to get my model, but the model file is almost 40 MB. I have no idea why my model is so much bigger than yours. Please help me.

@BUAA-21Li
Your model is too big because of the last FC layer (before the softmax layer).

@BUAA-21Li use deploy/model_slim.py to delete the last layer.
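For what it's worth, here is a minimal sketch of what such slimming amounts to in MXNet (this is not the deploy/model_slim.py source; the checkpoint prefix and epoch are hypothetical): load the checkpoint, keep only the graph up to the fc1 embedding output, and drop the class-count-sized fc7 weights that the softmax loss needed during training.

```python
import mxnet as mx

# Hypothetical checkpoint prefix and epoch; point these at your own model files.
prefix, epoch = '../models/MobileFaceNet/model-y1-softmax', 0

sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
# Truncate the graph at the 128-d embedding output.
sym = sym.get_internals()['fc1_output']
# Keep only the parameters the truncated graph still references,
# which discards the large fc7 classification weight.
arg_params = {k: v for k, v in arg_params.items() if k in sym.list_arguments()}
aux_params = {k: v for k, v in aux_params.items() if k in sym.list_auxiliary_states()}
mx.model.save_checkpoint(prefix + '-slim', 0, sym, arg_params, aux_params)
```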

Why did you pre-train a model with softmax loss before training MobileFaceNet with ArcFace loss, while training the other networks from scratch?

@wayen820 Thanks! I have solved it!

**Now we get higher accuracy using my modified MobileNet network:**

[lfw][12000]Accuracy-Flip: 0.99617+-0.00358
[agedb_30][12000]Accuracy-Flip: 0.96017+-0.00893

@youyicloud Is your problem solved? My code is similar to yours, and the cosine distances between samples are all around -0.99, no matter whether they are positive or negative pairs.

@BUAA-21Li You can use deploy/test.py and load the MobileFaceNet model; then you can use the cosine distance or the Euclidean distance. It outputs the right answer~
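As a side note, the two distances mentioned above are interchangeable once the embeddings are L2-normalized; a minimal sketch, with hypothetical random vectors standing in for two fc1 outputs:

```python
import numpy as np

# Hypothetical 128-d embeddings standing in for two fc1 outputs.
a = np.random.randn(128).astype(np.float32)
b = np.random.randn(128).astype(np.float32)

# L2-normalize before comparing.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cos_sim = float(np.dot(a_n, b_n))
eucl_sq = float(np.sum((a_n - b_n) ** 2))

# On unit-norm vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(cos_sim, eucl_sq, 2 - 2 * cos_sim)
```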

@youyicloud Thank you for your reply. Have you analyzed why your code failed to get the correct result?

In the article, you have reported results for LResNet100E-IR (for m=0.5):
LFW: 99.83, CFP-FP: 94.04, AgeDB-30: 98.08

With the Mobilenet (m=?) you report the accuracies:
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

What is the expected accuracy drop of this model on MegaFace Challenge 1 (Table 9 from the article)?

Thanks for your code. Recently I was trying to reproduce the MobileFaceNet model following your instructions, but I encountered some problems as follows; would you please give me some hints? (P.S. the training dataset combined faces_ms1m_112x112 with my private dataset, prepared using scripts like "im2rec.py", "face2rec2.py" and "dataset_merge.py".)


root@656688c713aa:/proj/insightface/src# CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir ../datasets/xl_marked --network y1 --loss-type 0 --prefix ../mobile_facenet --per-batch-size 128 --lr-steps "240000,360000,440000" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 381
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/xl_marked', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=0, lr=0.1, lr_steps='240000,360000,440000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.1, margin_s=32.0, max_steps=140002, mom=0.9, network='y1', num_classes=381, num_layers=1, per_batch_size=128, power=1.0, prefix='../mobile_facenet', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='E', version_se=0, version_unit=3, wd=4e-05)
init mobilefacenet 1
('version_output:', 'E')
Traceback (most recent call last):
  File "train_softmax.py", line 488, in <module>
    main()
  File "train_softmax.py", line 485, in main
    train_net(args)
  File "train_softmax.py", line 334, in train_net
    sym, arg_params, aux_params = get_symbol(args, arg_params, aux_params)
  File "train_softmax.py", line 170, in get_symbol
    embedding = fmobilefacenet.get_symbol(args.emb_size, bn_mom = args.bn_mom, version_output=args.version_output)
  File "symbols/fmobilefacenet.py", line 51, in get_symbol
    assert version_output=='GDC' or version_output=='GNAP'
AssertionError


@EdwardChou add "--version-output GNAP" to the arguments

@shangleyi Thanks for the reply. After appending "--version-output GNAP" to the arguments and running again, another error popped up, even though I am using the correct input size, namely 112*112 input images. This is pretty weird.

expected [3,160,160], got [3,112,112]

The complete log is as follows:

root@656688c713aa:/proj/insightface/src# CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir ../datasets/marked_face_crop --network y1 --loss-type 0 --prefix ../mobile_facenet --per-batch-size 128 --lr-steps "240000,360000,440000" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002 --version-output GNAP
gpu num: 4
num_layers 1
image_size [112, 112]
num_classes 381
Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/marked_face_crop', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=0, lr=0.1, lr_steps='240000,360000,440000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.1, margin_s=32.0, max_steps=140002, mom=0.9, network='y1', num_classes=381, num_layers=1, per_batch_size=128, power=1.0, prefix='../mobile_facenet', pretrained='', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='GNAP', version_se=0, version_unit=3, wd=4e-05)
init mobilefacenet 1
('version_output:', 'GNAP')
INFO:root:loading recordio ../datasets/marked_face_crop/train.rec...
header0 label [  9369.  18696.]
id2range 9327
9368
rand_mirror 1
lr_steps [240000, 360000, 440000]
call reset()
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mxnet/python/mxnet/io.py", line 396, in prefetch_func
    self.next_batch[i] = self.iters[i].next()
  File "/proj/insightface/src/image_iter.py", line 215, in next
    batch_data[i][:] = self.postprocess_data(datum)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 437, in __setitem__
    self._set_nd_basic_indexing(key, value)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 691, in _set_nd_basic_indexing
    value.copyto(self)
  File "/mxnet/python/mxnet/ndarray/ndarray.py", line 1876, in copyto
    return _internal._copyto(self, out=other)
  File "<string>", line 25, in _copyto
  File "/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/mxnet/python/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [13:43:04] src/operator/nn/./../tensor/../elemwise_op_common.h:123: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 0-th output: expected [3,160,160], got [3,112,112]

Stack trace returned 10 entries:
[bt] (0) /mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f5416c1559a]
[bt] (1) /mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f5416c16138]
[bt] (2) /mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseAttr<nnvm::TShape, &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, &mxnet::op::shape_string[abi:cxx11], -1, -1>(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, nnvm::TShape const&)::{lambda(std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, unsigned long, char const*)#1}::operator()(std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, unsigned long, char const*) const+0xbf1) [0x7f5416e6da61]
[bt] (3) /mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<1, 1>(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*)+0x24a) [0x7f5416e6ff7a]
[bt] (4) /mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0xb4d) [0x7f54191c0e1d]
[bt] (5) /mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x35f) [0x7f5419198d8f]
[bt] (6) /mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0xe7b) [0x7f541968d4eb]
[bt] (7) /mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x3ff) [0x7f541968ecaf]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5494337e40]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f54943378ab]



[13:43:06] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/mxnet/python/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125). Is this intended?
  optimizer_params=optimizer_params)
Killed

@EdwardChou How did you prepare train.rec?

Hi, @shangleyi
This is my way to generate train.rec.

cd PROJ_DIR/src/data

Download im2rec.py and modify the script following #265.

# 160*160*3 -> 112*112*3
python im2rec.py --list --resize 112 --recursive ./my_data IMG_DIR

echo "100,112,112" > property

Modify the line to "with open('IMG_DIR' + fullpath, 'rb') as fin:"

python face2rec2.py  . 

# Move generated dataset to PROJ_DIR/datasets/MY_DATASET
python dataset_merge.py --include "../../datasets/faces_ms1m_112x112/,../../datasets/MY_DATASET/" --output "../../datasets/MY_MERGE_DATASET/"

@EdwardChou I used face2rec2.py directly without using im2rec.py and it worked. Maybe you should write a script that resizes the images (see the sketch below) and then use face2rec2.py directly. I'm not so sure about im2rec.py.
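For example, a minimal resize sketch with OpenCV (the directory names are hypothetical, and the 160x160 -> 112x112 sizes are taken from the shape error above):

```python
import os
import cv2

# Hypothetical directories: 160x160 crops in, 112x112 crops out.
src_dir, dst_dir = './my_data_160', './my_data_112'

for root, _, files in os.walk(src_dir):
    for name in files:
        if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        img = cv2.imread(os.path.join(root, name))
        if img is None:
            continue
        img = cv2.resize(img, (112, 112), interpolation=cv2.INTER_CUBIC)
        # Mirror the source folder structure (one folder per identity).
        out_dir = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        cv2.imwrite(os.path.join(out_dir, name), img)
```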

training dataset: ms1m, ms1m-v2, private dataset
lfw: 99.583, cfp_fp: 95.357, agedb_30: 96.533
training process: https://github.com/shangleyi/insightface-training-note/blob/master/README.md

@shangleyi Thank you so much. My problem was exactly that the resize function in im2rec.py doesn't work, so I resized the images with another script. The training procedure following the instructions above currently looks good. You saved my day!

Is there any training file for Caffe? I want to train with Caffe.

dataset: emore
network backbone: mobilefacenet + GNAP block
loss function: arcface(m=0.5)
training pipeline: finetune (lr drop at 100K, 140K, 160K), batch-size:512
epoch 52: LFW-99.60%, CFP-FP-93.46%, AgeDB-95.45%

Hi, @nttstar I encountered something strange when fine-tuning the MobileFaceNet model (2nd step of the 2-step pipeline) and would like to ask for your help. My training acc got stuck at 0.51~0.53 while the accuracy on lfw and agedb-30 reaches 95%. Similar to #187

My fine-tuning parameters are:

Called with argument: Namespace(batch_size=512, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ckpt=2, ctx_num=4, cutoff=0, data_dir='../datasets/x', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_w=112, loss_type=4, lr=0.1, lr_steps='100000,140000,160000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.5, margin_s=64.0, max_steps=0, mom=0.9, network='y1', num_classes=94491, num_layers=1, per_batch_size=128, power=1.0, prefix='../xz/xz_mobile_facenet', pretrained='../xz_mobile_facenet,70', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_output='GNAP', version_se=0, version_unit=3, wd=4e-05)

and the result is like:

 INFO:root:Epoch[145] Batch [1780]   Speed: 851.07 samples/sec   acc=0.529687
 INFO:root:Epoch[145] Batch [1800]   Speed: 866.48 samples/sec   acc=0.529980
 INFO:root:Epoch[145] Batch [1820]   Speed: 725.38 samples/sec   acc=0.519043
 INFO:root:Epoch[145] Batch [1840]   Speed: 919.19 samples/sec   acc=0.527051
 INFO:root:Epoch[145] Batch [1860]   Speed: 996.87 samples/sec   acc=0.525586
 INFO:root:Epoch[145] Batch [1880]   Speed: 1021.45 samples/sec  acc=0.521094
 lr-batch-epoch: 0.0001 1894 145
 testing verification..
 (12000, 128)
 infer time 39.693939
 [lfw][1082000]XNorm: 11.132285
 [lfw][1082000]Accuracy-Flip: 0.99517+-0.00398
 testing verification..
 (14000, 128)
 infer time 42.053231
 [cfp_fp][1082000]XNorm: 9.771846
 [cfp_fp][1082000]Accuracy-Flip: 0.88900+-0.02205
 testing verification..
 (12000, 128)
 infer time 34.666512
 [agedb_30][1082000]XNorm: 11.260081
 [agedb_30][1082000]Accuracy-Flip: 0.95383+-0.00796
 saving 541

I have seen your training log attached on baiduyun; it shows the acc of your model reaching 0.5 after 15 epochs, which matches my experiment. Yet your log stops at epoch 24, when the highest acc reaches 0.55. Did you run further experiments to reach higher accuracy? Or is there something wrong with the calculation of the training acc? Looking forward to your help, thanks.

Hi guys,
for the first step in the training pipeline, usually how many epochs do you need to get a reasonable accuracy on LFW, such as 99%?
I trained for a long time, but the accuracy is always around 91%.

clhne commented

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

@nttstar
With your configuration, how long did the training take to reach an accuracy like LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91?

clhne commented

Hi guys,
for the first step in the training pipeline, usually how many epochs do you need to get a reasonable accuracy on LFW, such as 99%?
I trained for a long time, but the accuracy is always around 91%.

@karlTUM
CPU: E5-2650 v4
GPU: 2x RTX 2080 Ti
Epoch 15, batch_size 32, lr 0.001

INFO:root:Epoch[15] Batch [32040-32060] Speed: 274.12 samples/sec acc=0.865625
INFO:root:Epoch[15] Batch [32060-32080] Speed: 272.38 samples/sec acc=0.839063
INFO:root:Epoch[15] Batch [32080-32100] Speed: 272.94 samples/sec acc=0.855469
INFO:root:Epoch[15] Batch [32100-32120] Speed: 272.41 samples/sec acc=0.839063
INFO:root:Epoch[15] Batch [32120-32140] Speed: 272.01 samples/sec acc=0.852344
INFO:root:Epoch[15] Batch [32140-32160] Speed: 267.44 samples/sec acc=0.855469
INFO:root:Epoch[15] Batch [32160-32180] Speed: 273.78 samples/sec acc=0.853125
INFO:root:Epoch[15] Batch [32180-32200] Speed: 274.96 samples/sec acc=0.851562
INFO:root:Epoch[15] Batch [32200-32220] Speed: 273.08 samples/sec acc=0.842187
INFO:root:Epoch[15] Batch [32220-32240] Speed: 273.76 samples/sec acc=0.849219
lr-batch-epoch: 0.0001 32249 15
testing verification..
(12000, 512)
infer time 25.010638999999994
[lfw][924000]XNorm: 23.051082
[lfw][924000]Accuracy-Flip: 0.99700+-0.00296
testing verification..
(14000, 512)
infer time 29.09600100000001
[cfp_fp][924000]XNorm: 23.878208
[cfp_fp][924000]Accuracy-Flip: 0.92786+-0.01553
testing verification..
(12000, 512)
infer time 24.954134000000025
[agedb_30][924000]XNorm: 23.627240
[agedb_30][924000]Accuracy-Flip: 0.97650+-0.01031
saving 462

clhne commented

512

Similar issue.
On my side, at epoch 17 the acc has already reached 0.9, but the improvement after that is very slow.
May I ask:

  1. Which CPU model, and how many?
  2. Which GPU model, and how many cards?

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

@nttstar
How is the log file generated automatically? Thanks~

Hi, @shangleyi
This is my way to generate train.rec.

cd PROJ_DIR/src/data

Download im2rec.py and modify the script following #265.

# 160*160*3 -> 112*112*3
python im2rec.py --list --resize 112 --recursive ./my_data IMG_DIR

echo "100,112,112" > property

Modify the line to "with open('IMG_DIR' + fullpath, 'rb') as fin:"

python face2rec2.py  . 

# Move generated dataset to PROJ_DIR/datasets/MY_DATASET
python dataset_merge.py --include "../../datasets/faces_ms1m_112x112/,../../datasets/MY_DATASET/" --output "../../datasets/MY_MERGE_DATASET/"

Hi, have you managed to get a correct merged dataset?
We also tried to merge two datasets, faces_emore and faces_glint, with dataset_merge.py using the following command:
python dataset_merge.py --include /home/ti/Downloads/DATASETS/faces_emore,/home/ti/Downloads/DATASETS/faces_glint --output /home/ti/Downloads/DATASETS/merge --model /home/ti/Downloads/insightface/models/model-r100-ii/model,0
But after the merge completed, the resulting dataset had the same property file and the same .rec and .idx sizes as the faces_emore dataset.
What is wrong with our parameters?

Thank you!

It has been a year and I can hardly remember what I did, but did you try adding the quotation marks?

Trained MobileFaceNet on emore; here is the result:

Called with argument: Namespace(batch_size=224, beta=1000.0, beta_freeze=0, beta_min=5.0, bn_mom=0.9, ce_loss=False, ckpt=1, color=0, ctx_num=1, cutoff=0, data_dir='../datasets/faces_emore', easy_margin=0, emb_size=128, end_epoch=100000, fc7_lr_mult=1.0, fc7_no_bias=False, fc7_wd_mult=10.0, gamma=0.12, image_channel=3, image_h=112, image_size='112,112', image_w=112, images_filter=0, loss_type=4, lr=0.1, lr_steps='200000,280000,320000', margin=4, margin_a=1.0, margin_b=0.0, margin_m=0.5, margin_s=64.0, max_steps=0, mom=0.9, network='y1', num_classes=85742, num_layers=1, per_batch_size=224, power=1.0, prefix='../models/y1-arcface-emore/model', pretrained='../models/y1-softmax-emore/model,234', rand_mirror=1, rescale_threshold=0, scale=0.9993, target='lfw,cfp_fp,agedb_30', use_deformable=0, verbose=2000, version_act='prelu', version_input=1, version_multiplier=1.0, version_output='E', version_se=0, version_unit=3, wd=4e-05)

testing verification..
(12000, 128)
infer time 5.607243
[lfw][346000]XNorm: 11.406996
[lfw][346000]Accuracy-Flip: 0.99600+-0.00442
testing verification..
(14000, 128)
infer time 6.47071
[cfp_fp][346000]XNorm: 9.418514
[cfp_fp][346000]Accuracy-Flip: 0.94729+-0.01445
testing verification..
(12000, 128)
infer time 5.542683
[agedb_30][346000]XNorm: 11.237676
[agedb_30][346000]Accuracy-Flip: 0.96300+-0.00942

What does Accuracy-Flip mean? Does it have to do with using features of flipped images during training (as described in one of the MobileFace papers)?
Or with flipping during post-processing while calculating the embedding distance?
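For context, flip-based test-time augmentation is usually computed roughly as sketched below; treating the Accuracy-Flip numbers above as this kind of flip-augmented verification is my assumption, not something confirmed in this thread. embed_fn is a hypothetical callable returning the network's feature vector for an image.

```python
import numpy as np

def flip_augmented_embedding(embed_fn, img):
    # embed_fn: hypothetical callable mapping an HWC image to a 1-D feature vector.
    # Combine the features of the image and its horizontal mirror, then L2-normalize.
    emb = embed_fn(img) + embed_fn(img[:, ::-1, :])
    return emb / np.linalg.norm(emb)
```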

@nttstar Hello, how can I fine-tune on my own data? Could you provide a pretrained model that keeps the fc7 layer?

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Your max-steps is 140002 (140K), but you said 120K; and lr-steps is 240000 (240K), 360000 (360K), ... Which is right?

dataset: emore
network backbone: mobilefacenet + GNAP block
loss function: arcface(m=0.5)
training pipeline: finetune (lr drop at 100K, 140K, 160K), batch-size:512
epoch 52: LFW-99.60%, CFP-FP-93.46%, AgeDB-95.45%
@erichouyi

What's your acc on the training data?

My 2-stage pipeline:

  1. Train softmax with lr=0.1 for 120K iterations.
LRSTEPS='240000,360000,440000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 0 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 32.0 --margin-m 0.1 --ckpt 2 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --max-steps 140002
  2. Switch to ArcFace loss to do normal training with '100K,140K,160K' iterations.
LRSTEPS='100000,140000,160000'
CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train_softmax.py --data-dir $DATA_DIR --network "$NETWORK" --loss-type 4 --prefix "$PREFIX" --per-batch-size 128 --lr-steps "$LRSTEPS" --margin-s 64.0 --margin-m 0.5 --ckpt 1 --emb-size 128 --fc7-wd-mult 10.0 --wd 0.00004 --pretrained '../models2/model-y1-test/model,70'

Pretrained model: baiduyun
training dataset: ms1m
LFW: 99.50, CFP_FP: 88.94, AgeDB30: 95.91

Which version of ms1m did you use? I trained MobileFaceNet with the ms1m-refine-v1 dataset and the same config (except that I used 2 GPUs with per_batch_size=256), but the maximum accuracy on LFW in 180K iterations was 0.99400.

@bahar3474 Hello, excuse me, where is the train_softmax file? There is no such file in the new version's branch.

@CasonTsai
Hi. I used this version of code:
https://github.com/deepinsight/insightface/blob/08265c749a7af6f1d7e9057df55a3eb2b171ddcb/src/train_softmax.py
Two months ago they refined the repo structure, and I don't know where you can find it in the new version.