Could anyone please show me the whole process to make your own dataset?
YokkaBear opened this issue · 18 comments
I am quite confused about how to build a dataset that can be used to train an ArcFace model from my own directory of face data, which has the following structure:
.
├── PersonA
│ ├── 001.jpeg
│ └── 002.jpeg
├── PersonB
│ ├── 001.jpeg
│ └── 002.jpeg
└── PersonC
├── 001.jpeg
├── 002.jpeg
└── 003.jpeg
3 directories, 7 files
It would be great if you could show me the steps in detail, in the following form:
- ...
- ...
- ...
...
Much appreciation and thanks!
To be honest, this is the most opaque dataset-creation procedure I've ever seen; no one, including the authors, has clearly or concisely explained how to build a complete, usable dataset of one's own. I'd really appreciate some clear guidance.
Hi @YokkaBear,
I think you have to look elsewhere, or read the papers of the LFW and YTF dataset creators, to get an understanding of how datasets are collected and refined. The issues here are mostly about the insightface project and not about any particular dataset, because the authors used merged or publicly available datasets (you can read their paper: https://arxiv.org/pdf/1801.07698.pdf).
But to clarify some points, you can follow these steps (regarding this project):
- You have to collect images of people: one folder per person (the images can be of different quality, with different lighting conditions, from different cameras, sources, etc.);
- You have to refine the folders you have: delete/merge duplicate folders (folders containing images of the same person), because duplicates will hurt the accuracy of your training;
- After the first two steps you should have one folder with subfolders, in the structure you provided above;
- Align your dataset so that images are 112x112: for this you can use facenet's alignment script (https://github.com/davidsandberg/facenet/blob/master/src/align/align_dataset_mtcnn.py) or, if it doesn't work for you, my script (https://github.com/Talgin/preparing_data/blob/master/align_dataset_mtcnn_v1.py), which is a revised version of the facenet script (some libraries were out of date, so I had to update them);
- Then divide your dataset into Train, Validation, and Test sets (we used an 80%/10%/10% ratio);
- Then create the .lst file for your dataset: you can use our script (https://github.com/Talgin/preparing_data/blob/master/insightface_pairs_gen_v1.py);
- Generate .rec and .idx files using: face2rec2.py (insightface/src/data/face2rec2.py);
- Generate pairs.txt;
- Using pairs.txt generate .bin file using lfw2pack.py (https://github.com/deepinsight/insightface/blob/master/src/data/lfw2pack.py) - bin file is needed for validation;
- Collect .rec, .idx, .bin, property files into one folder and start training.
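The .lst step above (one line per image: aligned flag, tab, absolute path, tab, integer label, with the label incremented per identity) can be sketched in a few lines. This is a hypothetical helper, not the linked script; the line format is inferred from a train.lst example later in this thread and may differ from what your version of face2rec2.py expects:

```python
import os

def write_lst(aligned_root, lst_path):
    """Write an InsightFace-style .lst file: "1\t<abs_path>\t<label>" per image.
    Labels are assigned incrementally per identity folder, in sorted order."""
    with open(lst_path, "w") as f:
        label = 0
        for person in sorted(os.listdir(aligned_root)):
            person_dir = os.path.join(aligned_root, person)
            if not os.path.isdir(person_dir):
                continue  # skip stray files at the top level
            for img in sorted(os.listdir(person_dir)):
                abs_path = os.path.abspath(os.path.join(person_dir, img))
                f.write("1\t%s\t%d\n" % (abs_path, label))
            label += 1
    return label  # number of identities, useful for the property file
```

The returned identity count is exactly the first number you later put into the property file.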
You can also read some info on my page (I'll rewrite/restructure it in 1-2 weeks): https://github.com/Talgin/preparing_data
P.S. The code provided above is my publicly available code, plus code shared on GitHub by other people.
Good luck! :)
Hi @Talgin ,
My greatest gratitude for your kind guidance! I have been going through these instructions step by step since early today, and I'm now at step 8. At this step I ran into a small problem with pairs.txt: how is it generated?
Going through your preparing_data project, I found that gen_pairs_lfw.py is responsible for generating pairs.txt; however, I couldn't find that file in either your project or the original insightface project. Could you show me how pairs.txt is generated? Many thanks again!
@YokkaBear
See this link; it might help you out:
https://github.com/VictorZhang2014/facenet/blob/master/mydata/generate_pairs.py
The following link is better and tested:
https://github.com/armanrahman22/Facial-Recognition-and-Alignment/blob/master/facenet_sandberg/generate_pairs.py
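If neither script fits your directory layout, the core logic of pair generation is small. Below is a hypothetical sketch (not taken from either repository) of sampling matched and mismatched pairs from a {person: image_indices} mapping; it assumes at least two identities and at least one identity with two or more images:

```python
import random

def generate_pairs(people, num_pairs, seed=0):
    """Sample verification pairs, LFW-style.
    people: dict mapping identity name -> list of image indices.
    Returns (matches, mismatches): matches are (name, i, j) with i < j,
    mismatches are (name1, i, name2, j) with name1 != name2."""
    rng = random.Random(seed)
    names = sorted(people)
    matches, mismatches = [], []
    while len(matches) < num_pairs:
        name = rng.choice(names)
        if len(people[name]) < 2:
            continue  # cannot form a matched pair from a single image
        i, j = rng.sample(people[name], 2)  # two distinct images
        matches.append((name, min(i, j), max(i, j)))
    while len(mismatches) < num_pairs:
        n1, n2 = rng.sample(names, 2)  # two distinct identities
        mismatches.append((n1, rng.choice(people[n1]), n2, rng.choice(people[n2])))
    return matches, mismatches
```

Seeding the generator keeps the validation split reproducible across runs.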
I have finally gone through all the steps mentioned by @Talgin; thank you a lot, sir.
However, when I use the .bin file for training, I ran into a confusing problem:
Traceback (most recent call last):
  File "train.py", line 380, in <module>
    main()
  File "train.py", line 377, in main
    train_net(args)
  File "train.py", line 372, in train_net
    epoch_end_callback = epoch_cb )
  File "/data/wangyoujia/anaconda2/envs/my_env37/lib/python3.6/site-packages/mxnet/module/base_module.py", line 560, in fit
    callback(batch_end_params)
  File "train.py", line 308, in _batch_callback
    acc_list = ver_test(mbatch)
  File "train.py", line 277, in ver_test
    acc1, std1, acc2, std2, xnorm, embeddings_list = verification.test(ver_list[i], model, args.batch_size, 10, None, None)
  File "eval/verification.py", line 290, in test
    _, _, accuracy, val, val_std, far = evaluate(embeddings, issame_list, nrof_folds=nfolds)
  File "eval/verification.py", line 185, in evaluate
    np.asarray(actual_issame), 1e-3, nrof_folds=nrof_folds)
  File "eval/verification.py", line 143, in calculate_val
    _, far_train[threshold_idx] = calculate_val_far(threshold, dist[train_set], actual_issame[train_set])
  File "eval/verification.py", line 173, in calculate_val_far
    far = float(false_accept) / float(n_diff)
ZeroDivisionError: float division by zero
After searching the previous issues, I found that #682 mentions this problem.
However, neither the author of that comment nor the issue opener shared any implementation details on how to fix it, so I have had to try different solutions blindly. I tried modifying pairs.txt as that author shows in pairs_label.txt (i.e., the 0/1 face-pair switch pattern), and I tried using a validation dataset different from the training dataset, but all of these attempts failed.
I am eager to hear from someone who has completed the validation procedure using a dataset they made themselves.
Thank you again.
Dear @YokkaBear ,
Sorry for the late reply; we've had a hard time figuring out how to run training on a PowerPC machine (with Teslas). Regarding your question: we also had this problem. It happens because insightface (like facenet) uses 10-fold cross-validation. I suppose you created your .bin file from only a single validation fold, which is why it throws this exception. I've uploaded gen_pairs_lfw.py to the directory above (if I'm not mistaken, I also took it from GitHub and modified it slightly, but I don't remember where :)).
But you can also use the scripts from the links provided by @Incline-ArmughanShahid; the last script has a parameter that lets you choose the number of folds (10, in our case).
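To illustrate the 10-fold requirement concretely: an LFW-style pairs.txt starts with a header row giving the number of folds and the pairs per fold, followed, for each fold, by matched pairs (name idx1 idx2) and mismatched pairs (name1 idx1 name2 idx2). The sketch below is a hypothetical helper, not code from either repository; the key point is that every fold must contain both matched and mismatched pairs, otherwise calculate_val_far ends up with n_same or n_diff equal to zero and divides by zero:

```python
def write_pairs_txt(path, folds):
    """Write an LFW-style pairs.txt.
    folds: list of (matches, mismatches) tuples, one per fold.
    matches: (name, idx1, idx2); mismatches: (name1, idx1, name2, idx2).
    Every fold needs BOTH kinds of pairs for 10-fold validation to work."""
    num_per_fold = len(folds[0][0])
    with open(path, "w") as f:
        f.write("%d\t%d\n" % (len(folds), num_per_fold))  # header: folds, pairs/fold
        for matches, mismatches in folds:
            for name, i, j in matches:
                f.write("%s\t%d\t%d\n" % (name, i, j))
            for n1, i, n2, j in mismatches:
                f.write("%s\t%d\t%s\t%d\n" % (n1, i, n2, j))
```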
Regards,
Many thanks for your kind guidance! But when I use my own dataset (20,000 images in total) to create the .lst and .rec files, I am confused. My train.lst looks like:
1 /home/duckj/.../1.jpg 0
1 /home/duckj/.../2.jpg 0
...
This means aligned \t path \t label, and I copied it into src/data/.
Then I made a property file, meaning there are 9980 classes in the dataset and the aligned size is 112x112:
9980,112,112
When I use src/data/face2rec2.py and run
python face2rec2.py .
I get train.rec and train.idx, but train.idx has 29981 rows. Why is it not 20000?
And when I run recognition/train.py, why does the log print an id2range larger than 9980? https://github.com/deepinsight/insightface/blob/master/recognition/image_iter.py#L58
Where did I go wrong? Could you give me some advice?
I have solved this problem: the labels in the .lst file must be incremental.
The number of .idx rows will also exceed 9980, but that doesn't matter.
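For anyone hitting the same issue, the fix can be automated. The following sketch is a hypothetical helper (not part of the repository) that remaps whatever labels an existing .lst file has onto consecutive integers starting at 0, in order of first appearance:

```python
def relabel_lst(in_path, out_path):
    """Rewrite an InsightFace .lst file ("aligned\tpath\tlabel" per line)
    so labels become consecutive integers 0, 1, 2, ... in order of first
    appearance. Returns the number of distinct identities found."""
    mapping = {}
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            aligned, path, label = line.rstrip("\n").split("\t")
            if label not in mapping:
                mapping[label] = len(mapping)  # next free consecutive id
            fout.write("%s\t%s\t%d\n" % (aligned, path, mapping[label]))
    return len(mapping)
```

This assumes each identity's lines are already grouped together in the input, as produced by the usual directory walk.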
@Incline-ArmughanShahid @Talgin I had a problem generating the pairs.txt file for my dataset. I don't know how to set --num_matches_mismatches. Could you help me?
parser.add_argument('--num_matches_mismatches',
type=int,
required=True,
help='Number of matches/mismatches per fold.')
Hi @YangYangGirl ,
property is a file that contains num_of_classes,im_size,im_size.
E.g., if you have 10000 identities, you write 10000,112,112 in your property file. Look at one of the example datasets from the Dataset Zoo (https://github.com/deepinsight/insightface/wiki/Dataset-Zoo); they contain a property file.
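As a tiny helper (hypothetical, not part of the project), writing the property file might look like:

```python
import os

def write_property(dataset_dir, num_classes, img_size=112):
    """Write the 'property' file expected by InsightFace training:
    a single line "num_classes,height,width"."""
    with open(os.path.join(dataset_dir, "property"), "w") as f:
        f.write("%d,%d,%d" % (num_classes, img_size, img_size))
```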
Thanks, I've got it running now!
Hi, I have some problems generating my own dataset. When I first run
python dir2lst.py /facedata > Face.lst
I get a .lst file. Then I apply face2rec2.py to generate the .rec and .idx files:
python face2rec2.py --num-thread 16 Face
The problem is that the generated .idx is twice the length of the .lst file. Should they be the same length?
Hi,
I am new to this. How do I train using the pretrained ArcFace model and recognize multiple faces?
@Talgin Hi, in your code that generates non-pair data for the .lst file in step 6
https://github.com/Talgin/preparing_data/blob/231bc15432f4ca810e384147e67d6bc928636f33/insightface_pairs_gen_v1.py#L105
you just randomly choose two items from the list
https://github.com/Talgin/preparing_data/blob/231bc15432f4ca810e384147e67d6bc928636f33/insightface_pairs_gen_v1.py#L87
but it is possible to get a matched pair this way (left and right from the same person). Does it matter in practice, or can we just ignore it?
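One cheap guard, sketched below as a hypothetical helper (not from the linked script), is to draw the two identities with random.sample, which guarantees they are distinct. With many identities the bias from occasional accidental matches is small, but the fix costs nothing:

```python
import random

def sample_mismatch(names):
    """Draw two guaranteed-distinct identities for a mismatched pair.
    Two independent random.choice calls can occasionally return the same
    identity, silently turning a "mismatch" row into a match and biasing
    the verification accuracy; random.sample never repeats an element."""
    n1, n2 = random.sample(names, 2)
    return n1, n2
```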
I'm getting the error
Did not find and list file with prefix D:\insightface\datasets\train
while running face2rec2.py. Can anyone help me solve this error?
len(ver_list): 1
testing verification..
(3110, 512)
infer time 54.65118900000002
Traceback (most recent call last):
  File "train.py", line 402, in <module>
    main()
  File "train.py", line 398, in main
    train_net(args)
  File "train.py", line 387, in train_net
    epoch_end_callback=epoch_cb)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/module/base_module.py", line 553, in fit
    callback(batch_end_params)
  File "train.py", line 323, in _batch_callback
    acc_list = ver_test(mbatch)
  File "train.py", line 288, in ver_test
    None, None)
  File "eval/verification.py", line 291, in test
    _, _, accuracy, val, val_std, far = evaluate(embeddings, issame_list, nrof_folds=nfolds)
  File "eval/verification.py", line 184, in evaluate
    np.asarray(actual_issame), 1e-3, nrof_folds=nrof_folds)
  File "eval/verification.py", line 147, in calculate_val
    _, far_train[threshold_idx] = calculate_val_far(threshold, dist[train_set], actual_issame[train_set])
  File "eval/verification.py", line 170, in calculate_val_far
    val = float(true_accept) / float(n_same)
ZeroDivisionError: float division by zero
I get this error when running the verification file (recognition/eval/verification.py).
Could anyone help me figure out what's going wrong, and how to fix it?
Thank you for your support 👍
Hi, @YokkaBear Are you testing on MORPH, FGNET, and CACD-VS datasets? I require some urgent help, please