cvlab-epfl/tf-lift

What is the size of the training patches? 128×128 or 64×64?

Closed this issue · 5 comments

Hello, I want to know the size of the training patches used to train the LIFT network for the 3 subtasks.
For DESC, are the patches 128×128 or 64×64?
And what about the other 2 subtasks?

128x128 for the detector, then a 64x64 patch is cropped from it for the other two tasks; see the config option:

```python
net_arg.add_argument("--ori_input_size", type=int, default=64, help="")
```
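
In other words, when the keypoint sits at the centre of the 128x128 patch and no rotation is applied, the 64x64 input for ORI/DESC is roughly the central crop. A minimal numpy sketch of that, purely for illustration (the repo itself does this inside the network with a spatial transformer):

```python
import numpy as np

def center_crop(patch, out_size=64):
    """Central out_size x out_size window of a square patch (e.g. 128x128 -> 64x64)."""
    h, w = patch.shape[:2]
    off_y, off_x = (h - out_size) // 2, (w - out_size) // 2
    return patch[off_y:off_y + out_size, off_x:off_x + out_size]

detector_patch = np.zeros((128, 128), dtype=np.float32)  # placeholder data
ori_desc_patch = center_crop(detector_patch)             # matches --ori_input_size=64
print(ori_desc_patch.shape)                              # (64, 64)
```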

So for the first training part, DESC, you used 64×64 patches to start the training?
But I am confused about the rest of the training procedure:
in the ORI part were the patches still 64×64, and then in the KP part were they changed to 128×128? Is that right? It seems not.
How were the patches used in the latter 2 parts?

Patches are cropped around SIFT keypoints at the SIFT scale, and then we take a larger context (IIRC 2x on each side, so 4x the area) to give the detector more context. Once a point is selected, we crop 64x64 patches from the 128x128 patch. This is all explained in Fig. 2 and the accompanying text (look at the blue squares for a clear illustration):

[screenshot of Fig. 2 from the LIFT paper]

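A rough standalone OpenCV sketch of that cropping scheme, written for this thread rather than taken from the repo's data-generation code: the scale-to-pixels conversion (`sigma = kp.size / 2`) and the support multiplier are assumptions (the last comment in this thread notes the paper quotes 6 x sigma / 12 x sigma under the OpenCV convention), and the file name is hypothetical.

```python
import cv2

def crop_patches(image, kp, context=2.0, support_mult=6.0):
    """Crop the detector (128x128) and ORI/DESC (64x64) patches around one
    SIFT keypoint. `support_mult` maps the SIFT scale to a pixel half-width
    (assumption); `context` is the extra 2x context given to the detector."""
    sigma = kp.size / 2.0        # assumption: treat half of kp.size as the scale
    base = support_mult * sigma  # half-width, in pixels, of the ORI/DESC support

    def crop(half_width, out_size):
        side = int(round(2 * half_width))
        patch = cv2.getRectSubPix(image, (side, side), kp.pt)
        return cv2.resize(patch, (out_size, out_size))

    detector_patch = crop(context * base, 128)  # larger context for the detector
    ori_desc_patch = crop(base, 64)             # tighter crop for ORI/DESC
    return detector_patch, ori_desc_patch

# Usage sketch
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
kps = cv2.SIFT_create().detect(img, None)
if kps:
    det_patch, od_patch = crop_patches(img, kps[0])
    print(det_patch.shape, od_patch.shape)  # (128, 128) (64, 64)
```
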
I still have a question about your reply.
You trained the LIFT network in the sequence DESC -> ORI -> KP, right? I believe that is correct.
Below I describe my understanding of your training procedure, and I hope you can point out the errors in it.
Thank you in advance.

First, you generate the training data from VSFM as groups of 4 patches of size 128×128.
Next, you crop 64×64 patches from the P1-P2-P3 patches for DESC training. (Is the keypoint location from VSFM at the center of the cropped 64×64 patches? I am wondering how the crop center is chosen.)
Then you rotate the 64×64 P1-P2-P3 data and do the ORI training.
Finally, for the KP part of training, I got stuck: the output of the ORI part is 64×64 patches, but KP requires 128×128.
How did you handle this size change? Or did I misunderstand something?

I think of your LIFT training data flow as going from right to left, so I cannot understand the KP part; I am confused.
Sorry for bothering you with this.

It was descriptor first, then orientation with the already trained descriptor, and IIRC desc was pre-trained separately and then everything fine-tuned together. You have the details in the paper.

When training only the descriptor or the orientation/descriptor combo, we crop 64x64 patches from the larger 128x128 patch with a spatial transformer, at the VSFM keypoints (location, scale, and if necessary orientation), or according to the decisions of each module when training them all together. So if the SIFT scale (patch extent) is sigma, we crop a patch at 2 x sigma (to give it more context), then resize it to 128x128 and use it as-is for the detector. For the orientation/descriptor modules we crop patches at sigma (not 2 x sigma) and resize them to 64x64 instead. This is clearer if you look at the figure (see the square drawn with a dashed line, which is the 128x128 patch, and the smaller squares, which are the cropped patches).
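
The spatial-transformer crop described above can be sketched (translation and scale only, no rotation) with tf.image.crop_and_resize. The tensor names below are made up, and keypoint scales are assumed to already be expressed as a half-width in pixels, so this is only an illustration of the idea, not the repo's implementation:

```python
import tensorflow as tf

def crop_at_keypoints(images, centers, half_widths, out_size):
    """Differentiable crop-and-resize around keypoints.

    images:      [B, H, W, C] float tensor
    centers:     [B, 2] keypoint (x, y) in pixels
    half_widths: [B]    half of the desired support region, in pixels
    out_size:    int, side of the output patch (128 for KP, 64 for ORI/DESC)
    """
    h = tf.cast(tf.shape(images)[1], tf.float32)
    w = tf.cast(tf.shape(images)[2], tf.float32)
    x, y = centers[:, 0], centers[:, 1]
    # crop_and_resize expects normalized [y1, x1, y2, x2] boxes
    boxes = tf.stack([(y - half_widths) / (h - 1.0),
                      (x - half_widths) / (w - 1.0),
                      (y + half_widths) / (h - 1.0),
                      (x + half_widths) / (w - 1.0)], axis=1)
    box_indices = tf.range(tf.shape(images)[0])
    return tf.image.crop_and_resize(images, boxes, box_indices,
                                    crop_size=[out_size, out_size])

# Toy usage with random data: detector input at 2x the support, ORI/DESC
# input at 1x (the sigma / 2 x sigma story above), from the same keypoints.
images = tf.random.uniform([4, 240, 320, 1])
centers = tf.constant([[160.0, 120.0]] * 4)  # (x, y) in pixels
support = tf.constant([20.0] * 4)            # sigma-sized half-width, in pixels
kp_patches = crop_at_keypoints(images, centers, 2.0 * support, 128)  # [4, 128, 128, 1]
ori_patches = crop_at_keypoints(images, centers, support, 64)        # [4, 64, 64, 1]
```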

Edit: the paper says 6 x sigma and 12 x sigma instead of sigma and 2 x sigma, because of the convention followed by OpenCV.
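
As a concrete check of that convention (the sigma value here is just an example number):

```python
sigma = 2.0                      # example SIFT scale, arbitrary
ori_desc_support = 6.0 * sigma   # support quoted in the paper, resized to 64x64
detector_support = 12.0 * sigma  # twice that, resized to 128x128 for the detector
print(ori_desc_support, detector_support, detector_support / ori_desc_support)
# 12.0 24.0 2.0  -> the 2x context ratio is unchanged
```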