CSAILVision/GazeCapture

Inference on webcam

DaddyWesker opened this issue · 18 comments

Hello. Thank you for sharing your code.

I'm currently trying to run your PyTorch code on a webcam. As I understand it, I first need to detect the face and both eyes in the frame and then run the model on that data, and I can put anything as the y-data since I only want to evaluate, not train. But one question remains: how do I get the faceGrid? What does this array contain, and is it possible to compute it somehow?

Hi, the face grid is basically just the entire photograph rescaled to 25x25 pixel resolution, where the pixels covered by the face are white and the others are black (a binary mask).
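For example, given a face bounding box from your own detector, a minimal sketch of building such a grid could look like the following (function and variable names are illustrative, not from the repo; as far as I can tell the PyTorch model consumes the grid flattened to a 625-element vector, but check the repo's data loading code for the exact layout):

```python
import numpy as np

def make_face_grid(frame_w, frame_h, face_x, face_y, face_w, face_h, grid_size=25):
    """Build a 25x25 binary face grid: 1 where the face is, 0 elsewhere.

    face_x, face_y, face_w, face_h: face bounding box in frame pixels
    (hypothetical inputs from your own face detector).
    """
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)

    # Scale the face bounding box from frame pixels to grid cells.
    scale_x = grid_size / float(frame_w)
    scale_y = grid_size / float(frame_h)
    x0 = int(round(face_x * scale_x))
    y0 = int(round(face_y * scale_y))
    x1 = int(round((face_x + face_w) * scale_x))
    y1 = int(round((face_y + face_h) * scale_y))

    # Clamp to the grid and mark the face region as white (1).
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(grid_size, x1), min(grid_size, y1)
    grid[y0:y1, x0:x1] = 1.0

    # Flatten to a 625-dim vector (assumed input layout of the PyTorch model).
    return grid.flatten()
```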

Thank you for your answer. I will try to run inference on the webcam in a few days and will close the issue when I'm done.

@erkil1452 can you also tell me what the output of the model is? I mean, there are two numbers. Is this some point on the image where the eyes are looking, or the lat/lon of the gaze?
Is it possible for the output to be negative?

The output is the XY coordinates of the fixation on the screen plane, assuming the camera is orthogonal to the screen and located in the screen plane (like the front-facing camera of a phone). The coordinates are relative to the camera location; X points right and Y points up. If your camera is not in the top-left corner, you will see negative coordinates.

Hm. Any suggestions on how to transform those coordinates to the image plane? I actually don't get the part about the "camera in the top left corner". Top left corner of what? Do you mean the camera's coordinate system?

Take a look at Figure 6 in the paper. The hole in the middle is the camera and the rectangles around it are the iPad (bigger) and iPhone (smaller, orange) displays at 4 different orientations (only 3 for the iPhone). The X and Y axes correspond exactly to the numbers you get from the network.
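As a rough illustration (not from the repo), assuming the prediction is in centimetres and you have measured where your camera sits relative to the screen's top-left pixel, the mapping to screen pixels could look like this sketch:

```python
def cm_to_screen_px(pred_x_cm, pred_y_cm,
                    cam_offset_x_cm, cam_offset_y_cm,
                    screen_w_cm, screen_h_cm,
                    screen_w_px, screen_h_px):
    """Map a camera-relative gaze prediction (cm, X right, Y up) to screen pixels.

    cam_offset_*: camera position measured from the screen's top-left corner
    (hypothetical values you measure on your own setup).
    """
    # Shift from camera-relative to screen-top-left-relative coordinates.
    # Screen pixel Y grows downward, while the prediction's Y grows upward.
    x_cm = cam_offset_x_cm + pred_x_cm
    y_cm = cam_offset_y_cm - pred_y_cm

    # Convert physical centimetres to pixels.
    px = x_cm * screen_w_px / screen_w_cm
    py = y_cm * screen_h_px / screen_h_cm
    return px, py
```

This only makes sense if the network's predictions are already roughly correct for your camera; otherwise you need the calibration discussed further down.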

Hm. If I got you right, X and Y is a point in the real world relative to the camera. For example, I'm receiving [-62, 33] as the output for my webcam frame. Does that mean I'm looking roughly 62 cm to the left of and 33 cm above the camera? Like, to the top-left relative to the camera?

That seems way too much. The model is trained using iPhone and iPad, so the range of predicted gaze should stay within +/- 20 cm. I suggest checking whether the image normalization works properly (e.g. not subtracting a float bias from a uint8 image, or the other way round). Some people also reported that their webcam did not perform well, while running the model with an actual iPhone camera fixed the issue.
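To illustrate the dtype pitfall, a minimal sketch (the helper and names below are only illustrative; check main.py and the repo's data loader for the exact transform order and mean images):

```python
import numpy as np

def normalize_crop(frame_crop_uint8, mean_img_float):
    """Subtract a float mean image from a uint8 crop safely.

    Convert the crop to float *before* subtracting, so both operands live
    in the same 0..255 range; mean_img_float is assumed to be in 0..255 as well.
    """
    img = frame_crop_uint8.astype(np.float32)   # uint8 -> float32, still 0..255
    img -= mean_img_float.astype(np.float32)    # subtract the mean in the same range
    return img / 255.0                          # scale down, roughly [-1, 1]
```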

Hello everyone and thanks for the work done here.

@DaddyWesker Did you find a solution? I would love to see your code for the inference on camera.

BR

@adrienju

Well, it seems that I forgot image normalization. So I'm currently trying to apply it as in the main.py of this repo. I'll get back here when I'm done.

Well, here is what my code looks like now. After applying normalization, I was able to get output numbers in the [-10, 10] range. Unfortunately, I don't know how to properly transfer those cm to the image plane (consider the current imshow code a stub)... So I can't tell if it works correctly.

eval_on_webcam.txt

If your device differs from the iPad/iPhone, then you probably have the best chance by acquiring a few calibration measurements (look at a known position on the screen, record that position and compare it to the network prediction) and then fitting a simple transformation model that maps the predictions to physical space, e.g. a simple 4x4 transformation matrix using OpenCV's solvePnP.

AmKhG commented

@erkil1452 Thanks for the awesome work. Is the PyTorch checkpoint trained on the full dataset? When trying the PyTorch checkpoint on my webcam, I understand we need to register the output values to the laptop camera's coordinate system, but shouldn't the numbers be "relatively" correct? For instance, if I look at the 4 opposite corners of the screen, shouldn't the numbers change accordingly? What am I missing here?

The model has been trained on the full dataset, but the dataset only contains images from iPhone/iPad cameras. I have tested the model with a high-quality external USB webcam and the predictions correlated directionally quite well, but I saw a report of somebody running into issues with a low-quality embedded camera.

Hi @erkil1452, I didn't understand how I could use OpenCV's solvePnP for calibration. Could you elaborate on what kind of model this transformation matrix is and how we could train/fit it? Could you recommend something other than OpenCV's solvePnP?

Sorry, wrong method. I think you will probably do better if you just write the transformation between the network output and the desired GT gaze points as a linear problem and solve it using least squares.
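For example, with a handful of prediction / ground-truth pairs collected during a calibration session, an affine least-squares fit could look like this sketch (all names are illustrative):

```python
import numpy as np

def fit_linear_calibration(preds, targets):
    """Fit an affine map  target ≈ A @ pred + b  by least squares.

    preds:   (N, 2) array of network outputs recorded while the user looked
             at known points (hypothetical calibration session).
    targets: (N, 2) array of the corresponding known gaze positions
             (e.g. screen pixels, or cm in your own coordinate system).
    Returns a function mapping new predictions to the target space.
    """
    # Append a constant 1 to each prediction so the bias b is part of the solve.
    X = np.hstack([preds, np.ones((preds.shape[0], 1))])     # (N, 3)
    # Solve X @ W ≈ targets for W (3x2) in the least-squares sense.
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)

    def apply(pred_xy):
        p = np.append(np.asarray(pred_xy, dtype=float), 1.0)
        return p @ W

    return apply
```

Since the affine map only has 6 parameters, even a small grid of calibration targets (e.g. 3x3 points on the screen) is enough to fit it.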

I have a thought to share. The GazeCapture dataset has an orientation field that defines the orientation of the interface. Though the orientation is not fed into the model, it can be used to convert the model's prediction (x, y) to specific points on the iPad or iPhone. But when we use the model on a PC, where the orientation is fixed at 1 (portrait), the model may still predict points as if the orientation were 2, 3 or 4. Because the orientation is unknown, we can't convert the points to the right place. Is my thought right?

The screen orientation has been taken into account when making the dataset. Effectively, our gaze location [x,y] is always given wrt. the camera lens (assuming the camera lens is in the display plane). That is, [0,0] means the person looks into the camera (which cannot happen in our dataset design). Together, all 4 orientations create an expanded display area in the shape of a cross (see the histograms in the paper).

What this means is that for another display geometry (such as a PC), you could just measure the XY offset between your camera and e.g. pixel [0,0] and add it to the prediction. However, that will only hold as long as your camera parameters (FOV, color tone, ...) are within the range of parameters of the iPad/iPhone cameras used for training. If your camera images look somehow "different", the network is forced outside of its comfort zone and it is not really defined what it does. The hope is that it still maintains a semi-linear relation between the predicted and true gaze, so a simple linear transformation (add some, multiply some) can be calibrated (even by manual tweaking) to bring the predictions into a match with reality. In my experience this works ok-ish.

You can of course go further and train a small fully connected network (e.g. 2 layers, 8 neurons each) to do a more generic transformation, but you will need more calibration data (prediction-GT pairs collected on your setup). In case your camera is far away from the screen, it generally helps if it sees the face as frontally as possible. Ideally, one should check the full frames in our dataset and compare them (visually) with their own camera images to get some idea of what may be the source of issues.
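As an illustration of that last suggestion, a minimal PyTorch sketch of such a small calibration network (2 layers, 8 neurons each; all names and hyperparameters are illustrative, not part of the repo):

```python
import torch
import torch.nn as nn

class GazeCalibrator(nn.Module):
    """Tiny MLP mapping raw network predictions (x, y) to calibrated gaze points."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 8), nn.ReLU(),
            nn.Linear(8, 8), nn.ReLU(),
            nn.Linear(8, 2),
        )

    def forward(self, pred_xy):
        return self.net(pred_xy)

def fit_calibrator(preds, targets, epochs=2000, lr=1e-2):
    """preds/targets: (N, 2) float tensors of prediction / ground-truth pairs
    collected on your own setup (hypothetical calibration data)."""
    model = GazeCalibrator()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(preds), targets)
        loss.backward()
        opt.step()
    return model
```

With only a few dozen calibration pairs a network like this can easily overfit, so the simple linear fit above is often the safer choice unless you collect more data.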