some question about gaze estimation
@erkil1452 hello, I want to ask a question:
I only have an RGB camera, which I use to capture images of people's faces. How can I get the gaze vector (pitch, yaw) from these images? Is there a method or paper I could refer to? Thanks in advance!
It should be enough to run the model as provided to get a 3D gaze vector [x, y, z] in the coordinate system described in the supplement. You can then convert it to spherical coordinates to get the yaw and pitch. It will probably be something like:
yaw = atan(x' / z')
pitch = asin(y')
where
[x',y',z'] is the normalized predicted gaze vector [x,y,z] / || [x,y,z] ||
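For reference, here is a minimal Python sketch of that conversion. It assumes the coordinate convention described above; it uses atan2 instead of atan only to keep the correct quadrant, so adjust the axes and signs if your setup differs.

```python
import numpy as np

def gaze_vector_to_yaw_pitch(gaze):
    """Convert a 3D gaze vector [x, y, z] to (yaw, pitch) in radians.

    Assumes the coordinate convention described above; adjust axes/signs
    if your setup differs.
    """
    g = np.asarray(gaze, dtype=np.float64)
    g = g / np.linalg.norm(g)      # [x', y', z'] = [x, y, z] / ||[x, y, z]||
    x, y, z = g
    yaw = np.arctan2(x, z)         # atan(x' / z'), atan2 keeps the quadrant
    pitch = np.arcsin(y)           # asin(y')
    return yaw, pitch
```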
@erkil1452 I am sorry, I did not express the question clearly. I want to know how to build my own dataset so that I can use it to train the model. I only have an RGB camera, which I use to capture images of people's faces. How do I label these images, i.e. assign each image a gaze vector (pitch, yaw), to form my own dataset?
Alternatively, are there methods to get the gaze vector (pitch, yaw) from an RGB image alone, with or without deep learning? Is there a paper I could refer to? Thank you!
Ok, now I am not entirely sure I fully understand. You say you want to collect your own dataset to train a neural network. Do I assume correctly you want to train >>our<< neural network? That means you want to reproduce our dataset collection procedure?
If all you have are RGB images of people, then there is not really any way to obtain ground-truth gaze labels. In our paper, we describe that we told people where to look (so we knew where they were looking) and we tracked both their positions and the position of the gaze target in 3D space. That allowed us to trivially compute the ground-truth gaze labels. If you have not asked your subjects to look at a specific point, or you have not tracked that point, then I do not see a way to obtain ground-truth labels.
Of course, you can use any deep learning method for gaze prediction to label your dataset with approximate labels. In that case you can use our model. The input to our model is an image containing the head: a square crop of the head area. It can be detected by any body pose estimator; we used OpenPose, AlphaPose and DensePose at different stages of the project. However, keep in mind that whatever deep learning method you use to produce the gaze labels, they are not ground-truth labels. There may always be a lot of noise.
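As a rough sketch of how one could form such a square head crop from the head keypoints returned by a pose estimator: the keypoint selection and the `scale` factor below are my own guesses, not something prescribed by the project, so adapt them to whatever OpenPose/AlphaPose outputs in your pipeline.

```python
import numpy as np

def square_head_crop(image, head_keypoints, scale=1.8):
    """Crop a square region around detected head keypoints.

    `head_keypoints` is an (N, 2) array of (x, y) pixel coordinates of
    head-related joints (e.g. nose, eyes, ears) from any pose estimator.
    `scale` enlarges the box so the whole head fits; tune it for your data.
    The crop may be clipped (and thus non-square) at image borders.
    """
    pts = np.asarray(head_keypoints, dtype=np.float64)
    center = pts.mean(axis=0)
    # Half-size of the square: largest keypoint offset from the center, padded.
    half = scale * np.abs(pts - center).max()
    x0, y0 = (center - half).astype(int)
    x1, y1 = (center + half).astype(int)
    h, w = image.shape[:2]
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, w), min(y1, h)
    return image[y0:y1, x0:x1]
```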
@erkil1452 Thank you for your answer. As you said, if I use your model to predict on my own images, there will be a lot of noise. So could I instead use something like the "2D eye feature regression" method? The details are as follows:
Ec is the center of the eyeball, Oc is the center of the iris, P1 and P2 are the inner and outer canthus, U1 and U2 are the junctions of the upper eyelid and the iris, and d is the vector that passes through the center of the eyeball and the center of the iris.
Can we then consider d to be the gaze vector and use the formulas
yaw = atan(x' / z')
pitch = asin(y')
to get the yaw and pitch values?
Do you know of other methods like this that can get the gaze vector (pitch, yaw) from a single RGB image?
Thank you!
What you describe is feature-based gaze prediction. I believe that unless you have a very high-resolution image, there will be a lot of noise as well. Also, the relative position of the iris and the canthi does not account for head motion, so this will only work for a head fixed in a headrest, and only after you calibrate it using known calibration points. If that is your setup, then go for it.
However, if what you want is a general predictor of gaze, then I am afraid you cannot do much better than using a model similar to ours. There are many newer papers that report some degree of improvement in accuracy, but the larger picture stays the same. If you can, use an active eye tracker with infrared emitters and sensors. These allow the system to determine the relative orientation of the eye w.r.t. the sensor, so the head pose and rotation no longer matter. You can find such trackers from vendors such as Pupil Labs, Tobii or SMI.
@erkil1452 Thank you for your answer! Let me elaborate on my problem. I have an RGB camera and an IR camera installed in a car; these cameras are used to monitor the driver. I want to do gaze estimation to monitor the driver's state, so I want to collect my own dataset to train the neural network, but I do not know how. I have spent a long time searching for a solution on the internet but found nothing. Do you have any suggestions? If possible, could you describe the data collection and annotation process in detail? For example, do I need to prepare an object such as a ping-pong ball, ask people to stare at it, and capture images with the camera? And then how do I label the gaze ground truth? I have no idea; can you give me some suggestions?
Here are some of my wild thoughts.
- In the paper "Appearance-Based Gaze Estimation in the Wild", the authors describe how they collected the MPIIGaze dataset. They use a monocular RGB camera like mine. They use a mirror-based calibration method described in "Camera pose estimation using images of planar mirror reflections" to compute the 3D coordinates of the target, denoted ft. They use face detection and facial landmark detection to locate landmarks in the input image, and with a generic 3D facial shape model and the EPnP algorithm they compute the rotation and translation vectors; this lets them convert 2D coordinates to 3D and obtain the 3D eye coordinates, denoted fe. The gaze direction is then fe - ft. Can I use this method to label my own dataset?
- I also want to try the feature-based gaze prediction described above: use a deep learning method to get the positions of the center of the eyeball and the center of the iris, then estimate the gaze from them. Would this work for low-resolution images?
- If I want to collect a dataset like yours, do I need a 3D camera with depth? Then I could get the 3D coordinates of the target and of the eye, and compute the gaze ground truth.
Thank you!
Ok, I get it now. This is getting a bit beyond the scope of our project. Let me just give some brief comments and notes.
- MPIIGaze is annotated based on known location on the screen:
> Every 10 minutes the software automatically asked participants to look at a random sequence of 20 on-screen positions (a recording session), visualised as a grey circle shrinking in size and with a white dot in the middle. Participants were asked to fixate on these dots and confirm each by pressing the spacebar once the circle was about to disappear.
You could do something like that as well. It does not matter whether you use a screen or a ping-pong ball (as in https://www.idiap.ch/en/dataset/eyediap ). What matters is that you need to determine the 3D position of that target (ball or screen) in the camera view, and you need to do the same for the person. For the ball, I can imagine using multiple calibrated cameras (see multiview stereo camera calibration) and triangulating the ball position; there is a short triangulation sketch after this list. For a display you can use markers (ArUco codes, AprilTags) that you can also track with multiple cameras. Then you ask a person to look at your target, confirm that they do, and grab a picture; you move the target and repeat. In our setup we used continuous tracking and relied on the human ability to pursue a target quite well, but then you need to make sure your cameras are time-synchronized, which is not trivial. Generally speaking, look into multiview camera capture.
For getting the 3D position of the person you can either rely on face scale as a cue (MPIIGaze uses that, I believe) or use an additional depth camera (e.g. Kinect Azure).
Instead of the multicamera setup, you can use the mirror-based calibration if your layout allows for it (as in MPIIGaze). That means your camera should be close to the places where people will be looking. I do not think it would work with a ball target though; it is only useful for a display, because the display is fixed relative to the camera and the relative transformation between the camera and the display will not change after the calibration.
- Yes, it should work in principle. There can still be fail cases but that is unavoidable.
- Yes, you can use depth to fully track either the face or even the target (e.g. the ball). In theory you can track both with a single camera, but such cameras typically have a limited FOV, so it may be tricky to fit everything into the view. Also, the accuracy drops with distance, and you are limited by the space in the car, so you cannot easily move the camera farther away to fix that.
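For the ball-triangulation idea mentioned above, here is a minimal two-camera sketch using OpenCV's linear triangulation. The projection matrices and 2D detections are placeholders for whatever your multi-camera calibration and ball detector produce; with more than two cameras you would solve a least-squares version of the same problem.

```python
import cv2
import numpy as np

# Intrinsics and extrinsics from your multi-camera calibration (placeholder values).
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                 # reference camera
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])  # ~20 cm baseline

# 2D pixel positions of the ball centre detected in each view (2x1 arrays).
pt1 = np.array([[700.0], [400.0]])
pt2 = np.array([[620.0], [400.0]])

# Linear triangulation; returns a homogeneous 4x1 point.
X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)
ball_3d = (X_h[:3] / X_h[3]).ravel()   # 3D position in the reference camera frame
print(ball_3d)
```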
@erkil1452 I think I understand what you said. Maybe I should imitate the data collection process of MPIIGaze, EYEDIAP or your dataset to collect and label the training data in the car using multiple cameras, and then at inference time use only one RGB camera to capture the image. I am not sure whether the camera model used for collecting the training data and the camera used at inference time should be the same. Another thing I want to confirm: must the training data be collected in the car? Could we collect the training data outside the car, use it to train the neural network, and then run inference on images captured in the car? Thank you very much!
The training cameras, their placement and the rest of the setup should be as close as possible to your application.
You can record the dataset outside of the car with a different camera, but you should make sure the view angles are similar to those that will be used in the car, that the illumination is the same, etc. That is quite difficult to ensure in practice, so you may rather want to vary these variables to such an extent that the car conditions are safely covered within the variation space.
@erkil1452 Recently I noticed an interesting work, "PureGaze: Purifying Gaze Feature for Generalizable Gaze Estimation"; I do not know whether you have seen this paper. It says: "We propose a domain-generalization framework for gaze estimation. Our method is only trained in the source domain and brings improvement in all unknown target domains. The key idea of our method is to purify the gaze feature with a self-adversarial framework." I want to know whether I can train the PureGaze model on a public dataset such as your Gaze360 dataset and then run inference directly on my own images captured in the car, without collecting my own dataset.
@erkil1452 Hello, you mentioned "For the ball I can imagine using multiple calibrated cameras (see multiview stereo camera calibration) and triangulating the ball position". Because I am new to this, I do not fully understand what that means, and I do not know what I should do to get the 3D coordinates of the ball. Can you give me some references or web links introducing this method?
I also want to confirm something: is the multiview camera setup like the one shown in the figure, i.e. multiple RGB cameras,
or the camera used in your paper?
Thank you very much!!
@erkil1452 I have another idea. Could I use a ToF camera and an RGB camera to capture images at the same time, use the ToF image to compute the 3D coordinates of the target and of the person, and then compute the gaze ground truth? Then I could use the RGB images together with that gaze ground truth to train your model. Do you think that is a viable solution?
Also, are there methods to get the pitch, yaw and roll values from the predicted gaze? You have already told me how to calculate pitch and yaw; how do I compute roll?
- I think you can use PureGaze with any dataset. You will also have to apply it to your new data in order to make the method work. Based on what they promise, it should help you transfer from Gaze360 training data to your setup, but there may be a limit to what it can do. I do not think it will help the method generalize to completely unseen angles; e.g. the Gaze360 dataset has no pictures of people from above, so that would still not work.
- Yes, your image is correct. You can refer to Szeliski's famous (free) book on computer vision, Chapter 12, "3D reconstruction".
- Yes, that is feasible if your scene fits into the camera view. Time-of-flight cameras (that is what you meant?) often have quite a small FOV (experience from working with Kinect Azure). A small sketch of computing the labels this way follows below.
- If you consider gaze to be a 3D vector, it does not make sense to speak about roll, because a line in 3D only has 5 degrees of freedom (the xyz position of a point on the line and two orientation angles). If you refer to the roll of the head itself, that is an entirely different type of task.
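As a small illustration of the labelling step discussed above: assuming you have the 3D eye and target positions in the same coordinate frame as the RGB camera (e.g. from the depth/ToF data), and that the gaze vector points from the eye to the target with the same yaw/pitch convention as earlier in this thread, the ground-truth label could be computed roughly like this.

```python
import numpy as np

def gaze_label(eye_3d, target_3d):
    """Ground-truth gaze from tracked 3D positions.

    `eye_3d` and `target_3d` are 3D points (e.g. from a depth/ToF camera)
    expressed in the same coordinate frame as the RGB camera. The gaze
    vector points from the eye towards the target.
    """
    g = np.asarray(target_3d, dtype=np.float64) - np.asarray(eye_3d, dtype=np.float64)
    g = g / np.linalg.norm(g)       # unit gaze direction
    yaw = np.arctan2(g[0], g[2])    # same convention as earlier in the thread
    pitch = np.arcsin(g[1])
    return g, yaw, pitch
```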
Thank you very much! I will practice what you said.