Proctoring-AI

A project to create an automated proctoring system in which the user is monitored automatically through the webcam and microphone. The project is divided into two parts: vision-based and audio-based functionalities. An explanation of some of the functionalities can be found in my Medium article.

Prerequisites

For vision:

Tensorflow>2
OpenCV
sklearn=0.19.1 # for face spoofing. The model used was trained with this version and does not work with more recent ones.

For audio:

pyaudio
speech_recognition
nltk

Vision

It has six vision based functionalities right now:

Track eyeballs and report if the candidate is looking left, right or up.
Detect whether the candidate opens their mouth by comparing against the distance between the lips recorded at the start.
Instance segmentation to count the number of people and report if no one or more than one person is detected.
Find and report any instances of mobile phones.
Head pose estimation to find where the person is looking.
Face spoofing detection.

Face detection

Earlier, Dlib's frontal face HOG detector was used to find faces. However, it did not give very good results. Different face detection models are compared in face_detection, and OpenCV's DNN module gives the best results. The results are presented in this article.

It is implemented in face_detector.py and is used for tracking eyes, mouth opening detection, head pose estimation, and face spoofing.

An additional quantized model has also been added for the face detector, as described in Issue 14. It can be used by setting the parameter quantized to True when calling get_face_detector(). In a quick test of the face detector on my laptop, the normal version gave ~17.5 FPS while the quantized version gave ~19.5 FPS. Because it is uint8 quantized, this model should be especially useful when deploying on edge devices.

Facial Landmarks

Earlier, Dlib's facial landmarks model was used, but it did not give good results when the face was at an angle. Now, a model provided in this repository is used. A comparison between the two and the reason for choosing the new Tensorflow-based model are given in this article.

It is implemented in face_landmarks.py and is used for tracking eyes, mouth opening detection, and head pose estimation.

Note

If you want to use the dlib models, check out the old-master branch.

Eye tracking

eye_tracker.py is used to track the eyes. A detailed explanation is provided in this article. However, the article was written using dlib.

Mouth Opening Detection

mouth_opening_detector.py is used to check if the candidate opens their mouth during the exam, after recording the closed-mouth lip distance initially. Its explanation can be found in the main article. However, the article uses dlib, which can easily be swapped for the new models.
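The baseline comparison described above can be sketched roughly as follows. This is only an illustration, not the code in mouth_opening_detector.py: the function names, the use of a single pair of lip landmarks, and the tolerance factor of 1.5 are all assumptions made for the example.

```python
from math import dist


def lip_distance(upper_lip, lower_lip):
    """Euclidean distance between two (x, y) lip landmark points."""
    return dist(upper_lip, lower_lip)


def is_mouth_open(baseline, current, tolerance=1.5):
    """True when the current lip distance exceeds the baseline by the tolerance factor.

    The factor 1.5 is a hypothetical value chosen for illustration.
    """
    return current > baseline * tolerance


# Baseline recorded once at the start, with the mouth closed.
baseline = lip_distance((100, 200), (100, 206))
# Lip distance measured on a later frame.
current = lip_distance((100, 198), (100, 214))
print(is_mouth_open(baseline, current))
```

In practice several landmark pairs along the lips would probably be compared, so that one noisy landmark does not trigger a false report.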

Person counting and mobile phone detection

person_and_phone.py is used for counting persons and detecting mobile phones. It uses YOLOv3 in Tensorflow 2, which is explained in more detail in this article.
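Once the detector has returned class labels for a frame, the reporting rule can be sketched as below. This is a minimal illustration and not the code in person_and_phone.py: the function name and alert strings are hypothetical, and the labels "person" and "cell phone" assume the detector uses the COCO class names.

```python
from collections import Counter


def summarize_detections(labels):
    """Turn the class labels detected in one frame into proctoring alerts."""
    counts = Counter(labels)
    alerts = []
    if counts["person"] == 0:
        alerts.append("No person detected")
    elif counts["person"] > 1:
        alerts.append("More than one person detected")
    if counts["cell phone"] > 0:
        alerts.append("Mobile phone detected")
    return alerts


# Example frame: two people and a phone are visible.
print(summarize_detections(["person", "person", "cell phone"]))
```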

Head pose estimation

head_pose_estimation.py is used for finding where the head is facing. An explanation is provided in this article.

Face spoofing

face_spoofing.py is used for finding whether the face is real or a photograph or image. An explanation is provided in this article. The model and approach are taken from this GitHub repo.

FPS obtained

Functionality	FPS on Intel i5
Eye Tracking	7.1
Mouth Detection	7.2
Person and Phone Detection	1.3
Head Pose Estimation	8.5
Face Spoofing	6.9

If you are testing on a different processor or a GPU, consider making a pull request to add the FPS obtained on that hardware.
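For anyone contributing such numbers, one simple way to measure FPS is to time a processing function over a batch of frames. The helper below is a hedged sketch: measure_fps and the dummy workload are hypothetical names for illustration, and it is not necessarily how the figures in the table above were obtained.

```python
import time


def measure_fps(process_frame, frames):
    """Average frames per second of process_frame over the given frames."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def dummy_process(frame):
    """Stand-in for a real per-frame function such as head pose estimation."""
    time.sleep(0.01)


fps = measure_fps(dummy_process, range(5))
print(f"{fps:.1f} FPS")
```

When measuring a real functionality, pass the per-frame function and frames read from the webcam or a recorded video, and average over a few hundred frames to smooth out start-up costs.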

Audio

It is divided into two parts:

Audio from the microphone is recorded and converted to text using Google's speech recognition API. A separate thread is used to call the API so that the recording is disturbed as little as possible. This thread processes the last recording, appends its text to a text file and deletes the recording.
Using NLTK, the stopwords are removed from that file. The question paper (in txt format) is also stripped of its stopwords, and the contents of the two are compared. Finally, the common words, along with their counts, are presented to the proctor.
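The separation between recording and transcription in the first part can be sketched as a producer/consumer pair. This is an illustration under simplifying assumptions rather than the code in audio_part.py: fake_transcribe is a hypothetical stand-in for the speech recognition API call, strings stand in for recorded audio files, and an in-memory list stands in for the text file.

```python
import queue
import threading


def fake_transcribe(chunk):
    """Hypothetical stand-in for the speech recognition API call."""
    return chunk.upper()


def transcriber(chunks, transcript):
    """Consume recorded chunks until a None sentinel arrives."""
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        transcript.append(fake_transcribe(chunk))


chunks = queue.Queue()
transcript = []
worker = threading.Thread(target=transcriber, args=(chunks, transcript))
worker.start()

# The "recording" loop only enqueues chunks, so it never waits on the API.
for chunk in ["first clip", "second clip"]:
    chunks.put(chunk)
chunks.put(None)  # signal that recording has finished
worker.join()
print(transcript)
```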

The code for this part is available in audio_part.py
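The comparison in the second part can be sketched as follows. To keep the example self-contained, a tiny hard-coded stop word set stands in for NLTK's English stop word list; the function names and the choice to count occurrences in the transcript are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny stand-in stop word list (assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "what", "it", "i", "think"}


def content_words(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def common_words(transcript, question_paper):
    """Count how often each question-paper word was spoken in the transcript."""
    question_vocab = set(content_words(question_paper))
    spoken = Counter(content_words(transcript))
    return {word: n for word, n in spoken.items() if word in question_vocab}


transcript = "I think the capital of France is Paris and Paris is in France"
question_paper = "What is the capital of France?"
print(common_words(transcript, question_paper))  # {'capital': 1, 'france': 2}
```

A high count of question-paper words in the transcript suggests the candidate may be reading the questions aloud or discussing them, which is what the proctor would want to review.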

To do

~~Replace the HOG based descriptor with the Caffe model in OpenCV's DNN module, which will also solve the issues created by side faces and occlusion.~~
~~Replace the dlib based facial landmarks with the CNN based facial landmarks as used in head_pose_detector.~~
Make a better face spoofing model as the accuracy is not good currently.
Use a smaller and faster model in place of YOLOv3 that can give good FPS on a CPU.
Add a vision based functionality: face recognition such that no one else replaces the candidate and gives the exam midway.
Add a vision based functionality: id-card verification.
~~Update README with videos of each functionality and the FPS obtained.~~
~~Add documentation (docstring) in functions in codes.~~

Problems

Speech-to-text conversion might not work well for all dialects.

Contributing

If you have any other ideas, or complete any item from the to-do list, consider making a pull request. Please update the README as well in the pull request.

License

This project is licensed under the MIT License; see the LICENSE.md file for details. However, the facial landmarks detection model was trained on datasets restricted to non-commercial use, so I am not sure whether it may be used for commercial purposes.
