Retrieval_Augmented_Generation_for_InterviewAI: A Python repository from Ji-eun-Kim

Development of RAG & LLM for AI Interview Service User Answer Evaluation

[Internship] Advancing the AI video interview service 'IM' | Custom RAG/LLM development for evaluating users' competency answers

(2024.04. ~ 2024.06.)

1. Introduction

With the advent of generative AI chatbots built on LLMs (Large Language Models), users can have more personalized conversations and obtain the information they need, but LLMs come with several limitations.

Knowledge cutoff: The model cannot answer questions about data created after its training, because newly created information is not part of what it learned.
No access to private data: The model cannot access confidential, personal, or non-public information held inside a company.
Hallucinations: The model can produce unsubstantiated answers that sound plausible and natural.
General purpose: The model can answer reasonably well across many fields, but it lacks expertise in any specific domain.

This project aims to advance an AI interview service, with a particular focus on improving the evaluation of user answers. Rather than evaluating with an LLM alone, it introduces the RAG method to supply the LLM with more domain-specific information, with the ultimate goal to 'improve the performance of the evaluation of users' answer competency'.

This allows the service to help users prepare for interviews more effectively.

2. Organization

Organization: (주)위드마인드

3. Project Period

2024.03 ~ 2024.06

4. Service Process Steps

This project aims to advance the existing AI interview service 'IM' by introducing the RAG method into its answer evaluation system. By applying RAG to give the LLM more domain-specific information, it ultimately aims to 'improve the performance of the evaluation of users' answer competency'.

The flow of the study is as follows. First, excellent interview answers for each job group, established through expert evaluation, are embedded and stored in a vector DB. Once the DB has been designed, the interviewee's answer text is given as input. The top n contexts most similar to that answer are then retrieved, and the LLM evaluates the interviewee's answer as high, medium, or low, taking the extended prompt as its input. The extended prompt consists of Query + Prompt + n Contexts.
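The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `embed` is a toy bag-of-words stand-in for a real embedding model, `InMemoryVectorDB` is a hypothetical replacement for the vector DB, and the stored answers and the instruction text are invented placeholders.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorDB:
    """Hypothetical in-memory replacement for the vector DB."""

    def __init__(self):
        self.rows = []  # list of (text, vector)

    def add(self, text):
        self.rows.append((text, embed(text)))

    def top_n(self, query, n):
        """Return the n stored texts most similar to the query."""
        query_vec = embed(query)
        scored = sorted(((cosine(query_vec, vec), text) for text, vec in self.rows), reverse=True)
        return [text for _, text in scored[:n]]


def build_extended_prompt(instructions, contexts, query):
    """Extended prompt = Prompt (instructions) + n Contexts + Query."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"{instructions}\n\nContext:\n{context_block}\n\nQuestion:\n{query}"


# Step 1: store excellent example answers (placeholder sentences).
db = InMemoryVectorDB()
db.add("I listen to the customer first and then summarize their needs")
db.add("I prepare data and charts to explain my proposal clearly")
db.add("Hiking on weekends is my favorite hobby")

# Step 2: take the interviewee's answer as input and retrieve the top n contexts.
answer = "I summarize the customer needs after I listen carefully"
contexts = db.top_n(answer, n=2)

# Step 3: build the extended prompt that would be sent to the LLM.
prompt = build_extended_prompt("Evaluate the answer as high, medium, or low.", contexts, answer)
print(prompt)
```

In a real setup the toy `embed` would be replaced by one of the embedding models listed in this section, and the resulting prompt would be passed to the LLM.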

The project used 100 sample answer texts. Data augmentation started from the excellent example answers in the 'IM' DB provided by Incheon National University, keeping only the answers rated 3 points. Because the system being built evaluates answers for the 'sales service' job group, only answers related to the communication competency were extracted. This competency has four sub-categories: clear content composition, use of communication skills, effective exchange of opinions, and accurate communication. Excellent answers for each sub-category were extracted, and further questions and 3-point answers were generated with generative AI (GPT, Gemini). The 100 samples were built with a ratio of 27% clear content composition, 25% use of communication skills, 24% effective exchange of opinions, and 24% accurate communication. The generated answers were re-evaluated with GPT to check whether they rated high or low, and the final 100 samples were then used for the DB design.

For the embedding model, we searched the open-source Korean embedding models available on HuggingFace, drew up a candidate list of well-performing models, and tested a total of 13 models on this domain.

sentence-transformers/paraphrase-multilingual-mpnet-base-v2
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
sentence-transformers/distiluse-base-multilingual-cased-v2
sentence-transformers/stsb-xlm-r-multilingual
jhgan/ko-sroberta-multitask
snunlp/KR-SBERT-V40K-klueNLI-augSTS
bongsoo/moco-sentencedistilbertV2.1
bongsoo/kpf-sbert-128d-v1
M-CLIP/M-BERT-Distil-40
google/canine-c
smartmind/roberta-ko-small-tsdae
BM-K/KoSimCSE-roberta-multitask
jinaai/jina-embeddings-v2-base-en

Each embedding model was assessed through a similarity evaluation, and preference was given to the model that most consistently produced high similarity for the same sentence. For each of the four competency sub-categories, three sentences were sampled at random and used to evaluate similarity.
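One plausible reading of this evaluation is sketched below: sample three sentences per sub-category and average their pairwise cosine similarity under a given embedding function. The exact procedure used in the project is not documented here, so the helper names (`toy_embed`, `mean_pairwise_similarity`, `evaluate_model`), the seed, and the sample sentences are all assumptions for illustration only.

```python
import itertools
import math
import random
from collections import Counter


def toy_embed(text):
    """Toy stand-in for a candidate embedding model (bag-of-words counts)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(sentences, embed_fn):
    """Average cosine similarity over all pairs of the given sentences."""
    vectors = [embed_fn(s) for s in sentences]
    pairs = list(itertools.combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def evaluate_model(embed_fn, answers_by_category, k=3, seed=0):
    """Sample k sentences per category and score their mean pairwise similarity."""
    rng = random.Random(seed)  # fixed seed so repeated runs sample the same sentences
    return {
        category: mean_pairwise_similarity(rng.sample(answers, k), embed_fn)
        for category, answers in answers_by_category.items()
    }


# Placeholder answers; the real samples came from the project's DB.
sample_answers = {
    "clear content composition": [
        "I organize my points with data before presenting",
        "I structure my opinion with data and logic",
        "I present my opinion in a clear order",
    ],
    "use of communication skills": [
        "I use charts and gestures to explain",
        "I use facial expressions and data to explain",
        "I explain with gestures so it is easy to follow",
    ],
    "effective exchange of opinions": [
        "I share my opinion and ask for feedback",
        "I explain my opinion based on the core problem",
        "I keep sharing my opinion through proper procedures",
    ],
    "accurate communication": [
        "I state my opinion concisely and check understanding",
        "I check whether others understand my opinion",
        "I state the core content concisely",
    ],
}

scores = evaluate_model(toy_embed, sample_answers)
for category, score in scores.items():
    print(f"{category}: {score:.3f}")
```

To compare candidates, the same function would be called once per embedding model with `toy_embed` swapped for that model's encoder.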

The top-ranked model (Model 1) showed a problem: even when an answer from a domain entirely different from the DB was added, its similarity still came out high. Therefore, the second-ranked model, google/canine-c from HuggingFace, was selected as the final embedding model. In the retrieval step, only the top 10 answers among those with a similarity of 0.92 or higher are passed to the prompt, and the extended prompt was set as follows.

''You are an interviewer who evaluates the communication skills of interviewees applying to the sales team. The question is the interviewee's answer. The context is the top n entries from the DB, each of which corresponds to 3 points.

Please refer to the DB and evaluate the interviewee's answer as high, medium, or low. There are four categories of communication skills: 1. Use of communication skills 2. Accurate communication 3. Effective exchange of opinions 4. Clear content composition. From now on, I will present the evaluation criteria for each category. 1 point corresponds to low, 2 points to medium, and 3 points to high.

Effective exchange of opinions
1 point: Can use communication procedures properly to express and convey opinions consistently. However, the ability to explain opinions based on the core of the problem, or to improve processes to facilitate communication within the organization, is relatively insufficient.
2 points: Can use communication procedures properly to express and convey opinions consistently, and can explain opinions based on the core of the problem. However, the ability to improve processes to facilitate communication within the organization is relatively insufficient.
3 points: Can use communication procedures properly to express and convey opinions consistently, based on the core of the problem. Is also judged to have the ability to improve processes to facilitate communication within the organization.
Use of communication skills
1 point: Uses appropriate data to communicate in a way that is easier for the other person to understand. However, the ability to actively use non-verbal means of communication such as facial expressions and gestures, or to create a smooth communication culture by improving the organization's communication methods, is relatively insufficient.
2 points: Actively uses appropriate data and non-verbal means such as facial expressions and gestures to convey intentions effectively and in an easy-to-understand way. However, the ability to improve the organization's communication methods and create a smooth communication culture is relatively insufficient.
3 points: Uses appropriate data and non-verbal means such as facial expressions and gestures to convey intentions effectively and in an easy-to-understand way. Is also judged to have the ability to create a smooth communication culture by improving the organization's communication methods.
Accurate communication
1 point: Can communicate opinions concisely and clearly. However, the ability to use appropriate communication methods, check whether others have understood, and create an atmosphere of accurate communication within the organization is judged to be relatively insufficient.
2 points: Communicates opinions concisely and clearly, and uses appropriate communication methods depending on the situation. However, the ability to check whether others have understood and to create an atmosphere of accurate communication within the organization is judged to be relatively insufficient.
3 points: Uses appropriate communication methods depending on the situation and communicates opinions concisely and clearly. Is judged able to convey the core content clearly, confirm that members have understood it correctly, and create an atmosphere of accurate communication within the organization.
Clear content composition
1 point: Uses various data to organize and present opinions. However, the ability to structure and deliver the intended message, and then check whether it is being delivered accurately in various contexts, is relatively insufficient.
2 points: Structures and communicates opinions based on sufficient data and logic. However, the ability to check that the intended message is being delivered accurately in various contexts is judged to be relatively insufficient.
3 points: Can structure and communicate opinions based on sufficient data and logic. Is also judged to have the ability to check that the intended message is being delivered accurately in various contexts.
From now on, 1. rate the answer as high, medium, or low, and 2. tell me in detail why you rated it that way.
3. Also, tell me the number of answers returned by the retriever.

If you don't know the answer, just say you don't know. Don't make it up.

Evaluation:
Reason :
Number of top n answers:

If the number of top n answers is 0, output 0 in item 3. In that case the answer is 'low', so rate it as 'low' in the evaluation section.

If there is nothing to return, that is, if the retrieved list is empty, output 'low' in the evaluation section. Even if the answer is "I don't know", output 'low' in the evaluation section.

Please output your response according to the form above.''
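The retrieval rule stated before the prompt (keep only answers with similarity of 0.92 or higher, then take at most the top 10) and the prompt's rule for zero retrieved answers can be sketched as below. The function names and the idea of applying the zero-context rule in code, rather than leaving it to the LLM, are assumptions made for illustration. The lowest grade is written here as "low", which corresponds to 'ha' in the original wording.

```python
SIMILARITY_THRESHOLD = 0.92  # value stated in the text
TOP_K = 10                   # value stated in the text


def select_contexts(scored_answers, threshold=SIMILARITY_THRESHOLD, top_k=TOP_K):
    """Keep (similarity, text) pairs at or above the threshold, best first, at most top_k."""
    kept = [(score, text) for score, text in scored_answers if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


def fallback_evaluation(contexts):
    """Assumption: apply the zero-context rule in code instead of inside the prompt.

    Returns a result dict when no context was retrieved, otherwise None,
    meaning the extended prompt should be sent to the LLM as usual.
    """
    if contexts:
        return None
    return {"evaluation": "low", "reason": "no retrieved answers", "num_top_n": 0}


# Placeholder similarity scores for four stored answers.
scored = [(0.95, "answer a"), (0.91, "answer b"), (0.93, "answer c"), (0.80, "answer d")]
contexts = select_contexts(scored)
print(contexts)                     # the two answers scoring 0.95 and 0.93
print(fallback_evaluation(contexts))  # None, so the LLM would be called
print(fallback_evaluation([]))        # the 'low' result with 0 retrieved answers
```

Handling the empty case in code is only one option. The original prompt instead asks the LLM itself to output 0 and the lowest grade.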

To measure the effect of RAG, the evaluation was run both with the LLM alone and with RAG applied, and the RAG-based results were found to justify their evaluations with more valid reasons. However, the results differed slightly from run to run, and some evaluations were still given for invalid reasons. The LLM used was GPT-3.5 Turbo, and the project concluded with the judgment that a better-performing LLM would yield better results.
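Since the results varied between runs, it can help to parse each response into the three fields requested by the prompt form (Evaluation, Reason, Number of top n answers) and tally the grades across runs. The sketch below assumes the LLM follows that form line by line. The function names and the sample responses are invented for illustration and are not outputs from the project.

```python
import re
from collections import Counter

# One pattern per field of the form; [ \t]* avoids matching across line breaks.
FIELD_PATTERNS = {
    "evaluation": re.compile(r"^Evaluation[ \t]*:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE),
    "reason": re.compile(r"^Reason[ \t]*:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE),
    "num_top_n": re.compile(r"^Number of top n answers[ \t]*:[ \t]*(\d+)", re.IGNORECASE | re.MULTILINE),
}


def parse_response(text):
    """Extract the three form fields; a field that is missing is returned as None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    if result["num_top_n"] is not None:
        result["num_top_n"] = int(result["num_top_n"])
    return result


def tally_grades(responses):
    """Count how often each grade appears across repeated runs."""
    return Counter(parse_response(r)["evaluation"] for r in responses)


# Invented example responses standing in for repeated runs on the same answer.
runs = [
    "Evaluation: high\nReason : The answer is structured and backed by data.\nNumber of top n answers: 7",
    "Evaluation: medium\nReason : The answer is clear but does not check understanding.\nNumber of top n answers: 7",
    "Evaluation: high\nReason : The answer uses data and a clear structure.\nNumber of top n answers: 7",
]

parsed = parse_response(runs[0])
print(parsed)
print(tally_grades(runs))
```

A tally like this makes run-to-run variation visible when comparing the LLM-only setup with the RAG setup.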

5. Demo Video

6. Reflections

This project used the Naive RAG methodology to design an initial beta model, so its performance was not perfect. Even so, over a long period of four months I gained hands-on experience with project planning, creating a WBS, model experimentation and development, and FastAPI integration. If I have time for further development, I would like to revise the system and build one with better performance.

7. Team members and responsible roles

This was a personal project, in which I was responsible for the following:

Create a project plan
Create WBS
Model Experiment and Development
FastAPI server integration
