# LLM Evaluation

This project evaluates AI conversation bots using multiple predefined metrics such as answer relevance, faithfulness, summarization, and more. It includes:

- A Node.js (TypeScript) backend for managing evaluation requests.
- A FastAPI (Python 3.12) service for running evaluations and handling metrics.
- A React frontend for inputting conversation data and displaying results.
- Docker to containerize and orchestrate the services.
## Prerequisites

- Node.js 20 or above
- Python 3.12
- Docker and Docker Compose
## Project Structure

```
.
├── node-be             # Node.js backend for API requests and metrics evaluation
├── py-be               # FastAPI app handling LLM metrics evaluation logic
├── react-fe            # React frontend for user input and displaying results
└── docker-compose.yml  # Docker setup for orchestrating services
```
## Getting Started

Clone the repository:

```bash
git clone https://github.com/pulkit-khullar/llm-evaluation.git
cd llm-evaluation
```

Note: copy `env.example` to `.env` before proceeding further.
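The resulting `.env` might look roughly like the sketch below. The variable names here are purely illustrative assumptions; the authoritative list of keys is in `env.example` itself.

```shell
# Hypothetical .env sketch — illustrative names only; see env.example
# for the real keys used by this project.
OPENAI_API_KEY=sk-...
NODE_BE_PORT=8003
PY_BE_PORT=8004
```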
## Running Locally

### Node.js Backend

- Navigate to `node-be/`.
- Install dependencies:

  ```bash
  npm install
  ```

- Run the Node.js backend locally:

  ```bash
  npm run dev
  ```
### FastAPI Service

- Navigate to `py-be/`.
- Install dependencies using Poetry:

  ```bash
  poetry install
  ```

- Run the FastAPI app locally:

  ```bash
  poetry shell
  python main.py
  ```
### React Frontend

- Navigate to `react-fe/`.
- Install dependencies:

  ```bash
  npm install
  ```

- Run the React app locally:

  ```bash
  npm start
  ```
## Running with Docker

- Build and start all services:

  ```bash
  docker-compose up --build
  ```

- Access the services:
  - Node.js Backend: http://localhost:8003
  - FastAPI Backend: http://localhost:8004
  - React Frontend: http://localhost:3000
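Once the stack is up, you can confirm each service is reachable. A minimal sketch (the helper names are illustrative; the URLs are the ports listed above, and the `/api/health` route is documented in the API reference below):

```python
import urllib.request


def service_urls() -> dict:
    """Base URLs for the three services as published by docker-compose."""
    return {
        "node-be": "http://localhost:8003",
        "py-be": "http://localhost:8004",
        "react-fe": "http://localhost:3000",
    }


def check_health(base_url: str, path: str = "/api/health") -> bool:
    """Return True if the service answers the health route with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    for name, url in service_urls().items():
        print(name, "up" if check_health(url) else "down")
```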
## How It Works

- React Frontend: Enter the conversation history, user question, bot answer, and context. Select the desired metrics for evaluation and submit.
- Node.js Backend: Receives the frontend request and forwards the evaluation tasks to the FastAPI service.
- FastAPI Service: Processes evaluation metrics such as `answer_relevance`, `summarization`, etc., and returns the results to the Node.js backend.
## Supported Metrics

- Answer Relevance
- Bias
- Contextual Relevancy
- Faithfulness
- Hallucination
- Summarization
## API Reference

### Node.js Backend (port 8003)

`GET /api/health` — application health check.

`POST /api/evaluate` — evaluate endpoint. Performs first-level request validation, then forwards the request to the Python service for evaluation.

Request body:

```
conversationHistory: Array of String
userQuestion: String
botAnswer: String
context: String
metrics: Array of String — allowed values: answer_relevance, bias, contextual_relevancy, faithfulness, hallucination, summarization
```

Response body:

```
{
  message: "Evaluation completed successfully",
  result: {
    ...metric names and their scores, with explanations
  }
}
```
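As a sketch of calling the evaluate endpoint from Python (the helper names and the use of the stdlib `urllib` are assumptions; only the route, port, field names, and allowed metric values come from the reference above):

```python
import json
import urllib.request

# Allowed metric identifiers, per the API reference above.
ALLOWED_METRICS = {
    "answer_relevance", "bias", "contextual_relevancy",
    "faithfulness", "hallucination", "summarization",
}


def build_evaluate_payload(conversation_history, user_question,
                           bot_answer, context, metrics):
    """Assemble the request body for POST /api/evaluate, rejecting
    metric names the API does not document."""
    unknown = set(metrics) - ALLOWED_METRICS
    if unknown:
        raise ValueError(f"unsupported metrics: {sorted(unknown)}")
    return {
        "conversationHistory": conversation_history,
        "userQuestion": user_question,
        "botAnswer": bot_answer,
        "context": context,
        "metrics": list(metrics),
    }


def post_evaluate(payload, base_url="http://localhost:8003"):
    """POST the payload to the Node.js backend and return the parsed JSON."""
    req = urllib.request.Request(
        base_url + "/api/evaluate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```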
### FastAPI Service (port 8004)

`GET /api/health` — application health check.

`POST /api/evaluate` — evaluate endpoint. Processes the evaluation for the selected metrics, formats the result, and sends the JSON object to the Node service, which then sends the result back to the React frontend.

Request body:

```
conversationHistory: Array of String
userQuestion: String
botAnswer: String
context: Array of String
metrics: Array of String
```

Response body:

```
{
  metric_name: {
    score: ...result score,
    reason: ...explanation of the result score
  },
  metric_name_2: {
    score: ...result score,
    reason: ...explanation of the result score
  },
  ...
}
```
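The per-metric response shape above can be flattened for display. A small sketch (the function name is illustrative, and the example scores are made up; the `score`/`reason` keys follow the shape documented above):

```python
def summarize_results(results: dict) -> list:
    """Flatten the per-metric {score, reason} mapping into
    (metric, score) pairs, sorted lowest score first."""
    return sorted(
        ((name, data["score"]) for name, data in results.items()),
        key=lambda pair: pair[1],
    )


# Example with the documented shape (scores are illustrative only):
example = {
    "faithfulness": {"score": 0.9, "reason": "Answer grounded in context."},
    "bias": {"score": 0.2, "reason": "Possible biased phrasing detected."},
}
```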
## Notes

- Ensure all services are running for full functionality.
- Adjust port settings in `docker-compose.yml` if conflicts arise.