This repository contains a voice-based AI assistant that uses speech recognition for input and text-to-speech for responses. The application consists of a React frontend for user interaction and a backend that leverages various AI services for processing and responding to queries.
https://drive.google.com/file/d/1Br5arurH0bYXXIDY2zXVXFr-4myQ1d-5/view?usp=drivesdk
The frontend is built with React and TypeScript, featuring a voice-based interface that:
- Captures user speech via the microphone
- Displays the transcribed text
- Sends the text to the backend for processing
- Receives and speaks the AI response
The application uses the Web Speech API, specifically:
- SpeechRecognition API (`webkitSpeechRecognition`):

```typescript
// Create a speech recognition instance
const recognition = new (window as any).webkitSpeechRecognition();

// Configuration
recognition.lang = "en-US";
recognition.continuous = true;
recognition.interimResults = true;
```

- Speech Synthesis API (`window.speechSynthesis`):

```typescript
// Access the speech synthesis API
const synth = window.speechSynthesis;

// Create an utterance and configure it
const utterance = new SpeechSynthesisUtterance(text);
utterance.lang = "en-US";
utterance.rate = 1;
utterance.pitch = 1;

// Speak the text
synth.speak(utterance);
```

Key APIs used:

- `webkitSpeechRecognition`: Browser API for speech recognition
- `recognition.start()`: Begins listening for speech
- `recognition.stop()`: Stops listening for speech
- `recognition.onstart`: Event handler fired when recording begins
- `recognition.onresult`: Event handler fired when speech is recognized
- `recognition.onerror`: Event handler for recognition errors
- `recognition.onend`: Event handler fired when recording ends
- `window.speechSynthesis`: Browser API for text-to-speech
- `SpeechSynthesisUtterance`: Creates a speech synthesis request
- `synth.speak()`: Speaks the provided text
- `synth.cancel()`: Stops any ongoing speech
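As a reference for how these pieces fit together, here is a minimal sketch of wiring up the recognition events; the actual component logic in this repository may differ:

```typescript
// Minimal sketch: configure recognition and hook up its event handlers.
const recognition = new (window as any).webkitSpeechRecognition();
recognition.lang = "en-US";
recognition.continuous = true;
recognition.interimResults = true;

recognition.onstart = () => console.log("Listening...");

recognition.onresult = (event: any) => {
  // Concatenate the transcripts of all results received so far
  const transcript = Array.from(event.results)
    .map((result: any) => result[0].transcript)
    .join("");
  console.log("Transcript:", transcript);
};

recognition.onerror = (event: any) => console.error("Recognition error:", event.error);
recognition.onend = () => console.log("Stopped listening");

recognition.start(); // begin capturing audio from the microphone
```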
The backend uses several API keys and environment variables, configured in a `.env` file:

```
GEMINI_API_KEY=
COHERE_API_KEY=
PINECONE_API_KEY=
PINECONE_INDEX=
PINECONE_ENVIRONMENT=
PINECONE_HOST=
```
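One possible pattern, shown only as a sketch, is to load these with `dotenv` and fail fast at startup if a required key is missing:

```typescript
// Sketch: validate required environment variables at startup (assumes dotenv is installed).
import "dotenv/config";

const required = [
  "GEMINI_API_KEY",
  "COHERE_API_KEY",
  "PINECONE_API_KEY",
  "PINECONE_INDEX",
] as const;

for (const name of required) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}
```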
- Data Source: JSON format data
- Chunking: The JSON data is split into smaller chunks for efficient processing
- Vector Database: Chunks are stored in Pinecone (vector database)
- Semantic Search: When a query is received, the system performs semantic search in Pinecone
- LLM Response: Gemini AI model generates responses based on the top semantic search results
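The exact backend modules aren't shown here, but the retrieval-and-generation flow described above could look roughly like the sketch below. It assumes the official `@pinecone-database/pinecone`, `cohere-ai`, and `@google/generative-ai` clients; the embedding model, the Gemini model name, and the `text` metadata field on each chunk are illustrative placeholders.

```typescript
import { Pinecone } from "@pinecone-database/pinecone";
import { CohereClient } from "cohere-ai";
import { GoogleGenerativeAI } from "@google/generative-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const gemini = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

// Hypothetical helper: embed the query, retrieve the top chunks, ask Gemini.
export async function answerQuery(query: string): Promise<string> {
  // 1. Embed the user query (embedding model is an assumed example)
  const embedRes = await cohere.embed({
    texts: [query],
    model: "embed-english-v3.0",
    inputType: "search_query",
  });
  const vector = (embedRes.embeddings as number[][])[0];

  // 2. Semantic search in Pinecone for the most similar chunks
  const index = pinecone.index(process.env.PINECONE_INDEX!);
  const results = await index.query({ vector, topK: 5, includeMetadata: true });
  const context = results.matches
    .map((m) => String(m.metadata?.text ?? ""))
    .join("\n---\n");

  // 3. Ask Gemini to answer using only the retrieved context
  const model = gemini.getGenerativeModel({ model: "gemini-1.5-flash" });
  const prompt = `Answer the question using the context below.\n\nContext:\n${context}\n\nQuestion: ${query}`;
  const result = await model.generateContent(prompt);
  return result.response.text();
}
```

Restricting the prompt to the retrieved chunks is what grounds the Gemini response in the original JSON data.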
The backend is built with TypeScript and Node.js. To run it:

- Install dependencies:

  ```bash
  npm install
  ```

- For development (uses tsx to run the TypeScript code directly):

  ```bash
  npm run dev
  ```

- For production (compiles the TypeScript code to JavaScript, then runs it):

  ```bash
  npm run build
  npm run start
  ```
- User speaks into the microphone on the frontend
- Speech is converted to text using the Web Speech API
- Text is sent to the backend API
- Backend performs semantic search in Pinecone using the query
- Top search results are sent to Gemini AI to generate a response
- Response is sent back to the frontend
- Frontend converts the text response to speech using the Speech Synthesis API
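Putting the frontend half of this flow together, a simplified sketch might look like the following; the `/api/query` endpoint path and the `{ answer }` response shape are assumptions rather than the repository's actual contract:

```typescript
// Sketch: send the transcript to the backend and speak the reply.
async function askAssistant(transcript: string): Promise<void> {
  const res = await fetch("/api/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: transcript }),
  });
  const { answer } = (await res.json()) as { answer: string };

  // Speak the AI response with the Speech Synthesis API
  const utterance = new SpeechSynthesisUtterance(answer);
  utterance.lang = "en-US";
  window.speechSynthesis.speak(utterance);
}
```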
- Clone the repository
- Navigate to the frontend directory
- Install dependencies:

  ```bash
  npm install
  ```

- Start the development server:

  ```bash
  npm start
  ```

- Navigate to the backend directory
- Create a `.env` file with the environment variables listed above
- Install dependencies:

  ```bash
  npm install
  ```

- Run the development server:

  ```bash
  npm run dev
  ```
The Web Speech API is not supported in all browsers. For best results, use:
- Chrome
- Edge
- Safari (partial support)
Firefox and some mobile browsers may have limited or no support for the speech recognition features.
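A simple feature-detection check lets the UI degrade gracefully on unsupported browsers, for example:

```typescript
// Feature detection sketch: warn (or fall back) when the Web Speech API is missing.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

if (!SpeechRecognitionImpl) {
  console.warn("Speech recognition is not supported in this browser.");
}
if (!("speechSynthesis" in window)) {
  console.warn("Speech synthesis is not supported in this browser.");
}
```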