/ai_voicr_bot

This repository contains a voice-based AI assistant that uses speech recognition for input and text-to-speech for responses. It consists of a React frontend for user interaction and a Node.js backend that runs semantic search over data stored in Pinecone and uses Gemini to generate answers.

App demo video link

https://drive.google.com/file/d/1Br5arurH0bYXXIDY2zXVXFr-4myQ1d-5/view?usp=drivesdk

Frontend Setup

Overview

The frontend is built with React and TypeScript, featuring a voice-based interface that:

  • Captures user speech via the microphone
  • Displays the transcribed text
  • Sends the text to the backend for processing
  • Receives and speaks the AI response

Voice Recognition Implementation

The application uses the Web Speech API, specifically:

  1. SpeechRecognition API (webkitSpeechRecognition):
// Creates a speech recognition instance
const recognition = new (window as any).webkitSpeechRecognition();

// Configuration
recognition.lang = "en-US";
recognition.continuous = true;
recognition.interimResults = true;
  2. Speech Synthesis API (window.speechSynthesis):
// Access the speech synthesis API
const synth = window.speechSynthesis;

// Create an utterance and configure it
const utterance = new SpeechSynthesisUtterance(text);
utterance.lang = "en-US";
utterance.rate = 1;
utterance.pitch = 1;

// Speak the text
synth.speak(utterance);

Key Browser APIs Used

  • webkitSpeechRecognition: Browser API for speech recognition
  • recognition.start(): Begins listening for speech
  • recognition.stop(): Stops listening for speech
  • recognition.onstart: Event handler when recording begins
  • recognition.onresult: Event handler when speech is recognized
  • recognition.onerror: Event handler for recognition errors
  • recognition.onend: Event handler when recording ends
  • window.speechSynthesis: Browser API for text-to-speech
  • SpeechSynthesisUtterance: Creates a speech synthesis request
  • synth.speak(): Speaks the provided text
  • synth.cancel(): Stops any ongoing speech
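
Because interimResults is enabled, the onresult handler receives a mix of finalized and still-changing text. The following is a minimal sketch of how those two kinds of text could be separated. The helper name splitTranscripts and the "Like" types are hypothetical, shaped after the results and resultIndex fields of the Web Speech API recognition event; they are not taken from this repository.

```typescript
// Minimal structural types mirroring the parts of the recognition event
// that the handler reads. These names are illustrative, not from the repo.
interface AlternativeLike {
  transcript: string;
}

interface ResultLike {
  isFinal: boolean;
  length: number;
  [index: number]: AlternativeLike;
}

interface ResultListLike {
  length: number;
  [index: number]: ResultLike;
}

// Walks the results that changed since the last event and separates
// finalized text from still-changing interim text.
function splitTranscripts(
  results: ResultListLike,
  resultIndex: number
): { finalText: string; interimText: string } {
  let finalText = "";
  let interimText = "";
  for (let i = resultIndex; i < results.length; i++) {
    const result = results[i];
    // First alternative; the only one when maxAlternatives keeps its default of 1.
    const best = result[0].transcript;
    if (result.isFinal) {
      finalText += best;
    } else {
      interimText += best;
    }
  }
  return { finalText, interimText };
}

// Hypothetical usage inside the browser:
// recognition.onresult = (event: any) => {
//   const { finalText, interimText } = splitTranscripts(event.results, event.resultIndex);
//   // show interimText live; send finalText to the backend
// };
```

Keeping the extraction logic in a plain function, separate from the event handler, makes it possible to exercise it with mock objects outside a browser.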

Backend Setup

Environment Variables

The backend reads the following API keys and settings from environment variables:

GEMINI_API_KEY=
COHERE_API_KEY=

PINECONE_API_KEY=
PINECONE_INDEX=
PINECONE_ENVIRONMENT=
PINECONE_HOST=
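
A missing key usually surfaces only when the first request fails, so it can help to check the variables once at startup. The sketch below assumes all six variables are required, which this README does not state; loadConfig and BackendConfig are hypothetical names, not part of the repository.

```typescript
// The variable names come from the list above; everything else is illustrative.
const REQUIRED_KEYS = [
  "GEMINI_API_KEY",
  "COHERE_API_KEY",
  "PINECONE_API_KEY",
  "PINECONE_INDEX",
  "PINECONE_ENVIRONMENT",
  "PINECONE_HOST",
] as const;

type ConfigKey = (typeof REQUIRED_KEYS)[number];
type BackendConfig = Record<ConfigKey, string>;

// Reads the required variables from an env-like object and throws a single
// error naming every variable that is missing or empty.
function loadConfig(env: Record<string, string | undefined>): BackendConfig {
  const missing = REQUIRED_KEYS.filter((key) => !env[key]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as BackendConfig;
  for (const key of REQUIRED_KEYS) {
    config[key] = env[key] as string;
  }
  return config;
}

// Hypothetical usage at startup, after the .env file has been loaded:
// const config = loadConfig(process.env);
```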

Data Processing Pipeline

  1. Data Source: The source data is stored in JSON format
  2. Chunking: The JSON data is split into smaller chunks for efficient processing
  3. Vector Database: Chunks are stored in Pinecone (vector database)
  4. Semantic Search: When a query is received, the system performs semantic search in Pinecone
  5. LLM Response: Gemini AI model generates responses based on the top semantic search results
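
The chunking and semantic search steps above can be illustrated with a small in-memory sketch. It is a stand-in, not the repository's code: in the real pipeline the vectors would come from an embedding model and the search would be a Pinecone query, whereas here chunkText, cosineSimilarity, and topK are hypothetical helpers that show the underlying idea with plain arrays.

```typescript
// A stored chunk: its text plus an embedding vector. In the real pipeline the
// vector would come from an embedding model and live in Pinecone.
interface Chunk {
  id: string;
  text: string;
  vector: number[];
}

// Splits text into chunks of at most maxChars characters, breaking on
// whitespace between words. A single word longer than maxChars is kept whole.
function chunkText(text: string, maxChars: number): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current ? `${current} ${word}` : word;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Cosine similarity between two equal-length vectors; returns 0 when either
// vector has zero magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k chunks most similar to the query vector, best first.
function topK(queryVector: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

A vector database performs the same ranking at scale with approximate indexes; the brute-force version here is only meant to make the ranking step concrete.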

Running the Backend

The backend is built with TypeScript and Node.js. To run it:

  1. Install dependencies:
npm install
  2. For development:
npm run dev

This uses tsx to run the TypeScript code directly.

  3. For production:
npm run build
npm run start

This compiles the TypeScript code to JavaScript and then runs it.

How It All Works Together

  1. User speaks into the microphone on the frontend
  2. Speech is converted to text using the Web Speech API
  3. Text is sent to the backend API
  4. Backend performs semantic search in Pinecone using the query
  5. Top search results are sent to Gemini AI to generate a response
  6. Response is sent back to the frontend
  7. Frontend converts the text response to speech using the Speech Synthesis API
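
Steps 4 to 6 of the flow above might look roughly like the following on the backend. This is a sketch under stated assumptions: the search and generate functions are injected as hypothetical stand-ins for the Pinecone query and the Gemini call, so no real SDK signature is implied, and the prompt wording is invented for illustration.

```typescript
// Injected dependencies so the flow can be shown without naming any real SDK
// call. Both function types are hypothetical stand-ins.
interface PipelineDeps {
  // Returns the text of the top-matching chunks for a query (Pinecone's role).
  search: (query: string, topK: number) => Promise<string[]>;
  // Sends a prompt to the language model and returns its reply (Gemini's role).
  generate: (prompt: string) => Promise<string>;
}

// Combines the retrieved chunks and the user's question into one prompt.
function buildPrompt(query: string, contexts: string[]): string {
  const contextBlock = contexts
    .map((text, i) => `[${i + 1}] ${text}`)
    .join("\n");
  return `Answer using only the context below.\n\nContext:\n${contextBlock}\n\nQuestion: ${query}`;
}

// Steps 4 to 6: search, generate, and return the text for the frontend to speak.
async function answerQuery(
  query: string,
  deps: PipelineDeps,
  topK = 3
): Promise<string> {
  const contexts = await deps.search(query, topK);
  const prompt = buildPrompt(query, contexts);
  return deps.generate(prompt);
}
```

Passing the two services in as functions keeps the orchestration independent of any particular client library and lets it run against fakes in tests.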

Setup Instructions

Frontend Setup

  1. Clone the repository
  2. Navigate to the frontend directory
  3. Install dependencies:
npm install
  4. Start the development server:
npm start

Backend Setup

  1. Navigate to the backend directory
  2. Create a .env file with the environment variables listed above
  3. Install dependencies:
npm install
  4. Run the development server:
npm run dev

Browser Compatibility

The Web Speech API is not supported in all browsers. For best results, use:

  • Chrome
  • Edge
  • Safari (partial support)

Firefox and some mobile browsers may have limited or no support for the speech recognition features.
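
Given the uneven support described above, the frontend could check for the APIs before showing voice controls. The sketch below is one possible approach; detectSpeechSupport and SpeechSupport are hypothetical names, and the function takes a window-like object as a parameter so the check can also run outside a browser.

```typescript
// Reports which halves of the Web Speech API a window-like object exposes.
interface SpeechSupport {
  recognition: boolean;
  synthesis: boolean;
}

function detectSpeechSupport(win: Record<string, unknown>): SpeechSupport {
  return {
    // Recognition may be exposed with or without the webkit prefix.
    recognition: "SpeechRecognition" in win || "webkitSpeechRecognition" in win,
    // Synthesis needs both the controller and the utterance constructor.
    synthesis: "speechSynthesis" in win && "SpeechSynthesisUtterance" in win,
  };
}

// Hypothetical usage in the browser:
// const support = detectSpeechSupport(window as unknown as Record<string, unknown>);
// if (!support.recognition) { /* hide the microphone button, offer text input */ }
```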