- Overview
- Project Structure
- Pipeline Overview
- Data Collection
- Data Extraction
- Data Preprocessing
- Training the Gen AI Model
- Flask Backend
- Frontend with Next.js
- Installation
- Usage
- Future Improvements
- Contributing
- License
GenAIus KT is a Q&A chatbot designed for knowledge management within a company. It assists employees, especially new interns and trainees, in understanding ongoing and previous projects. The chatbot responds to queries related to educational content and project details, making knowledge transfer seamless and efficient.
GenAIus/
├── backend/
│ ├── Data/
│ │ └── (Initial raw data of multiple formats)
│ ├── DataChunks/
│ │ └── (Extracted data chunks from ExtractedRawData.txt)
│ ├── Downloads/
│ │ └── (Files downloaded from MongoDB)
│ ├── AllCleanData.txt
│ ├── ExtractedRawData.txt
│ ├── app.py
│ ├── cleaningChunks.py
│ ├── downloadRawFiles.py
│ ├── embeddings.json
│ ├── environment.yml
│ ├── extractor.py
│ ├── model.py
│ ├── ScrapeHTML.py
│ ├── splittingDataToChunks.py
│ └── uploadRawFiles.py
├── frontend/
│ └── (Next.js files)
├── README.md
└── LICENSE
The pipeline for the GenAIus chatbot consists of several steps:
- Data Collection: Gathering company data from various file formats.
- Data Extraction: Extracting textual data using Python libraries.
- Data Preprocessing: Cleaning and structuring the extracted data using the Gemini AI model.
- Training the Gen AI Model: Creating vector embeddings and training the chatbot.
- Flask Backend: Setting up the backend for handling requests.
- Frontend Development: Building a user-friendly interface using Next.js.
The first step in the pipeline involves collecting data from various company documents, including:
- DOC/DOCX
- Google Docs (.gdoc)
- XLS/XLSX
- Google Sheets
- PPT/PPTX
- Google Slides
- JPG/PNG
- SVG
- CSV
- Markdown (MD)
- TXT/JSON/XML
- HTML
Since company data is often confidential, dummy but realistic data has been created in these formats.
Textual data extraction is performed using several Python libraries, which read the contents of the various file formats and save them to a consolidated text file (ExtractedRawData.txt). The libraries used include:
- os
- docx
- csv
- openpyxl
- PyPDF2
- cv2
- pytesseract
- pptx
- selenium (for web-based data)
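The extraction step can be pictured as a small dispatcher that maps file extensions to extractor functions and appends every result to one consolidated file. The sketch below is illustrative only and is not the project's extractor.py: the function names, the `EXTRACTORS` registry, and the `=== filename ===` separator are assumptions, and it covers only formats readable with the standard library. Extractors for DOCX, XLSX, PDF, PPTX, or images would be registered the same way using the libraries listed above.

```python
import csv
import json
from pathlib import Path


def extract_txt(path: Path) -> str:
    """Return the raw text of a plain-text or Markdown file."""
    return path.read_text(encoding="utf-8", errors="replace")


def extract_csv(path: Path) -> str:
    """Flatten a CSV file into one line per row, cells separated by ' | '."""
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        return "\n".join(" | ".join(row) for row in csv.reader(handle))


def extract_json(path: Path) -> str:
    """Pretty-print JSON so nested keys stay readable as text."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, indent=2, ensure_ascii=False)


# Hypothetical registry: file extension -> extractor function.
EXTRACTORS = {
    ".txt": extract_txt,
    ".md": extract_txt,
    ".csv": extract_csv,
    ".json": extract_json,
}


def extract_all(data_dir: str, output_file: str) -> int:
    """Write the text of every supported file under data_dir to output_file.

    Returns the number of files that were extracted.
    """
    out_path = Path(output_file).resolve()
    # Collect inputs first, skipping the output file if it lives in data_dir.
    paths = [
        p for p in sorted(Path(data_dir).rglob("*"))
        if p.is_file()
        and p.resolve() != out_path
        and p.suffix.lower() in EXTRACTORS
    ]
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = EXTRACTORS[path.suffix.lower()](path)
            out.write(f"=== {path.name} ===\n{text}\n\n")
    return len(paths)
```

Under these assumptions, a call such as `extract_all("Data", "ExtractedRawData.txt")` would produce the consolidated file; unsupported extensions are skipped silently.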
The extracted textual data is preprocessed using the Google Gemini AI model. Given the large dataset, the data is chunked into smaller pieces and processed in batches. The cleaned data is saved into a file called AllCleanData.txt.
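As a rough illustration of chunking and batch processing, the sketch below splits text on paragraph boundaries, groups the chunks into batches, and passes each chunk through a cleaning function. The chunk size, batch size, and all function names are assumptions, and `clean_chunk` is a stand-in that only collapses whitespace. In the project this is where a Gemini request would be made, and that call is not reproduced here.

```python
from typing import Iterator, List


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together when they fit;
    a single paragraph longer than max_chars is cut at the size limit.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split paragraphs that cannot fit in one chunk.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def clean_chunk(chunk: str) -> str:
    """Placeholder cleaner: collapses whitespace only.

    Assumption: the real pipeline would send the chunk to Gemini here.
    """
    return " ".join(chunk.split())


def clean_in_batches(text: str, max_chars: int = 2000, batch_size: int = 5) -> List[str]:
    """Chunk the text, then clean the chunks one batch at a time."""
    cleaned: List[str] = []
    for batch in batched(split_into_chunks(text, max_chars), batch_size):
        cleaned.extend(clean_chunk(chunk) for chunk in batch)
    return cleaned
```

Processing in batches makes it easy to pause between groups of requests or to resume after a failure, which can matter when an API enforces rate limits.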
The project uses a Gemini API key for both the data cleaning and training steps. After cloning or forking the project, replace the placeholder in the .env file with your own Gemini API key.
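One way to read the key at startup and fail early if the placeholder was never replaced is sketched below. It assumes the variable is named GEMINI_API_KEY and that the placeholder value is your_gemini_api_key_here, as in the installation steps. The tiny .env parser and both function names are illustrative; how the project actually loads the file is not stated here.

```python
import os

# Placeholder value shipped in the .env template.
PLACEHOLDER = "your_gemini_api_key_here"


def load_env_file(path: str = ".env") -> None:
    """Load KEY=VALUE lines from a .env file into os.environ.

    Variables that are already set in the environment are left untouched.
    Blank lines, comments, and lines without '=' are ignored.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            value = value.strip().strip('"').strip("'")
            os.environ.setdefault(key.strip(), value)


def get_gemini_api_key() -> str:
    """Return the configured key, or raise if it is missing or still the placeholder."""
    key = os.environ.get("GEMINI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "GEMINI_API_KEY is not set. Add your own key to backend/.env."
        )
    return key
```

Checking for the placeholder explicitly turns a confusing authentication failure later in the pipeline into a clear message at startup.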
Once the data is cleaned, the next step is creating vector embeddings using the Gemini AI model. The chatbot uses these embeddings to retrieve relevant information based on user queries, ensuring it remains focused on its domain.
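Retrieval over embeddings is commonly done by ranking stored chunks by cosine similarity to the query embedding. The sketch below shows that idea with toy vectors. The record layout (a list of dictionaries with `text` and `embedding` keys), the function names, and the `min_score` cutoff are assumptions, not the documented format of embeddings.json, and the query embedding would in practice come from the same Gemini embedding model as the stored vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_embedding: Sequence[float],
    records: List[Dict],
    k: int = 3,
    min_score: float = 0.0,
) -> List[Tuple[float, str]]:
    """Return up to k (score, text) pairs, best match first.

    Records scoring below min_score are dropped, so an off-topic query
    can yield an empty list instead of an unrelated chunk.
    """
    scored = [
        (cosine_similarity(query_embedding, record["embedding"]), record["text"])
        for record in records
    ]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data with 2-dimensional vectors, for illustration only.
records = [
    {"text": "Project Alpha overview", "embedding": [1.0, 0.0]},
    {"text": "Cafeteria menu", "embedding": [0.0, 1.0]},
    {"text": "Project Alpha onboarding notes", "embedding": [1.0, 1.0]},
]
matches = top_k([1.0, 0.0], records, k=2)
```

A similarity threshold is one possible way to keep answers within the chatbot's domain: when no chunk passes the cutoff, the application can decline to answer rather than guess.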
The Flask backend is responsible for connecting the frontend to the chatbot's processing logic. The backend handles requests and responses between the user interface and the AI model.
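A minimal Flask endpoint of this kind might look like the sketch below. The `/chat` route, the `question` and `answer` field names, and the `answer_question` helper are all hypothetical; the routes defined in the project's app.py may differ.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def answer_question(question: str) -> str:
    """Placeholder for the retrieval and generation logic (assumption)."""
    return f"(placeholder answer for: {question})"


@app.route("/chat", methods=["POST"])
def chat():
    """Accept {"question": "..."} as JSON and return {"answer": "..."}."""
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    question = str(payload.get("question", "")).strip()
    if not question:
        return jsonify({"error": "Field 'question' is required."}), 400
    return jsonify({"answer": answer_question(question)})


# To serve locally, one could call: app.run(port=5000)
```

If the Next.js frontend calls the backend directly from the browser on a different port, cross-origin requests would also need to be allowed on the Flask side.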
The user interface is built using Next.js, providing a user-friendly chat interface for employees to interact with the GenAIus chatbot. The frontend design emphasizes accessibility and ease of use.
To set up the project locally, follow these steps:
- Clone the repository:
    git clone https://github.com/Pree-04/Team-GenAIus
    cd GenAIus
- Important: After cloning or forking the project, update the directories and paths in the code to match the local paths where you saved the project files.
- Install backend dependencies:
    cd backend
    pip install -r requirements.txt
- Set up the frontend (from the project root):
    cd frontend
    npm install
- Create a .env file in the backend directory and add your Gemini API key:
    GEMINI_API_KEY=your_gemini_api_key_here
To run the backend server:
    cd backend
    python app.py
To start the frontend:
    cd frontend
    npm run dev
Visit http://localhost:3000 to interact with the chatbot.
- End-to-End Integration: Fully deploy the web application with comprehensive integration of the chatbot to enhance its accessibility.
- Hierarchical Access Control: Implement a feature that restricts access to confidential data based on the employee's position within the organization, so that sensitive information is only accessible to those with the appropriate clearance.
Contributions are welcome! Please create a pull request or open an issue for discussion.
This project is licensed under the MIT License. See the LICENSE file for details.