aishe.ai Core
The goal of aishe.ai is to provide a solution for small and medium enterprises in Europe to use LLM-based AI in a manner compliant with the GDPR and with typical corporate privacy concerns.
aishe.ai allows teams to ask questions in natural language about the organization, projects, processes, and applications. To achieve this, aishe.ai scrapes the company's information systems, such as Confluence, documents, and Git repositories. It serves as a virtual team assistant, distinguishing itself from a personal assistant.
- “Build your secure ChatGPT with your own data”
- Autonomous Tool Usage:
- Google search
- Visit websites and scrape entire sitemaps
- Firecrawl
- Image Generation:
- DALL-E
- Multi-modal capabilities, allowing it to answer questions about images
- File Translation:
- Translate files like PDFs using DeepL
- GitHub Integration:
- Access and regularly synchronize with internal knowledge sources
- GitHub repository: aishe-ai/airbyte (Airbyte Fork)
- Chat Integration:
- Slack
- User Feedback:
- Users can provide feedback and rate/comment on LLM outputs
- Integrated with Langfuse
- Environment Support:
- Supports different environments as long as Docker Compose or Kubernetes is available
- Support for More Data Sources: Expanding to include data sources not yet supported by Airbyte.
- Develop Custom Airbyte Sources: Creating proprietary Airbyte sources to meet unique requirements.
- Self-Service Portal:
- Configuration: Allow users to configure their own settings.
- Monitoring: Provide tools for users to monitor their data and AI interactions.
- Setup: Simplify the setup process for new users.
- Project-Based Assistant: Tailoring the assistant to work on a per-project basis rather than for the entire organization.
- Overview Queries: For example, answering questions like "What security vulnerabilities does this project have?"
- Data Integration: Combine data from various sources to derive actions or questions.
- Git Tool Integration: Boris is interested in integrating his Git tool.
- Email Drafting: Automate the creation of email drafts.
- Meeting Transcript Integration: Connect meeting transcript sources like Fireflies to Airbyte.
- Local LLMs: Use local LLMs instead of cloud providers to enhance privacy and control.
- Further Chat Integrations: Expand chat integrations to include platforms like Teams and Zoom.
- Additional Data Sources:
- SharePoint: Integrate with SharePoint.
- RBAC-Compliant Vector Table: Implement a vector table that respects the role-based access control (RBAC) of the source.
flowchart TD
subgraph ChatInteraction [Chat, e.g. Slack]
User-->|Asks questions| LLM-Agent
LLM-Agent -->|Responds| Response[Response]
Response -->|User Reacts Negatively| UserReaction[User Reaction]
UserReaction -->|Stores Reaction for Optimization| Langfuse[Langfuse]
end
subgraph LLMOrchestration [LLM Orchestration]
Langchain -->|Orchestrates| LLM-Agent[LLM Agent]
Langchain -->|Provides| LLMTools
LLM-Agent -->|Can use | LLMTools
LLM-Provider -->|Provides | LLM-Agent
subgraph LLMTools [LLM Tools]
PGVector
WebSearch[Web Search]
ImageGen[Image Generation]
DocTranslation[Document Translation]
end
end
subgraph DataIntegration [Data Integration]
DataSources[Data Sources, e.g. Sharepoint] -->|Ingested by| Airbyte[Airbyte]
Airbyte -->|Calls with Source Data | LLM-Provider[LLM Provider, e.g. openai]
LLM-Provider -->|Embeddings| Airbyte
Airbyte -->|Stores Embeddings and Source Data| PGVector[PGVector Database]
end
Platform -->|Auto-Setup/Usage| ChatInteraction
Platform -->|Configures/Status| DataIntegration
Platform -->|Configures| LLMOrchestration
Admin-User -->|Uses| Platform
- Create a new Slack app in the Slack app dashboard.
- Modify the manifest below to your needs/config. The following fields are required to change:
- `request_url` and `url` — set these to your instance URL
{
"display_information": {
"name": "aishe.ai",
"description": "Assistant LLM",
"background_color": "#070708",
"long_description": "Chat Assistant"
},
"features": {
"bot_user": {
"display_name": "aisheAI",
"always_online": true
},
"slash_commands": [
{
"command": "/aishe-health-check",
"url": "https://$DOMAIN/healthcheck",
"description": "Check if the backend services are running",
"should_escape": false
},
{
"command": "/aishe-example-prompts",
"url": "https://$DOMAIN/example-prompts/",
"description": "Shows examples prompts",
"should_escape": false
}
]
},
"oauth_config": {
"scopes": {
"bot": [
"channels:history",
"channels:read",
"chat:write",
"commands",
"files:read",
"files:write",
"groups:history",
"groups:read",
"mpim:read",
"remote_files:write",
"users:read"
]
}
},
"settings": {
"event_subscriptions": {
"request_url": "https://$DOMAIN/slack/event/",
"bot_events": [
"member_joined_channel",
"message.groups"
]
},
"interactivity": {
"is_enabled": true,
"request_url": "https://$DOMAIN/slack/rating/"
},
"org_deploy_enabled": false,
"socket_mode_enabled": false,
"token_rotation_enabled": false
}
}
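Before installing the manifest, it is worth checking that every `$DOMAIN` placeholder has been replaced with your instance URL. A minimal stdlib-only sketch (the manifest filename is an assumption):

```python
import json


def unresolved_urls(manifest: dict) -> list:
    """Collect every url/request_url value that still contains the $DOMAIN placeholder."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in ("url", "request_url") and "$DOMAIN" in str(value):
                    found.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(manifest)
    return found


# Example (hypothetical file name):
# leftover = unresolved_urls(json.load(open("slack_manifest.json")))
# if leftover: raise SystemExit(f"Replace $DOMAIN in: {leftover}")
```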
- Install Python 3.10, or set your Python version to it.
- Install necessary system packages:
sudo apt install postgresql postgresql-contrib
sudo apt-get install python3.10-dev
sudo apt-get install --reinstall libpq-dev
- Make sure that `LD_LIBRARY_PATH` is set; use `sudo find / -name "libpq.so.5" 2>/dev/null` to find the library.
- Copy `.env.example` to `.env` and modify the content.
- Install `tesseract-ocr` for your system.
- Install Python dependencies; see the Poetry section.
- Install Chromium and other dependencies:
pip install -q playwright beautifulsoup4
playwright install
- Create an ngrok domain.
- Install ngrok.
- Set up ngrok agent auth.
- Set up Google access for the LLM and add the keys to `.env`.
- Set up Langsmith in `.env`.
- Set up Langfuse and its needed envs.
- Set up Firecrawl and its needed envs.
- Start FastAPI with either:
uvicorn app:app --reload
or
python3.10 -m uvicorn app:app --reload --port 8888
- Start ngrok (the domain must match the one used when creating the Slack app):
ngrok http --domain=DOMAIN 8000
Poetry is a tool for dependency management and packaging in Python. It helps to manage project dependencies, build and publish packages, and ensure reproducibility.
- Install Poetry:
curl -sSL https://install.python-poetry.org | python3 -
- Install project dependencies:
poetry install
- Activate the virtual environment created by Poetry:
poetry shell
- Follow steps 0-8 from the Conventional Setup to set up the environment variables and required services.
- Build the core image:
docker build -t aishe-ai-core .
- Optionally, decide whether you want (and are able) to run Airbyte:
- Run the Docker Compose stack:
docker compose -f dev-docker-compose.yaml -p unified_aishe_ai up
- Follow steps 0-8 from the Conventional Setup to set up the environment variables and required services.
- Start the production environment with Docker Compose:
docker-compose -f prod-docker-compose.yaml --env-file .env -p aishe_ai up
- Public image repository
- Run the Docker image:
docker run -d -p 80:80 --env-file .env aishe-ai
- If the browser is not starting (e.g., within the `webpage_tool`):
- Add `args=["--disable-gpu"]` to the browser launch parameters. Example:
browser = await p.chromium.launch(headless=True, args=["--disable-gpu"])
- This issue is commonly observed on WSL2 systems.
- Use `black` for code formatting:
black .
- Ensure each folder has an `__init__.py` file. If unsure, run:
find . -type d \( -path './.venv' -o -path './__pycache__' -o -path './downloads' -o -path './sql' \) -prune -o -type d -exec sh -c 'for dir; do [ ! -f "$dir/__init__.py" ] && touch "$dir/__init__.py" && echo "Created: $dir/__init__.py"; done' sh {} +
- Run the module:
python -m llm.vectorstores.pgvector.non_rbac
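The `find` one-liner above can also be expressed in Python; a rough stdlib equivalent (same skip list, and like the shell version it only creates files that are missing):

```python
from pathlib import Path

# Directories the shell one-liner prunes.
SKIP_DIRS = {".venv", "__pycache__", "downloads", "sql"}


def ensure_init_files(root: str = ".") -> list:
    """Create a missing __init__.py in every package directory under root."""
    created = []
    for directory in Path(root).rglob("*"):
        if not directory.is_dir():
            continue
        # Skip anything inside a pruned directory.
        if SKIP_DIRS & set(directory.parts):
            continue
        init_file = directory / "__init__.py"
        if not init_file.exists():
            init_file.touch()
            created.append(init_file)
    return created
```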
Data Integration
- Example: Airbyte is used to seamlessly integrate various data sources into aishe.ai. For instance, it can connect to a company’s Confluence space, Git repositories, and document storage to aggregate all relevant data.
- Usage in aishe.ai: Airbyte facilitates the extraction, transformation, and loading (ETL) of data from disparate sources, ensuring that aishe.ai has access to up-to-date and comprehensive information for generating accurate responses and insights.
User Feedback and Interaction Management
- Example: Langfuse is used to collect and manage user feedback on AI-generated outputs. For example, after aishe.ai provides an answer or generates a document, users can rate the response and provide comments.
- Usage in aishe.ai: Langfuse helps in gathering user feedback, which is crucial for continuous improvement of the AI models. It allows aishe.ai to adapt and refine its responses based on real user interactions and feedback.
Building Complex AI Workflows
- Example: Langchain is used to create complex workflows where multiple AI models and tools are orchestrated to solve specific tasks. For instance, generating a project report might involve data retrieval, natural language processing, and summarization steps.
- Usage in aishe.ai: Langchain enables the creation of sophisticated pipelines that combine various AI capabilities, ensuring that aishe.ai can handle multi-step processes efficiently and effectively.
Backend API Development
- Example: FastAPI is used to develop the backend APIs that power aishe.ai’s functionalities. For example, endpoints for querying project data, submitting feedback, or configuring settings are all built using FastAPI.
- Usage in aishe.ai: FastAPI provides a robust and high-performance framework for building the backend services that support aishe.ai’s operations, ensuring fast and reliable API responses.
Vector Database for Semantic Search
- Example: PGVector is used to store and search through vector representations of textual data. For instance, meeting notes and project descriptions are converted into vectors for efficient semantic search.
- Usage in aishe.ai: PGVector allows aishe.ai to perform advanced searches and retrieve relevant information based on semantic similarity, enhancing the accuracy and relevance of the AI’s responses.
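Conceptually, the similarity search that PGVector performs is a nearest-neighbor lookup over embedding vectors. A toy stdlib-only illustration using cosine distance (the three-dimensional vectors are made up for the example; real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), as pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm


def nearest(query, docs, k=1):
    """Return the k document names whose embeddings are closest to the query."""
    return sorted(docs, key=lambda name: cosine_distance(query, docs[name]))[:k]
```

In production this ranking happens inside PostgreSQL over indexed vector columns; the toy version only illustrates the distance metric being used.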
Website Crawling and Data Extraction
- Example: Firecrawl is used to crawl and convert any website into LLM-ready markdown or structured data. For instance, aishe.ai can use Firecrawl to gather data from a company’s public web pages or internal sites without requiring a sitemap.
- Usage in aishe.ai: Firecrawl enhances aishe.ai’s ability to gather comprehensive data from web sources. It provides powerful scraping, crawling, and data extraction capabilities, enabling aishe.ai to convert website content into clean, structured data that can be used for various AI applications, such as answering queries or generating reports. This integration ensures that aishe.ai can access and utilize a wide range of web-based information efficiently.
Privacy-Preserving Language Models
- Example: Local LLMs are deployed to ensure data privacy and control. For instance, sensitive company data is processed by locally hosted language models rather than sending it to cloud-based services.
- Usage in aishe.ai: By using local LLMs, aishe.ai ensures compliance with GDPR and other privacy regulations, providing a secure environment for processing sensitive information without compromising on the AI’s capabilities.
By leveraging these technologies, aishe.ai provides a robust, secure, and efficient AI solution tailored to the needs of small and medium enterprises in Europe.
The following steps outline the process for handling prompts regarding internal company data, which is regularly scraped and updated in the database. This process is designed to retrieve relevant document vectors based on the user's access rights, determined by their memberships in various data sources.
- Objective: Identify the member based on the provided email.
- Process: The system searches the `members` table using the given email address. This table contains member details, including each member's unique identifier (`uuid`), which is crucial for subsequent steps.
- Objective: Determine the data sources to which the member has access.
- Process: With the member's `uuid`, the system retrieves all associated memberships from the `memberships` table. Each membership record links a member to a data source and potentially to specific documents within that source.
- Objective: Find documents relevant to the user's prompt, to which the user has access.
- Process:
- The system uses the memberships obtained in the previous step to identify accessible documents. This involves a join operation between the `memberships` table and the dynamically named `document_table__{organization_name}_{data_source_name}`, using the `document_uuid`.
- A similarity search is conducted on the `embeddings` field within the `document_table`. This search finds documents whose vector representations are similar to the vector representation of the user's prompt.
- This step is crucial for ensuring that the user only accesses documents they are permitted to view, based on their memberships.
- Objective: Enhance the language model's context with the found document vectors.
- Process:
- The vectors retrieved from the similarity search are added to the prompt's vector space. This integration is part of the language chain processing, which occurs outside the database.
- This step is essential for tailoring the language model's responses to be more relevant and informed by the specific content the user has access to.
- The efficiency of this process relies heavily on proper indexing of the tables, especially for large datasets. Indexes on fields like `email` in the `members` table and the `uuid` fields in all tables are crucial.
- The similarity search's performance in the `document_table` depends on the implementation of vector operations in PostgreSQL, particularly the use of `pgvector`.
- This flow assumes a robust system for managing and querying dynamically named `document_table`s, which is vital for the scalability and maintainability of the system.
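The retrieval steps above can be collapsed into one parameterized query. A sketch, not the project's actual implementation — it assumes pgvector's `<=>` cosine-distance operator and the table layout described in this section:

```python
def build_rbac_similarity_query(organization_name: str, data_source_name: str) -> str:
    """Build one retrieval query for the RBAC flow described above.
    Query parameters (%s), in order: prompt embedding, member email, result limit.
    Table names must come from trusted configuration, never from user input."""
    document_table = f"document_table__{organization_name}_{data_source_name}"
    return f"""
        SELECT d.name, d.url, d.content,
               d.embeddings <=> %s::vector AS distance
        FROM {document_table} AS d
        JOIN memberships m ON m.document_uuid = d.uuid
        JOIN members u ON u.uuid = m.member_uuid
        WHERE u.email = %s
        ORDER BY distance
        LIMIT %s
    """
```

With a driver such as psycopg, this would be executed as, e.g., `cursor.execute(query, (prompt_embedding, email, 5))`, so the member lookup, membership join, and similarity search happen in a single round trip.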
erDiagram
organizations ||--|{ data_sources : "belongs_to; one per airbyte source"
organizations ||--|{ members : "belongs_to"
data_sources ||--o{ document_table : "has; one table per source: allows different vector indices"
members ||--o{ memberships : "belongs_to"
data_sources ||--o{ memberships : "belongs_to"
document_table ||--o{ memberships : "belongs_to"
organizations {
uuid uuid PK
name string
description string
}
data_sources {
uuid uuid PK
organization_uuid uuid FK
name text
description text
bot_auth_data jsonb
document_table_metadata jsonb
airbyte_meta_data jsonb
}
members {
uuid uuid PK
organization_uuid uuid FK
email text
name text
}
"document_table__{organization_name}_{data_source_name}" {
uuid uuid PK
data_source_uuid uuid FK
name text
description text
url text
context_data jsonb
embeddings pgvector
content text
}
memberships {
uuid uuid PK
data_source_uuid uuid FK
member_uuid uuid FK
document_uuid uuid
document_table_name text
data_source_meta_data jsonb
}
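Because each data source gets its own document table, the DDL has to be generated per source. A sketch under the schema above (the embedding dimension 1536 is an assumption, matching a common OpenAI embedding size):

```python
def create_document_table_ddl(organization_name: str, data_source_name: str,
                              dims: int = 1536) -> str:
    """DDL for one dynamically named per-source document table.
    Names must come from trusted configuration, never from user input."""
    table = f"document_table__{organization_name}_{data_source_name}"
    return f"""
        CREATE TABLE IF NOT EXISTS {table} (
            uuid uuid PRIMARY KEY,
            data_source_uuid uuid REFERENCES data_sources (uuid),
            name text,
            description text,
            url text,
            context_data jsonb,
            embeddings vector({dims}),
            content text
        );
    """
```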
organizations
- Primary Key Index on `uuid`.
- Optional Index on `name` if frequently queried.
data_sources
- Primary Key Index on `uuid`.
- Foreign Key Index on `organization_uuid`.
- Optional Index on `name` if frequently queried.
members
- Primary Key Index on `uuid`.
- Foreign Key Index on `organization_uuid`.
- Index on `email` for search operations.
document_table__{organization_name}_{data_source_name}
- Primary Key Index on `uuid`.
- Foreign Key Index on `data_source_uuid`.
- Optional Indexes on `name`, `url`, `metadata`.
- Appropriate indexing for `embeddings` (pgvector).
memberships
- Primary Key Index on `uuid`.
- Foreign Key Indexes on `data_source_uuid`, `member_uuid`, `document_uuid`.
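The recommendations above translate into DDL along these lines; a hedged sketch (index names and the IVFFlat `lists` parameter are illustrative — the right value depends on table size, and primary keys already get an index implicitly in PostgreSQL):

```python
def index_statements(document_table: str) -> list:
    """CREATE INDEX statements for the hot lookup paths described above."""
    return [
        "CREATE INDEX IF NOT EXISTS idx_members_email ON members (email);",
        "CREATE INDEX IF NOT EXISTS idx_members_org ON members (organization_uuid);",
        "CREATE INDEX IF NOT EXISTS idx_data_sources_org ON data_sources (organization_uuid);",
        f"CREATE INDEX IF NOT EXISTS idx_doc_source ON {document_table} (data_source_uuid);",
        # Approximate nearest-neighbor index for the embeddings column (pgvector).
        f"CREATE INDEX IF NOT EXISTS idx_doc_embeddings ON {document_table} "
        "USING ivfflat (embeddings vector_cosine_ops) WITH (lists = 100);",
        "CREATE INDEX IF NOT EXISTS idx_memberships_member ON memberships (member_uuid);",
        "CREATE INDEX IF NOT EXISTS idx_memberships_doc ON memberships (document_uuid);",
    ]
```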
erDiagram
langchain_pg_collection ||--o{ langchain_pg_embedding : belongs_to
langchain_pg_collection {
uuid uuid PK
name varchar()
cmetadata json
}
langchain_pg_embedding {
uuid uuid PK
embedding vector
document varchar()
cmetadata json
custom_id varchar()
collection_id uuid FK
}
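These two tables follow the layout LangChain's PGVector integration manages: each embedding row belongs to exactly one collection via `collection_id`. A toy in-memory model of that one-to-many relationship (the dataclasses are illustrative, not LangChain's actual ORM models):

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Mirrors langchain_pg_collection: a named group of embeddings."""
    name: str
    cmetadata: dict = field(default_factory=dict)
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Embedding:
    """Mirrors langchain_pg_embedding: one vector plus its source document."""
    collection_id: str
    embedding: list
    document: str
    custom_id: str = ""
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def embeddings_for(collection: Collection, rows: list) -> list:
    """The belongs_to join: all embedding rows of one collection."""
    return [row for row in rows if row.collection_id == collection.uuid]
```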