This is the web user interface for a local chat completion app that works with LLMs running locally, backed by a persistent message and document embeddings store using Postgres. With it, the user can easily perform Retrieval Augmented Generation (RAG) on their local documents through the Web UI.
This ensures that all data stays on the user's device, and that chat history and documents persist when the user shuts down or restarts the application or their computer.
LLMs are run locally using Ollama.
This app provides a number of features.
As can be seen from above, the user can freely converse with files that are already embedded and uploaded to the app's database. They can also remove files from the knowledge base when they are no longer needed.
The app also provides formatting for easier visualisation of code outputs and language-specific syntax highlighting.
As can be seen below, the user can also stop or interrupt the response while it is streaming. Do note that, for now, this may cause the underlying Ollama instance to misbehave and introduce latency problems.
This app also provides breakpoints to exclude earlier parts of the chat history from the prompt given to the model. This can help focus the output on the latest prompts, for example when talking to the model about a different part of the same document.
To add files to the conversation, the user can either drop in new ones or link previously uploaded files to the conversation.
This can be used effectively in combination with breakpoints to talk to multiple documents in the same chat.
With output regeneration on every model reply, the user can add as many breakpoints as they wish and regenerate the output as needed.
- Clone this repository.
- Ensure that you have either Docker Desktop or Docker Engine (Linux server) running on your computer.
  - To start, simply launch the Docker Desktop application, or follow the instructions for Docker Engine on their website.
- Set up Ollama and start it up. It only works on Mac/Linux for now.
- Ensure that you have Docker on your system and start the Docker Desktop app. Also ensure that you have the versions of Node and npm listed here:
- In the project root, run these commands (if on Windows, use Git Bash or an equivalent UNIX-based shell):

  ```bash
  # Prepare the workspace
  make setup
  # Start the services
  make up
  # If it is your first time
  npm run build
  # Start the UI
  npm run start
  ```
- Go to http://localhost:3000 in your browser.
- Take care not to upload files larger than 5MB. The app may error due to the large volume of embeddings needed.
- To stop the app, press Cmd (Mac) / Ctrl (Windows) + C.
- To take down the database, run `make down`.
- To remove the database data entirely, run `make clean`.
We provide a Dockerfile to deploy the UI as a standalone container. Do note that the Ollama desktop application has CORS issues with the Docker container, so experimentation may be needed.
When choosing an embedding model, we want one that performs well on the MTEB (Massive Text Embedding Benchmark).
While quality matters, latency also needs to be low. Hence, we pick a decently performing model that is not too large, so that we can run it locally with good results.
Models Tested:
- Xenova/gte-base - Better quality, slower
- Xenova/all-MiniLM-L6-v2 - Lower quality, faster
To test other models, ensure that the model has ONNX weights so that our client (Langchain.js) supports it.
Then, change the model name in the huggingfaceEmbeddings.ts file (we suggest Xenova's feature-extraction models on HuggingFace).
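For reference, a minimal sketch of what huggingfaceEmbeddings.ts might contain (the actual file may differ, and depending on your LangchainJS version the option may be named `model` instead of `modelName`):

```ts
// Sketch of huggingfaceEmbeddings.ts — swap modelName for any Xenova
// feature-extraction model that ships ONNX weights.
import { HuggingFaceTransformersEmbeddings } from "@langchain/community/embeddings/hf_transformers";

export const embeddings = new HuggingFaceTransformersEmbeddings({
  modelName: "Xenova/all-MiniLM-L6-v2", // or "Xenova/gte-base" for higher quality
});
```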
For our model, we run Mistral 7B via Ollama.
For a model to work with the app, we use the LangchainJS ChatOllama interface. To test other models, you may change the parameters in the chatOllama.ts file to pass a different model name or endpoint URL for your Ollama container.
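As a rough illustration (the actual chatOllama.ts may differ, and the import path depends on your LangchainJS version):

```ts
// Sketch of chatOllama.ts — point ChatOllama at your Ollama instance and model.
import { ChatOllama } from "@langchain/community/chat_models/ollama";

export const chatModel = new ChatOllama({
  baseUrl: "http://localhost:11434", // Ollama endpoint; adjust if Ollama runs in a container
  model: "mistral",                  // any model you have pulled into Ollama
});
```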
For this, we use a persisted PGVector database for the document embeddings, alongside app metadata tables that persist your chat history, both stored in your local Docker Postgres container.
To configure embeddings, you may edit the vectorStore.ts file to change the database. Do note that this may affect other files, such as the API routes within the app/api folder.
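A minimal sketch of such a configuration, assuming LangchainJS's PGVectorStore (the table name and credentials below are placeholders, not the app's actual values):

```ts
// Sketch of vectorStore.ts — connect the embeddings to a local pgvector table.
import { PGVectorStore } from "@langchain/community/vectorstores/pgvector";
import { embeddings } from "./huggingfaceEmbeddings"; // hypothetical export from the sketch above

export async function getVectorStore() {
  return PGVectorStore.initialize(embeddings, {
    postgresConnectionOptions: {
      host: "localhost",
      port: 5432,
      user: "postgres",     // match your Docker Postgres credentials
      password: "postgres",
      database: "postgres",
    },
    tableName: "document_embeddings", // placeholder table name
  });
}
```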
To change the metadata store, you may adjust the connection parameters in the dbInstance.ts file to point to your database.
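For example, if dbInstance.ts is backed by node-postgres (an assumption; the real file may use a different client), the connection parameters would look roughly like this:

```ts
// Sketch only — assumes the "pg" (node-postgres) package.
import { Pool } from "pg";

export const dbInstance = new Pool({
  host: "localhost",
  port: 5432,
  user: "postgres",     // adjust to your Docker Postgres credentials
  password: "postgres",
  database: "postgres",
});
```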
With Large Language Models on the rise, models such as Anthropic's Claude and OpenAI's ChatGPT are all the rage. However, they run only on servers operated by external parties, and data is sent over the wire to them.
While they may be powerful, they are often overkill for simple applications such as code generation, small tasks, and analysis of small documents. This is where smaller, open-source models come in.
With instructional fine-tuning and sliding context windows, models such as Mistral AI's Mistral 7B demonstrate that smaller models can perform decently when fine-tuned to certain prompt formats.
For instance, Mistral 7B is fine-tuned on an instructional format with instruction tokens ([INST], [/INST]) that eliminate the need for complex prompting. The user simply feeds their instructions between the tokens, and the model treats the input as an instruction and produces an output accordingly.
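For illustration, a prompt in this format looks roughly like the following (Ollama applies the template for you when you use its chat API, so you rarely need to write it by hand):

```
[INST] Summarise the key points of this document in three bullet points. [/INST]
```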
This leaves more context for smaller models such as Mistral 7B to take in larger inputs, despite their smaller size and context window. Using our Ollama.ai configuration running locally, we are able to use up to 32768 characters of context length.
While such large models may be trained on large corpora of data, it is often hard for them to retrieve data that is accurate or relevant to the user's query, given the sheer size of their training data.
By indexing relevant source documents and retrieving the context relevant to the user's query from them, we can supply that context to the model so that it can ground its response in the retrieved source material.
This makes the model's output more relevant to the query and the use case of the user.
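As a rough sketch of how this retrieval step can look with LangchainJS (the helper names below come from the hypothetical sketches above, not the app's actual code):

```ts
// Illustrative RAG flow: retrieve similar chunks, then ground the model's answer in them.
import { getVectorStore } from "./vectorStore"; // hypothetical helpers from the sketches above
import { chatModel } from "./chatOllama";

export async function answerWithContext(question: string) {
  const vectorStore = await getVectorStore();
  // Fetch the 4 chunks most similar to the question from the local pgvector store.
  const docs = await vectorStore.similaritySearch(question, 4);
  const context = docs.map((d) => d.pageContent).join("\n\n");

  // Ask the local model to answer using only the retrieved context.
  const response = await chatModel.invoke(
    `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`
  );
  return response.content;
}
```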
Oftentimes, such tools, whether they rely on instructional prompting, fine-tuned models such as GPT-4, or RAG "Chat with PDF" apps, sit behind a paywall.
In addition to the above, this app aims to solve three problems, namely:
- Paying to use these features.
Through the use of similar fine-tuned models from open-source contributors, and careful prompting, we can often achieve similar results on smaller contexts or documents.
- Providing the ability to regenerate and reset the chat context.
When a chat history grows long, the model's response may be focused on the messages at the start of the chat.
This app aims to mitigate this by allowing the user to add/delete breakpoints in the chat, regenerate outputs, and finetune the model's responses based on contiguous subsets of the chat history.
For instance, the start of the chat may ask about one part of a document, leading the model to focus on that document.
By adding a breakpoint, the user can start fresh and ask questions about other parts of the document, letting the model give answers as if the context was fresh.
- Data privacy.
When you use a "Chat with your PDF" tool, you upload data to their model and/or server. This is not an option for sensitive documents.
This app mitigates that by:
- Providing a local Vector Store using Postgres to save your document embeddings to your local disk with Docker volumes
- Downloading open-source ONNX binaries to run embeddings locally within the app, instead of remote models.
- Using local model weights with models such as Mistral running within Ollama, so your traffic is not sent to external servers.
By running this cluster and connecting it to your Ollama container, I hope this improves your experience of implementing custom chat-model solutions, as opposed to paying large premiums for external models.