I use my standard setup, with a dedicated conda environment for this project, specified with the direnv layout feature. Required libraries can be installed with:
pip3 install -r requirements.txt
Note that my .envrc is encrypted with git-crypt.
Ollama allows us to run open-source LLMs on local machines. This is useful for enhanced privacy since one’s private data is never shared with LLM providers.
RAG is the process of augmenting an LLM prompt with relevant context drawn from one or more documents. Typically, documents are broken down into chunks (which may optionally overlap) of a certain size. These chunks are then split into tokens and converted into vectors of real numbers in a process called embedding. These vectors may be stored in a vector database or index. Later, a query may also be converted to a vector embedding which can be used to perform a similarity search against the index. The top matches may be retrieved and added to the context of an LLM prompt, along with the prompt for the model.
This Python class organizes and wraps LangChain classes to provide a simplified RAG interface over org-mode documents.
from langchain_community.document_loaders import DirectoryLoader, UnstructuredOrgModeLoader
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
import os
class OrgModeDocumentStore:
def __init__(self, collection, directory, model="mixtral:latest",
search_type="mmr", mmr_diversity=0.75,
num_search_results=5, show_progress=False,
silent_errors=False):
self.collection = collection
self.directory = directory
if not os.path.exists(directory):
raise RuntimeError(f"Directory {directory} does not exist.")
self.index_directory = os.path.join(directory, ".chroma")
if not os.path.exists(self.index_directory):
os.mkdir(self.index_directory)
self.loader = DirectoryLoader(directory, glob="**/*.org", use_multithreading=True,
silent_errors=silent_errors,
loader_cls=UnstructuredOrgModeLoader,
loader_kwargs={"mode": "single"})
self.search_type = search_type
self.k = num_search_results
self.diversity = mmr_diversity
self.model = model
self.embeddings = OllamaEmbeddings(model=model, show_progress=show_progress)
self.db = Chroma(collection_name=collection,
embedding_function=self.embeddings,
persist_directory=self.index_directory)
def __repr__(self):
return f"""
OrgModeDocumentStore(
collection={self.collection!r},
directory={self.directory!r},
index_directory={self.index_directory!r},
loader={self.loader!r},
embeddings={self.embeddings!r},
model={self.model!r},
db={self.db!r},
search_type={self.search_type!r},
k={self.k!r},
diversity={self.diversity!r}
)"""
# indexing management
def load(self):
"Loads all org-mode documents found under the given directory recursively."
self.documents = self.loader.load()
def add_documents(self, docs):
"Adds the given docs to the Chroma vectorstore and returns the document ids."
return self.db.add_documents(docs)
def update_document(self, id, doc):
"Updates the single document identified by the id."
return self.db.update_document(id, doc)
def update_documents(self, ids, docs):
"Updates the documents identified by the given ids."
return self.db.update_documents(ids, docs)
def create_index(self):
"Creates the index from the loaded documents. This should only be run once."
self.load()
if len(self.documents) > 0:
print(f"Indexing {len(self.documents)} documents.")
return self.add_documents(self.documents)
# query
def print_documents(self):
"Print the list of all documents."
self.load()
for d in self.documents:
print(d.metadata['source'])
def similarity_search(self, query):
"Search the vectorstore for docs relevant to the query."
return self.db.similarity_search(query, self.k)
def mmr_search(self, query):
"Executes max marginal relevance search for the query."
return self.db.max_marginal_relevance_search(query, k=self.k, lambda_mult=self.diversity)
def as_retriever(self):
"Returns a retriever for this vectorstore."
return self.db.as_retriever()
The Document Loader abstraction presents a unified interface for loading various file types, including plain text, Markdown, JSON, and more. The constructor identifies the documents to load, and the load() method does the actual work.
Text Splitters break long documents into smaller chunks so we can pass them into an LLM context window.
- recursive
- splits on user-defined chars, keeps related chunks next to each other.
- token
- splits text on tokens
- character
- splits on user-defined chars
- semantic chunker
- splits on sentences, then combines adjacent ones if they are semantically similar enough
from orgstore import OrgModeDocumentStore
collection = "org-rag"
directory = "/Users/christian/Documents/personal/notes/content/roam/"
store = OrgModeDocumentStore(collection=collection, directory=directory, show_progress=True)
document_ids = store.create_index()
print(f"create_index: {document_ids}")
# data = zip(document_ids, store.documents)
# for id, doc in data:
# print(f"{id}: {doc.metadata['source']}")
Use the vector store to find relevant documents.
from orgstore import OrgModeDocumentStore
collection = "org-rag"
directory = "/Users/christian/Documents/personal/notes/content/roam/"
store = OrgModeDocumentStore(collection=collection, directory=directory, silent_errors=True)
i, query = 1, ""
print("Enter search query at the prompt or type '?list' for docs, or '?quit' to exit.\n")
while not query.lower() == "?quit":
query = input(f"{i}> ")
if query == "?quit":
print("Goodbye.")
elif query == "?list":
i += 1
store.print_documents()
else:
i += 1
#results = store.as_retriever().get_relevant_documents(query)
#results = store.mmr_search(query)
results = store.similarity_search(query)
for doc in results:
print(f"file: {doc.metadata['source']}, length: {len(doc.page_content)}")
display = input("Display page content? (y|n)> ")
if display.lower() == "y":
print(f"content: {doc.page_content}\n" )
print("-" * 80)
I’m not thrilled with these results. The chunks are very small and anecdotally not the most relevant. I’d like to feed more context to an LLM.