The purpose of this project is to play around with modern RAG libraries and to create a retrieval system for my blog.
This code is not recommended for use by anyone else. If you want, you can pick some parts from it or read it for learning purposes.
Why do the commands have such strange names?
They are named in order, to emulate a pipeline.
For convenience, we store intermediate results in files using joblib.
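A minimal sketch of how such stage caching could look, assuming one dump file per stage. The helper run_stage, the stage name and the file naming are hypothetical and not taken from the project; only joblib.dump and joblib.load are real library calls.

```python
import tempfile
from pathlib import Path

import joblib


def run_stage(name, compute, cache_dir):
    """Return the cached result of a stage, computing and dumping it on first use."""
    path = Path(cache_dir) / f"{name}.joblib"
    if path.exists():
        return joblib.load(path)  # reuse the result of an earlier run
    result = compute()
    joblib.dump(result, path)  # persist it for the next command in the pipeline
    return result


with tempfile.TemporaryDirectory() as cache_dir:
    # First call computes the result and writes it to disk.
    posts = run_stage("01_read_posts", lambda: [{"id": 1, "text": "Hello"}], cache_dir)
    # Second call finds the dump, so the (different) compute function is never invoked.
    cached = run_stage("01_read_posts", lambda: [], cache_dir)
```

With this layout each command only needs the dump of the previous one, so a single step can be rerun without repeating the whole pipeline.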
BlogPostsReader: reads data from the database and creates PostDocument objects.
DiskDumpReader: reads dumped PostDocument objects from disk.
PostDoumentsLoader: loads data from PostDocument objects into Langchain Document objects; this also includes splitting our documents.
MarkdownSplitter: splits text into Langchain Document objects. I had to write my own because, strangely, the markdown splitters from Langchain, Unstructured and LlamaIndex all failed to make correct splits and to identify code blocks.
SentenceSplitter: a wrapper around SentenceTransformersTokenTextSplitter that makes it compatible with the interface and easy to use.
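To illustrate the code-block problem mentioned for MarkdownSplitter, here is a rough sketch of a heading-based splitter that tracks fenced code blocks, so a line starting with # inside a fence is not mistaken for a heading. This is not the project's MarkdownSplitter; the function split_markdown and its rules are assumptions made for the example, and it returns plain strings rather than Langchain Document objects.

```python
import re

FENCE = re.compile(r"^(`{3,}|~{3,})")
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_markdown(text):
    """Split markdown into sections at headings, never inside a fenced code block."""
    sections, current, open_fence = [], [], None
    for line in text.splitlines():
        fence = FENCE.match(line.strip())
        if fence:
            marker = fence.group(1)
            if open_fence is None:
                open_fence = marker[0]  # entering a code block
            elif marker[0] == open_fence:
                open_fence = None  # leaving the code block
        elif open_fence is None and HEADING.match(line) and current:
            # A real heading outside any fence starts a new section.
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


fence_mark = "`" * 3  # built at runtime to keep this example's own fence intact
sample = "\n".join([
    "# Intro",
    "Some text.",
    fence_mark + "python",
    "# a comment, not a heading",
    "print('hi')",
    fence_mark,
    "## Details",
    "More text.",
])
sections = split_markdown(sample)
```

The sample yields two sections: the Python comment stays inside the first one together with its code block, and the second one starts at the Details heading.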
I'm using the sentence-transformers library with the all-MiniLM-L6-v2 model because it's small and fast.
That's why the vector store uses the ready-made SentenceTransformerEmbeddings embedder:
import os

from langchain_chroma import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings

from app.settings import SettingsLocal
from components.interfaces import Component


class VectorStore(Component):
    def __init__(self):
        super().__init__()
        self.config = {
            # Directory under DATA_DIR used for the posts data.
            "posts_directory": os.path.join(SettingsLocal.DATA_DIR, "posts"),
            # Embedder built from the model named in the settings.
            "embedder": SentenceTransformerEmbeddings(
                model_name=SettingsLocal.TRANSFORMERS_MODEL,
            ),
        }
You can also implement the langchain_core.embeddings.embeddings.Embeddings interface and add your own embedder to components.
It is then easy to plug into the vector store via the config dict.
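As a rough illustration of a custom embedder, the sketch below implements the two methods the Embeddings interface requires, embed_documents and embed_query. The class HashingEmbedder is a toy made up for this example (it hashes tokens into buckets and has no semantic quality), and the import fallback exists only so the snippet runs without LangChain installed.

```python
import hashlib
import math

try:
    from langchain_core.embeddings import Embeddings
except ImportError:  # fallback so the sketch runs without LangChain installed
    Embeddings = object


class HashingEmbedder(Embeddings):
    """Toy embedder: hashes each token into one of `dim` buckets, then L2-normalizes."""

    def __init__(self, dim=64):
        self.dim = dim

    def _embed(self, text):
        vector = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vector)) or 1.0  # avoid division by zero
        return [v / norm for v in vector]

    def embed_documents(self, texts):
        return [self._embed(text) for text in texts]

    def embed_query(self, text):
        return self._embed(text)


embedder = HashingEmbedder(dim=32)
vec = embedder.embed_query("retrieval augmented generation")
doc_vecs = embedder.embed_documents(["retrieval augmented generation", "my blog"])
```

An instance like this could then replace the value under the "embedder" key of the config dict shown in VectorStore. Whether the project needs anything beyond that swap is an assumption I have not checked.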
VectorStoreRetriever: our dense retriever. It retrieves relevant documents from the vector store.
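For readers new to dense retrieval, here is a simplified sketch of the idea: rank stored vectors by cosine similarity to the query vector and keep the top k. The real retriever delegates this to the vector store; the functions cosine and retrieve and the tiny three-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, indexed, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# (text, embedding) pairs; in practice the embeddings come from the embedder.
indexed = [
    ("post about vector stores", [0.9, 0.1, 0.0]),
    ("post about cooking", [0.0, 0.2, 0.9]),
    ("post about embeddings", [0.8, 0.3, 0.1]),
]
top = retrieve([1.0, 0.0, 0.0], indexed, k=2)
```

Here the two posts whose vectors point in roughly the same direction as the query are returned, and the cooking post is dropped.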
If you want a full-scale RAG system, a generation step is also required. You will need a model to which you can feed the closest retrieved documents as context and your query as the question, and which generates a response based on them.
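A small sketch of that last step, assuming the model is exposed as a plain callable that takes a prompt string and returns text. The names build_prompt, answer and fake_model, as well as the prompt wording, are hypothetical; fake_model is only a stand-in so the example runs without a real model.

```python
def build_prompt(question, documents):
    """Assemble a prompt that puts the retrieved documents before the question."""
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, documents, generate):
    """`generate` is any callable that maps a prompt string to a response string."""
    return generate(build_prompt(question, documents))


def fake_model(prompt):
    # Stand-in for a real model; it only reports the size of the prompt it received.
    return f"(model output for a {len(prompt)}-character prompt)"


retrieved = ["The blog uses all-MiniLM-L6-v2."]
prompt = build_prompt("Which embedding model is used?", retrieved)
reply = answer("Which embedding model is used?", retrieved, fake_model)
```

Keeping the model behind a simple callable means the retrieval part stays the same whichever model is plugged in.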