This repository contains a pure Python implementation of a Retrieval-Augmented Generation (RAG) system with an in-memory vector database. The project shows a basic approach to storing, querying, and retrieving information based on vector similarity. It is an educational tool that demonstrates foundational aspects of information retrieval and natural language processing (NLP), and how they apply to building intelligent systems.
The core of this project is the InMemoryVectorDB class, which handles the storage and retrieval of documents in an in-memory vector database. The implementation shows how to fetch data from an external API, process and store it in a structured format, and then query the database for the most relevant information based on the cosine similarity of vector representations.
- In-Memory Vector Database: A simple, custom-built database to store document vectors and metadata in memory.
- Cosine Similarity-Based Retrieval: Implements cosine similarity to find the closest matching document vector to a given query vector.
- Data Fetching and Processing: Example code to fetch currency conversion rates from an API, process them, and store them in the database.
- Tokenization and Vectorization: Utilizes a bag-of-words approach for transforming documents and queries into vector representations.
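The bag-of-words and cosine similarity steps listed above can be illustrated with a short, self-contained sketch. This is not the repository's code: the function names `tokenize`, `vectorize`, and `cosine_similarity`, the regex-based tokenizer, and the sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text, vocabulary):
    """Return a bag-of-words count vector aligned with the vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, used only to exercise the functions.
documents = [
    "convert usd to eur",
    "convert usd to jpy",
    "weather forecast for paris",
]

# Build a sorted vocabulary from every token seen in the documents.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})
doc_vectors = [vectorize(doc, vocabulary) for doc in documents]

# Query tokens that are not in the vocabulary (here "rate") are simply ignored.
query_vector = vectorize("usd to eur rate", vocabulary)
scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
best_index = max(range(len(documents)), key=scores.__getitem__)
print(documents[best_index])
```

Because the vectors hold raw word counts, documents that share more words with the query score higher, and a document with no shared words scores 0.0.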
To get started with this project, clone the repository to your local machine:
git clone https://github.com/yourusername/pure-python-rag-with-in-memory-vector-db.git
Navigate into the project directory:
cd pure-python-rag-with-in-memory-vector-db
This project is built with standard Python libraries along with requests for API interactions. Ensure you have Python 3.x installed, then install the necessary dependencies:
pip install requests
The project is structured into two main files: inmemorydb.py and main.py.
- inmemorydb.py contains the implementation of the InMemoryVectorDB and Collection classes.
- main.py demonstrates how to fetch data from an external API, process and store it in the in-memory database, and then query the database.
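The store-then-query flow described for main.py might look roughly like the sketch below. It does not use the repository's InMemoryVectorDB or Collection API, which is not shown here. `TinyVectorStore`, its `add` and `query` methods, and the document wording are hypothetical stand-ins, and the rates are placeholder values rather than data fetched from any API.

```python
import math
import re
from collections import Counter


class TinyVectorStore:
    """Hypothetical stand-in for an in-memory vector store; not the repository's API."""

    def __init__(self):
        self.records = []  # list of (token_counts, metadata) pairs

    @staticmethod
    def _embed(text):
        # Sparse bag-of-words vector: token -> count.
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    @staticmethod
    def _cosine(a, b):
        # Counter returns 0 for missing tokens, so only shared tokens contribute.
        dot = sum(count * b[token] for token, count in a.items())
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def add(self, text, metadata):
        """Vectorize the text and keep it in memory next to its metadata."""
        self.records.append((self._embed(text), metadata))

    def query(self, text):
        """Return (score, metadata) for the closest stored document, or None if empty."""
        if not self.records:
            return None
        query_vec = self._embed(text)
        return max(
            ((self._cosine(query_vec, vec), meta) for vec, meta in self.records),
            key=lambda pair: pair[0],
        )


# Placeholder rates for illustration only; real code would fetch these from an API.
sample_rates = {"EUR": 0.5, "JPY": 100.0}

store = TinyVectorStore()
for currency, rate in sample_rates.items():
    store.add(f"exchange rate usd to {currency}", {"currency": currency, "rate": rate})

score, match = store.query("what is the usd to jpy rate")
print(match["currency"], match["rate"])
```

In a RAG setup, the metadata of the best match would then be passed to the generation step as context. The sketch keeps vectors sparse (token-to-count mappings) so that no shared vocabulary has to be maintained as documents are added.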
To run the example, execute:
python main.py
Contributions to improve the project are welcome. Before contributing, please check the Issues tab to see if your suggestion or improvement has already been discussed or is in progress.
To contribute:
- Fork the repository.
- Create a new branch for your feature or fix.
- Commit your changes with meaningful messages.
- Push your branch and submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
This project is intended for educational purposes and serves as a basic demonstration of retrieval-augmented generation concepts using a pure Python approach. It is not optimized for production use but rather aims to provide a clear and understandable implementation of these ideas.