This project is my side project: an implementation of an AI-powered Enterprise RAG (Retrieval-Augmented Generation) system. It uses a pre-trained model to generate embeddings for books, then uses Elasticsearch to index and search them with multi-modal search:
- traditional text search
- 🧮 cosine similarity search using embeddings (books are recommended based not just on keywords but on semantics, user preferences, etc., all embedded as a vector)
- I did not choose a dedicated vector database because Elasticsearch already provides vector storage and search capabilities. It is not as capable as a purpose-built vector database, but it is good enough for this project. Milvus is a good alternative if you want a dedicated vector database.
- For big firms with more resources, a stronger stack would be: PyTorch + ONNX for model development, FastAPI + Docker for deployment, and Ray + Grafana for the MLOps lifecycle. ONNX is also a safer, more portable serialization format than `pickle`.
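The cosine-similarity search above can be sketched without a running cluster: Elasticsearch ranks documents on a `dense_vector` field via a `script_score` query with the Painless `cosineSimilarity` function. The field name `embedding` and the vector below are illustrative assumptions, not this project's actual schema.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity: dot product divided by the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def build_semantic_query(query_vector: list, field: str = "embedding", size: int = 10) -> dict:
    """Build an Elasticsearch script_score query body for a dense_vector field.

    "embedding" is an assumed field name; "+ 1.0" shifts scores into [0, 2]
    because Elasticsearch rejects negative scores.
    """
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"cosineSimilarity(params.q, '{field}') + 1.0",
                    "params": {"q": query_vector},
                },
            }
        },
    }


# Identical direction -> similarity 1.0 regardless of magnitude
print(round(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])), 6))  # -> 1.0
```

The query body would be passed to the Python Elasticsearch client's `search()` call against the books index.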
If you run this project locally after `git clone`, the indexing and searching parts only use a small sample dataset: I want the interviewer (or anyone interested in trying it) to be able to run the code on their machine and see the results, and sharing a Parquet file with 1.5M records plus embeddings takes time. The online version uses the full dataset.
If you haven't tried ONNX before, please check it out. It is a great way to deploy models when inference performance in production matters.
- Python 3.10.10
- Docker (>= 24.0.5 should work)
- Docker Compose
# check your python version
# recommend using pyenv to manage python versions
python --version # should be >= 3.10.10
python -m venv venv
source venv/bin/activate
make install
- `make onnx`: construct the ONNX model
- `make elastic-up`: start Elasticsearch
- `make index-books`: index books (you might need to run this several times, as Elasticsearch may not be ready yet)
- `make run`: start the FastAPI server
- `make test`: run the tests
The port might be different if you already have services running on port 8080.
TODO: Add deployment instructions
It uses the fastapi-cookiecutter template. The project structure is as follows:
.
├── app
│ ├── api
│ ├── core
│ ├── __init__.py
│ ├── main.py
│ ├── models
│ ├── __pycache__
│ ├── services
│ └── templates
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── ml
│ ├── data
│ ├── features
│ ├── __init__.py
│ ├── model
│ └── __pycache__
├── notebooks
│ ├── construct_sample_dataset.ipynb
│ └── onnx_runtime.ipynb
├── poetry.lock
├── pyproject.toml
├── README.md
├── search
│ ├── books_embeddings.csv
│ ├── docker-compose.yml
│ └── index_books.py
├── tests
│ ├── __init__.py
│ ├── __pycache__
│ ├── test_api.py
│ ├── test_elastic_search.py
│ └── test_onnx_embedding.py
Originally, the data was downloaded from the Goodreads Book Graph Datasets; the author also provides code to download the data. I downloaded it and uploaded it to my Google Cloud Storage bucket. Please let me know if you find that the above links are broken, and I will provide you with the data.
There are many tables in the dataset, but we are only interested in the following tables:
- books: detailed metadata about 2.36M books
- reviews: complete set of 15.7M reviews (~5 GB), including 15M records with detailed review text
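The Goodreads tables ship as gzipped JSON Lines, one record per line. A pandas sketch of pulling out the fields needed for embedding (the two sample records and the kept columns are illustrative, not the dataset's full schema):

```python
import io
import json

import pandas as pd

# Hypothetical two-record sample mimicking the books table layout
raw = "\n".join(json.dumps(r) for r in [
    {"book_id": "1", "title": "Dune", "description": "Desert planet saga.", "average_rating": "4.25"},
    {"book_id": "2", "title": "Hyperion", "description": "Pilgrims tell their tales.", "average_rating": "4.23"},
])

# The real file is read the same way, e.g.
# pd.read_json("goodreads_books.json.gz", lines=True, compression="gzip")
books = pd.read_json(io.StringIO(raw), lines=True)

# Keep only the columns used to build embeddings and search results
books = books[["book_id", "title", "description"]]
print(len(books))  # -> 2
```

For the full 2.36M-row table you would read in chunks (`chunksize=...`) rather than loading everything into memory at once.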