Indexing Millions of Wikipedia Articles With Upstash Vector

This repository contains the code and documentation for our project on indexing millions of Wikipedia articles using Upstash Vector, as described in our blog post.

Project Overview

We've built a semantic search engine and a RAG chatbot on top of Wikipedia data to demonstrate the capabilities of Upstash Vector and the Upstash RAG Chat SDK. The project involves:

Preparing and embedding Wikipedia articles
Indexing the vectors using Upstash Vector
Building a Wikipedia semantic search engine
Implementing a RAG chatbot
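
The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: splitting on blank lines, the 20-character threshold, the `lang:title:position` ID scheme, the metadata fields, and the batch size of 100 are all assumptions made for the example. The `embed` and `upsert` callables are injected so that the helpers can be exercised without a model or network access.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Sequence

# One record is (id, vector, metadata): the tuple form that upstash-vector's upsert accepts.
Record = tuple[str, list[float], dict]


def split_paragraphs(text: str, min_chars: int = 20) -> list[str]:
    """Split on blank lines and drop very short fragments (assumed threshold)."""
    return [p.strip() for p in text.split("\n\n") if len(p.strip()) >= min_chars]


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def index_article(
    title: str,
    lang: str,
    text: str,
    embed: Callable[[list[str]], Sequence[Sequence[float]]],
    upsert: Callable[[list[Record]], object],
    batch_size: int = 100,
) -> int:
    """Embed an article paragraph by paragraph and upsert it in batches.

    Returns the number of vectors written.
    """
    total = 0
    for batch in batched(enumerate(split_paragraphs(text)), batch_size):
        vectors = embed([paragraph for _, paragraph in batch])
        records = [
            (
                f"{lang}:{title}:{position}",  # hypothetical ID scheme
                [float(x) for x in vector],
                {"title": title, "lang": lang, "text": paragraph},
            )
            for (position, paragraph), vector in zip(batch, vectors)
        ]
        upsert(records)
        total += len(records)
    return total


def main() -> None:
    # Third-party imports live here so the helpers above work without them installed.
    from sentence_transformers import SentenceTransformer
    from upstash_vector import Index

    model = SentenceTransformer("BAAI/bge-m3")
    index = Index.from_env()  # reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN

    sample = (
        "Alan Turing was an English mathematician and computer scientist.\n\n"
        "He is widely considered a founder of theoretical computer science."
    )
    written = index_article(
        title="Alan Turing",
        lang="en",
        text=sample,
        embed=lambda texts: model.encode(texts, normalize_embeddings=True),
        upsert=lambda records: index.upsert(vectors=records),
    )
    print(f"Upserted {written} vectors")
```

Calling `main()` requires the `sentence-transformers` and `upstash-vector` packages plus valid Upstash Vector credentials in the environment. Batching keeps the number of HTTP requests small, which matters at the scale of millions of vectors.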

Key Features

Indexed over 144 million vectors from Wikipedia articles in 11 languages
Used BGE-M3 embedding model for multilingual support
Implemented semantic search with cross-lingual capabilities
Created a RAG chatbot using Upstash RAG Chat SDK
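
To illustrate how cross-lingual semantic search can work on such an index, here is a hedged sketch of the query side. Because a multilingual model such as BGE-M3 maps text from different languages into one vector space, the query is embedded once and matched against all indexed vectors without any language filter. The `Hit` type and the `title` and `lang` metadata fields are assumptions for this example, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Hit:
    id: str
    score: float
    title: str
    lang: str


def semantic_search(
    query: str,
    embed_one: Callable[[str], Sequence[float]],
    query_index: Callable[[list[float], int], Sequence[Any]],
    top_k: int = 5,
) -> list[Hit]:
    """Embed the query and return the nearest indexed passages, in any language."""
    vector = [float(x) for x in embed_one(query)]
    hits = []
    for result in query_index(vector, top_k):
        # Each result is expected to expose .id, .score and .metadata.
        meta = result.metadata or {}
        hits.append(
            Hit(
                id=result.id,
                score=result.score,
                title=meta.get("title", ""),
                lang=meta.get("lang", ""),
            )
        )
    return hits


def main() -> None:
    # Third-party imports live here so semantic_search works without them installed.
    from sentence_transformers import SentenceTransformer
    from upstash_vector import Index

    model = SentenceTransformer("BAAI/bge-m3")
    index = Index.from_env()  # reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN

    hits = semantic_search(
        "¿Quién inventó la máquina de Turing?",  # a Spanish query can match passages in other languages
        embed_one=lambda text: model.encode(text, normalize_embeddings=True),
        query_index=lambda vector, k: index.query(
            vector=vector, top_k=k, include_metadata=True
        ),
    )
    for hit in hits:
        print(f"{hit.score:.3f} [{hit.lang}] {hit.title}")
```

The query must be embedded with the same model that produced the indexed vectors; otherwise the similarity scores are not meaningful.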

Technologies Used

Upstash Vector: For storing and querying vector embeddings
Upstash Redis: For storing chat sessions
Upstash RAG Chat SDK: For building the RAG Chat application
SentenceTransformers: For generating embeddings
Meta-Llama-3-8B-Instruct: As the LLM, accessed through the QStash LLM APIs
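
For readers unfamiliar with RAG, the following conceptual sketch shows how these pieces typically fit together at answer time: passages retrieved from the vector index and the stored chat history are combined into one prompt for the LLM. In this project the Upstash RAG Chat SDK takes care of that step, so the function below is only an illustration in Python. The name `build_prompt`, the prompt wording, and the five-passage limit are hypothetical and do not reflect the SDK's internals.

```python
from typing import Optional, Sequence


def build_prompt(
    question: str,
    passages: Sequence[str],
    history: Optional[Sequence[tuple[str, str]]] = None,
    max_passages: int = 5,
) -> str:
    """Combine retrieved passages, prior chat turns, and the question into one prompt."""
    context = "\n".join(
        f"[{number}] {passage}"
        for number, passage in enumerate(passages[:max_passages], start=1)
    )
    turns = "\n".join(f"{role}: {message}" for role, message in (history or []))

    sections = [
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.",
        f"Context:\n{context}",
    ]
    if turns:
        sections.append(f"Conversation so far:\n{turns}")
    sections.append(f"Question: {question}")
    return "\n\n".join(sections)
```

For example, `build_prompt("Who was Alan Turing?", passages, history)` would be called with `passages` taken from a vector query and `history` loaded from the session store, and the resulting string would then be sent to the LLM.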

Contributing

We welcome contributions to improve this project. Please feel free to submit issues or pull requests.

Acknowledgements

Wikipedia for providing the dataset
Upstash for their vector database and RAG Chat SDK
All contributors to the open-source libraries used in this project

Contact

For any questions or feedback about the project or Upstash Vector, please reach out to us at (add contact information).

Check out our live demo to see the project in action!