This project automatically loads data from a Notion database into a vector database. It queries Notion databases, converts pages to Markdown, splits them by headers, and turns the splits into embeddings. It is particularly useful for applications that need vector representations of text.
This project is currently under development and not suitable for production use. Please note that the vector database resets on each restart, and embeddings are recreated.
- Environment Variables: Manage configurations via environment variables.
- Notion Database Querying: Fetch page IDs from a specified Notion database.
- Markdown Export: Convert Notion pages to Markdown files.
- Text Splitting: Break down documents into manageable sections.
- Embedding Generation: Generate embeddings for textual content.
- Vector Store Management: Process and save documents in the vector store (FAISS).
- API Endpoint: Trigger ingestion through the `/ingest` API endpoint.
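As a rough illustration of the pipeline these features describe, the core ingestion step might look like the LangChain-based sketch below. The repository's actual code may differ, module paths vary between LangChain versions, and the `page_markdown` value and header mapping here are placeholders:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Markdown export of a Notion page (placeholder; produced by the export step).
page_markdown = "# Title\n\nIntro text.\n\n## Section\n\nSection text."

# Split the page into header-delimited chunks.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(page_markdown)

# Embed the chunks (uses OPENAI_API_KEY) and persist the FAISS index to disk.
store = FAISS.from_documents(chunks, OpenAIEmbeddings())
store.save_local("faiss")
```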
- Clone this repository.
- Install dependencies:
```bash
pip install -r requirements.txt
```
You need to set the following environment variables:
- `NOTION_TOKEN`: Notion token for authentication.
- `NOTION_DATABASE_ID`: ID of the Notion database.
- `LOG_LEVEL`: Log level (optional, defaults to `INFO`).
- `OPENAI_API_KEY`: OpenAI API key.
- `NOTION_DATABASE_QUERY_FILTER`: JSON filter for querying the Notion database.
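For illustration, a filter that only ingests pages whose `Status` select property equals `Published` could look like the following, using Notion's documented filter format. The property name and value are hypothetical, and whether the variable expects the bare filter object or a full query body depends on the implementation:

```json
{"property": "Status", "select": {"equals": "Published"}}
```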
The application uses environment variables for configuration. You'll find a template for these variables in the `.env.example` file.
Copy the `.env.example` file to a new file named `.env`:
```bash
cp .env.example .env
```
Open the `.env` file in a text editor and update the values to match your configuration.
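For illustration, a filled-in `.env` might look like this (all values below are placeholders, not real credentials):

```
NOTION_TOKEN=secret_xxxxxxxxxxxxxxxxxxxx
NOTION_DATABASE_ID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LOG_LEVEL=INFO
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx
NOTION_DATABASE_QUERY_FILTER={"property": "Status", "select": {"equals": "Published"}}
```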
After setting these variables, run the script with:
```bash
python notion2vector/main.py
```
You can build the Docker image and run the container using the following commands:
```bash
docker build -t notion2vector .
docker run -p 4000:80 --env-file .env notion2vector
```
Once the application is running, it will automatically load the Notion database into the vector database. You can re-trigger the ingestion process by sending a POST request to the `/ingest` endpoint:
```bash
curl -X POST http://localhost:4000/ingest
```
To use the vector store with other applications, you need to create a Docker persistent volume for the `faiss` directory. This ensures that the data remains available between container restarts and can be shared with other containers.
Here's how to set it up:
Create a volume using Docker:
```bash
docker volume create --name=faiss_volume
```
When running the container, mount the volume to the `faiss` directory inside the container:
```bash
docker run -p 4000:80 --env-file .env -v faiss_volume:/app/faiss notion2vector
```
You can now mount this volume in other containers that need to query the FAISS index. Simply use the same volume name and mount it to the appropriate path within the other container:
```bash
docker run -v faiss_volume:/path/in/other/container other-image-name
```
Replace `/path/in/other/container` with the appropriate path inside the other container where you want the FAISS data to be accessible.
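As a minimal sketch of what such a consumer might look like, assuming it uses LangChain's FAISS wrapper and the same embedding model as at ingest time (the exact loading code and module paths may differ by version):

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load the index from the mounted volume path. Recent langchain-community
# releases require explicitly opting in to pickle deserialization.
store = FAISS.load_local(
    "/path/in/other/container",  # the mount path chosen above
    OpenAIEmbeddings(),          # must match the model used to build the index
    allow_dangerous_deserialization=True,
)

# Query the shared index.
for doc in store.similarity_search("your query", k=4):
    print(doc.page_content)
```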
Make sure the application is configured to use the `faiss` directory for storing the vector data, and that the directory permissions are set appropriately for the container user.
Planned improvements:

- Clean up page content.
- Process only updated pages.
- Remove the LangChain dependency.
- Add more settings for Markdown splitting.
- Support more vector stores.
Feel free to open issues or submit pull requests if you find any bugs or have suggestions for improvements.
This project is licensed under the MIT License.