Dir to Vector Search is a simple script which lets you vectorize a directory of markdown or text files, then search through them using sqlite-vss
.
The script will run through every markdown or text file in the provided directory, and add its content, as well as its embeddings to a sqlite database. The script uses OpenAI embeddings by default, but you can change this in ai.py
. In the future I may add support for multiple types of embeddings if people are interested.
Once you've run the script, you will be able to search the database as a normal database, or using vector similarity search. This means you can use it for things like Retrieval Augmented Generation, or just general search functionality.
To get set up, it's pretty simple. Just provide an OpenAI API Key in the environment. If you need to add an API key, add it on the OpenAI Platform.
Simply clone the .env.example
file in the root directory to a new file called .env
, and fill in the dummy values. The OpenAI API key is required to generate embeddings.
Once you've set that, you may also notice the other values in the .env.example
, which you can also customize based on your preference. Such as:
- "DEFAULT_FILE_LIMIT": The maximum amount of files that the script will collect before starting processes
- "DEFAULT_DELAY_PER_REQUEST": The amount of seconds which the program will wait after processing each file, useful for rate limiting
Don't forget to clone the dependencies in requirements.txt
. I'd recommend setting up a python virtual environment like so:
python -m venv ./dtv-env
Then activating it using the activate script:
source ./dtv-env/bin/activate
And finally running pip install
:
pip install -r requirements.txt
Using the script is as simple as running it with python.
After you've gotten everything setup, you can run the main script like so:
python main.py <path-to-target-directory>
This will create a database with vector search abilities, based on the directory provided. However, there are also some flags that you can use to customize the experience:
-l
,--limit
: The maximum amount of files that the script will collect before starting processes, overridesDEFAULT_FILE_LIMIT
-d
,--delay
: The amount of seconds which the program will wait after processing each file, useful for rate limiting, overridesDEFAULT_DELAY_PER_REQUEST
--no-delay
: Sets the delay to 0s between each request
Once you've generated the database, you can query it with the query.py
script:
python query.py <query>
Keep in mind, by default the vector similarity will be on the content
of each file, however you can switch it to content with the --on
flag:
python query.py <query> --on title
One thing to consider is rate-limiting. I wrote the script based on the official rate limits guidance from OpenAI. This is based on the 'free tier', but if you have a different tier then you can adjust the values accordingly. The possibility of rate limiting increases based on the number of files and the size of your files.