SupaSeqs is a tool that can be used to manage DNA sequences databases locally, thanks to the PostgreSQL implementation offered by Supabase.
It leverages PostgreSQL as backend database manager, kmer-based vectorization and vector search to mimic the functionalities of BLAST.
If you are working in a Linux environment, you may want to just download/copy setup.sh and launch it:
# Linux
wget https://raw.githubusercontent.com/AstraBert/SupaSeqs/main/scripts/setup.sh
bash setup.sh
Make sure that your environment has:
git
Node v18
or followingnpm
andnpx
python 3.10
or following The installation process should work both on Windows and on Linux.
First of all, clone this repository:
# BOTH Windows and Linux
git clone https://github.com/AstraBert/SupaSeqs
cd SupaSeqs
Get the supabase
command line executables:
# BOTH Windows and Linux
npm install supabase
Create and start a Supabase instance:
# BOTH Windows and Linux
npx supabase init
npx supabase start
Retrieve the connection string from the DB URL
that will be printed after this command:
# BOTH Windows and Linux
npx supabase status
Create a virtual environment, activate it and install the necessary dependencies:
# Linux
python3 -m venv apienv
source apienv/bin/activate
python3 -m pip install -r requirements.txt
Or
# Windows
python3 -m venv .\apienv
.\apienv\Scripts\activate # For Command Prompt
# or
.\apienv\Activate.ps1 # For PowerShell
python3 -m pip install -r .\requirements.txt
Within the virtual environment, run:
# BOTH Windows and Linux
cd scripts
python3 -m fastapi dev
If there are problems with the connection to the Supabase client, make sure to replace the connection string in line of 16 main.py
with the one you found running supabase status
.
The application works as an API service, leveraging FastAPI. The connection to Supabase is handled via a sqlalchemy
implementation of a client which is similar to the one built in the vecs
library.
The application accepts two request types:
1- POST - Upload a sequence or a FASTA file:
# Single sequence
curl -X POST "http://127.0.0.1:8000/seqs/" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"sequence\":\"GGCAGAACCCAGGGCACCAGCACGCCGAAGGACCACCGCAGGCTGGCCAGCGCTCCACCCTCCCTGCACCACACCCTGCGAGCAAAAGGCAGCAGAAATGAAGAGCATTTACTTTGTGGCTGGATTGTTTGTAATGCTGGTACAAGGCAGCTGGCAACACCCACTTCAAGACACAGAGGAAAAACCCAGGTCTTTCTCAACTTCTCAAACAGACTTGCTTGATGATCCGGATCAGATGAATGAAGACAAGCGTCATTCACAGGGTACATTCACCAGTGACTACAGCAAGTTCCTCGACACCAGGCGTGCTCAAGACTTCTTGGATTGGCTGAAGAACACCAAGAGGAACAGGAATGAAAT\", \"description\": \"M57688.1 Octodon degus glucagon mRNA, complete cds\"}"
# FASTA file
curl -X POST "http://127.0.0.1:8000/seqs/" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"sequence\": \"sequence.fasta\"}"
Each sequence gets vectorized with a 5-mer-based representation (a 1024-dim array), which is then uploaded to the sequences
table on Supabase along with a description (if provided in the case of the single sequence, the headers of the sequences for those in a FASTA file) and the original sequence.
2- GET - Search through the sequence database
curl -X 'GET' 'http://localhost:8000/seqs/AACTTCTCAAACAGACTTGCTTGATGATCCGGATCAGATGAATGAAGACAAGCGTCATTCACAGGGTACATTCACCAGTGACTACAGCAAGTTCCTCGACACCAGGCGTGCTCAAGACTTCTTGGATTGGCTGAAGAACACCAAGAGGAACAGGAATGAAAT?limit=100&threshold=75' -H 'accept: application/json'
The query sequence gets vectorized and the database is searched: a number of sequences (specified with the limit key, maximum is 1000) is returned if they are compliant with a similarity threshold (specified as a percentage value with the threshold key); the typical response looks like this:
{"1":{"sequence":"GGCAGAACCCAGGGCACCAGCACGCCGAAGGACCACCGCAGGCTGGCCAGCGCTCCACCCTCCCTGCACCACACCCTGCGAGCAAAAGGCAGCAGAAATGAAGAGCATTTACTTTGTGGCTGGATTGTTTGTAATGCTGGTACAAGGCAGCTGGCAACACCCACTTCAAGACACAGAGGAAAAACCCAGGTCTTTCTCAACTTCTCAAACAGACTTGCTTGATGATCCGGATCAGATGAATGAAGACAAGCGTCATTCACAGGGTACATTCACCAGTGACTACAGCAAGTTCCTCGACACCAGGCGTGCTCAAGACTTCTTGGATTGGCTGAAGAACACCAAGAGGAACAGGAATGAAAT","description":"M57688.1 Octodon degus glucagon mRNA, complete cds","cos_dist":0.23987939711631145}}
This is accomplished thanks to a function called match_page_sections
and defined as follows:
create or replace function public.match_page_sections (
embedding vector(1024),
match_threshold float,
match_count int
)
returns setof public.sequences
language sql
as $$
select *
from public.sequences
where public.sequences.embedding <=> embedding < 1 - match_threshold
order by public.sequences.embedding <=> embedding asc
limit least(match_count, 1000);
$$;
Contributions are more than welcome! See contribution guidelines for more information :)
If you found this project useful, please consider to fund it and make it grow: let's support open-source together!😊
This project is provided under MIT license: it will always be open-source and free to use.
If you use this project, please cite the author: Astra Clelia Bertelli