- Ability to upload large files into a database -DONE
- While generating the answer, reference multiple uploaded files. The response must draw on the 5 uploaded files whose content is most similar to the query. -DONE
- Read about Chunking (for uploading large amounts of data) -DONE
- Read about background tasks (implement background tasks) -DONE
- Chunking embedding refers to the process of breaking down large pieces of text into smaller, more manageable chunks before generating embeddings for each chunk. Embeddings are vector representations of text used in natural language processing (NLP) models to capture the semantic meaning of the text.
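The chunking step described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `chunk_text`, the word-based splitting, and the default `chunk_size` and `overlap` values are all assumptions chosen for the example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so the words around a chunk
    boundary appear in both neighbouring chunks.
    NOTE: hypothetical helper; names and defaults are illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk would then be passed to the embedding model separately, producing one vector per chunk. Splitting on words keeps the sketch dependency-free; a real pipeline might split on sentences or tokens instead.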
-
Prerequisites
- Python 3.7+ installed on your system.
- PostgreSQL installed and running.
- Virtual environment (optional but recommended).
-
Clone the Repository
Clone the project repository to your local machine:
git clone <repository-url>
cd <repository-directory>
-
Set Up a Virtual Environment (Optional)
Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
-
Install Required Packages
Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
If requirements.txt is not available, install the dependencies manually:
pip install fastapi uvicorn aiofiles psycopg2-binary sentence-transformers scikit-learn PyPDF2 transformers python-multipart
-
Set Up PostgreSQL Database
- To get PostgreSQL up and running, go to the pdf_management directory and run:
docker-compose up -d
- Create a table named pdf_embeddings with the following structure:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE pdf_embeddings (
    ID SERIAL PRIMARY KEY,
    filename TEXT UNIQUE,
    embeddings VECTOR(384) -- Adjust the dimensionality based on your embeddings
);
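To illustrate how the stored embeddings can be used to pick the 5 most similar files for a query, here is a minimal pure-Python sketch of cosine-similarity ranking. The function names `cosine_similarity` and `top_k_files`, and the in-memory dictionary standing in for rows of pdf_embeddings, are assumptions made for the example; the application may instead rank inside PostgreSQL, for instance with pgvector's `<=>` cosine-distance operator.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_files(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank stored embeddings against the query, most similar first.

    `stored` maps filename -> embedding and is a hypothetical stand-in
    for the rows of the pdf_embeddings table.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings here would have 384 dimensions.
    files = {"a.pdf": [1.0, 0.0], "b.pdf": [0.0, 1.0], "c.pdf": [1.0, 1.0]}
    for name, score in top_k_files([1.0, 0.0], files, k=2):
        print(name, round(score, 3))
```

Ranking in Python is easy to follow but loads every vector into memory; pushing the ordering into the database scales better once many files are stored.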
-
Run the Application
Start the FastAPI server:
uvicorn main:app --reload
- Access the application at http://127.0.0.1:8000.
- API documentation will be available at http://127.0.0.1:8000/docs.
-
Here is the flow of the application's APIs: