Document summarizing engine

Query engine for a bunch of resume (.pdf)
Summarise huge document(s) (.docx) and output into a given datastructure

Structure

Copy the required documents to their appropriate folders,

Resume files in .pdf format to documents/resume
Report files in .docx format to documents/reports

main.py #Runs the dev server
document_summarizer.py #loads documents, splits, initialises vectordb, and queries for documents.
documents
- resume/*.pdf #For prototype 1, query for files based on criteria
- reports/*.docx #For prototype 2, summarising huge documents

Running it locally

Ideally, create a virtual environment by running
python -m venv .venv

python -m venv .venv

pip install -r requirements.txt
python main.py

Runs on http://127.0.0.1:8095

API

Request

METHOD: POST
HOST:         http://127.0.0.1:8095
Content-Type: application/json
RAW-BODY:

{
    "query": String,
    "doc_type": "resume" | "document"
}

Response

This structure works for 1st one, for summarising resume. The Second one, for summarising huge documents, needs more work to output the required data structure for consumption/generating a file in the requird format.

{
    "Match result criteria": String,
    "Query": String,
    "File": String, #Filename 
    "Reference": { 
        "Page Number": Number,
        "Content": String
    }
}

Example Request

{
    "query": "which resume would be a good fit for a marketing job?",
    "doc_type": "resume"
}

Example Response

{
    "File": "documents/resume/....-Resume.pdf",
    "Match result criteria": "Results -driven and motivated marketing professional with one year of experience and a recent Bachelor of Commerce degree",
    "Query": "which resume would be a good fit for a marketing job?",
    "Reference": {
        "Content": ".....",
        "Page Number": 0
    }
}

TODO (AFAIK)

Common

Play around with the pipeline parameters (temperature, top_p, etc) to tweak the results huggingface_pipelines

Second prototype - Summarising .docx files

Internally build a function to query the required fields for the output.
Return the required datastructure.

iamjayk/document-summarizer

Document summarizing engine

Query engine for a bunch of resume (.pdf)

Summarise huge document(s) (.docx) and output into a given datastructure

Structure

Running it locally

API

Request

Response

Example Request

Example Response

TODO (AFAIK)

Common

Second prototype - Summarising .docx files