/rag-python

My Python environment


Document indexing

Purpose

Create and populate an Azure Search index with data from PDF files stored in blob storage. Provide a simple chat interface to conduct Q&A with OpenAI and Azure Search and get answers to questions about the stored documents.

Operation

See the VSCode environment setup documentation to prepare your project.

  1. Create a data source in Azure Search to read your PDFs from a blob container
  2. Register a confidential client app in Entra ID and grant it the Search Index Contributor role
  3. Set up your Python environment with this repo (see the VSCode environment setup documentation)
  4. Update the .env file (see below) with your own settings
  5. Execute createIndex.py to create an index, skillset and indexer
  6. Run chatCompletions.py to enter questions and receive answers from OpenAI
  7. Use the Azure portal's Azure Search index view to execute queries

The skillset chunks the PDF documents into pages, hides some PII data, vectorizes the text content and uploads the chunks to a secondary index (the document itself goes to the primary index).
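The index creation in step 5 follows the Azure Search REST pattern (PUT to the indexes collection with an api-key header). A minimal, stdlib-only sketch that builds such a request without sending it; the field list and API version here are illustrative assumptions, the real schema lives in createIndex.py:

```python
import json
import urllib.request

def build_create_index_request(service_name: str, index_name: str, api_key: str):
    """Build (but do not send) a PUT request that would create a minimal index.

    The two-field schema below is an assumption for illustration only.
    """
    url = (f"https://{service_name}.search.windows.net"
           f"/indexes/{index_name}?api-version=2023-11-01")
    index_definition = {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
        ],
    }
    body = json.dumps(index_definition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )

# Sending it (requires a live Search service):
# urllib.request.urlopen(build_create_index_request("mysvc", "docs", "<key>"))
```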

Environment Setup

Portal

Grant the Search Service Contributor role to a service principal and allow the index to use RBAC for authorization (key-based authorization is the default). Enable the Semantic Ranker plan on the index.

Variables

The following environment variables need to be defined in the .env file:

DATA_SOURCE_NAME="<name of an existing Azure Search Data Source>"
INDEX_NAME=<name of Azure Search Index>
INDEXER_NAME="<name of an indexer to create>"
SKILLSET_NAME="<name of a skillset to create>"
SEARCH_SERVICE_NAME="<name of Azure Search Service>"
AZURE_OPENAI_ENDPOINT=https://<your ep>.openai.azure.com/
GTP_DEPLOYMENT="gpt-35-turbo-16k"
EMBEDDINGS_MODEL=<embedding deployment name>
OPENAI_API_KEY=...
SEARCH_API_KEY="<Search API key>"
AI_SERVICE_KEY="<API key for Azure Cognitive Service>"
AZURE_TENANT_ID="<Entra tenant id>"
AZURE_CLIENT_ID="<Confidential client id>"
AZURE_CLIENT_SECRET="<Confidential client secret>"

PIP Installs

See requirements.txt

pip install -r requirements.txt

Terminal env

python -m venv .venv

Install Tesseract for Confluence document reading

If you plan to read Confluence data, see the linked error explanation and installer.

Install the Tesseract .exe for Confluence data reading. Make sure to add its install location to the PATH environment variable:

C:\Users\mrochon\AppData\Local\Programs\Tesseract-OCR
pip install pytesseract Pillow

Code examples

Source                      Comments
createIndex.py              Create a new data source, index, skillset and indexer
chatCompletions.py          Simple REST-based chat completion
chatCompletionsStream.py    REST-based chat completion with response streaming
---other---
confluenceDocReader.py      Reads data from Confluence
chunkRecursive.py           Break text into chunks using recursive chunking
chunkText.py                Break text into chunks using semantic chunking
vectorize.py                Create embedding vectors from text
createIndex.py              Create an Azure Search index using a REST call
loadSampleDocs.py           Load sample docs to an Azure Search index (chunk, vectorize, upload) using REST calls
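The recursive chunking used by chunkRecursive.py can be sketched as: split on the coarsest separator first, and re-split any piece that is still too long with the next, finer separator. The separator order and chunk size below are assumptions, not the script's actual settings:

```python
def recursive_chunk(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text into chunks no longer than max_len.

    Tries the coarsest separator first (paragraphs), then falls back
    to lines, sentences, and finally words for oversized pieces.
    """
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunk(piece, max_len, rest))
    return chunks
```

Paragraph boundaries are preferred because they tend to keep semantically related sentences in the same chunk, which improves retrieval quality.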

References:

  1. Krystian Safjan's Chunking strategies
  2. Carlo C. Chunking strategies
  3. OpenAI REST API
  4. Py app sample

Environment

Use a local Python environment (@command:python.createEnvironment).

This could be done as a Git Codespace, except that Tesseract requires its own .exe, which would need a new image.

py -m venv C:\Users\mrochon\source\repos\python 

Vision sample

visionCaption.py uses the Azure Vision 4.0 REST API to create captions for objects found in a picture. It then sorts these captions by 'significance': the area of the object multiplied by (1 + recognition confidence), an arbitrary way of increasing the weight of the confidence level. Below is the list it produced.

Some comments:

  1. Brand extraction does not seem to be supported in 4.0; it requires 3.2 and does not seem very reliable.
  2. It is possible to train the model with your own brands.
  3. Do not use blob URLs with SAS tokens - you will get a misleading error message (wrong API key or API version).
  4. There are two Vision services exposed in the marketplace: Custom Vision and Azure Vision. The former allows model training. Same API.

(The number is the captioned object's area * (1 + recognition confidence))

  • 3540664.4424562454 a white t-shirt with a logo on it
  • 3323911.0958576202 a white shirt with a logo on it
  • 1638713.3676481247 a white t-shirt with a logo on it
  • 84935.92342960835 a close up of a logo
  • 52640.99776518345 a wooden object with a black background
  • 47312.44461965561 a close-up of a sign
  • 22299.354930639267 a close up of a sign
  • 21066.452381253242 a close up of a colorful square
  • 9098.866596221924 a blue square with black lines
  • 8100.673599243164 a close up of an orange square
  • 1749111.6523742676 a can of paint with a white label
  • 1163405.9780507088 a can of paint with a label
  • 158812.61454582214 a close-up of a silver plate
  • 81225.97669053078 a close up of a logo
  • 65803.13216209412 a close up of a sign
  • 46963.74707400799 a blue letter on a white surface
  • 34378.030671179295 a blue shield with white text
  • 32410.553058743477 a close up of a label
  • 10311.295795440674 a blue sign with white letters
  • 8493.588054478168 a letter on a white surface
  • 656463.189125061 a screwdriver with yellow handle
  • 507116.565787375 a screwdriver with a yellow handle
  • 621655.7550430298 a blue machine with a fan
  • 518628.0614397526 a blue fan with a black circle
  • 99585.67106813192 a blue box with metal grate
  • 93088.14066690207 a close-up of a vent