/rag-python

My Python environment


Document indexing

Purpose

Create and populate an Azure Search index with data from PDF files stored in blob storage. Provide a simple chat interface to conduct Q&A with OpenAI and Azure Search and get answers to questions about the stored documents.

Operation

See the VSCode environment setup documentation to prepare your project.

  1. Create a data source in Azure Search to read your PDFs from a blob container
  2. Register a confidential client app in Entra ID and grant it the Search Index Contributor role
  3. Set up your Python environment with this repo (see the VSCode environment setup documentation)
  4. Update the .env file (see below) with your own settings
  5. Execute createIndex.py to create an index, skillset and indexer
  6. Run chatCompletions.py to enter questions and receive answers from OpenAI
  7. Use the Azure portal's Azure Search index view to execute queries

The skillset chunks the PDF documents into pages, hides some PII data, vectorizes the text content and uploads the chunks to a secondary index (the document itself goes to the primary index).
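The index creation in step 5 follows the Azure Search REST pattern (PUT to the indexes collection with an api-key header). A minimal, stdlib-only sketch that builds such a request without sending it; the field list and API version here are illustrative assumptions, the real schema lives in createIndex.py:

```python
import json
import urllib.request

def build_create_index_request(service_name: str, index_name: str, api_key: str):
    """Build (but do not send) a PUT request that would create a minimal index.

    The two-field schema below is an assumption for illustration only.
    """
    url = (f"https://{service_name}.search.windows.net"
           f"/indexes/{index_name}?api-version=2023-11-01")
    index_definition = {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
        ],
    }
    body = json.dumps(index_definition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )

# Sending it (requires a live Search service):
# urllib.request.urlopen(build_create_index_request("mysvc", "docs", "<key>"))
```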

Environment Setup

Portal

Grant the Search Service Contributor role to a service principal and allow the index to use RBAC for authorization (key-based authorization is the default). Enable the Semantic Ranker plan on the index.

Variables

The following environment variables need to be defined in the .env file:

DATA_SOURCE_NAME="<name of an existing Azure Search Data Source>"
INDEX_NAME=<name of Azure Search Index>
INDEXER_NAME="<name of an indexer to create>"
SKILLSET_NAME="<name of a skillset to create>"
SEARCH_SERVICE_NAME="<name of Azure Search Service>"
AZURE_OPENAI_ENDPOINT=https://<your ep>.openai.azure.com/
GTP_DEPLOYMENT="gpt-35-turbo-16k"
EMBEDDINGS_MODEL=<embedding deployment name>
OPENAI_API_KEY=...
SEARCH_API_KEY="<Search API key>"
AI_SERVICE_KEY="<API key for Azure Cognitive Service>"
AZURE_TENANT_ID="<Entra tenant id>"
AZURE_CLIENT_ID="<Confidential client id>"
AZURE_CLIENT_SECRET="<Confidential client secret>"

PIP Installs

See requirements.txt

pip install -r requirements.txt

Terminal env

python -m venv .venv

Install Tesseract for Confluence document reading

If you plan to read Confluence data, see the linked error explanation and installer.

Install the Tesseract .exe for Confluence data reading. Make sure to add its install location to the PATH environment variable:

C:\Users\mrochon\AppData\Local\Programs\Tesseract-OCR
pip install pytesseract Pillow

Code examples

Source                      Comments
createIndex.py              Create a new data source, index, skillset and indexer
chatCompletions.py          Simple REST-based chat completion
chatCompletionsStream.py    REST-based chat completion with response streaming
---other---
confluenceDocReader.py      Reads data from Confluence
chunkRecursive.py           Break text into chunks using recursive chunking
chunkText.py                Break text into chunks using semantic chunking
vectorize.py                Create embedding vectors from text
createIndex.py              Create an Azure Search index using a REST call
loadSampleDocs.py           Load sample docs to an Azure Search index (chunk, vectorize, upload) using REST calls
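The recursive chunking used by chunkRecursive.py can be sketched as: split on the coarsest separator first, and re-split any piece that is still too long with the next, finer separator. The separator order and chunk size below are assumptions, not the script's actual settings:

```python
def recursive_chunk(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text into chunks no longer than max_len.

    Tries the coarsest separator first (paragraphs), then falls back
    to lines, sentences, and finally words for oversized pieces.
    """
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunk(piece, max_len, rest))
    return chunks
```

Paragraph boundaries are preferred because they tend to keep semantically related sentences in the same chunk, which improves retrieval quality.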

References:

  1. Krystian Safjan's Chunking strategies
  2. Carlo C. Chunking strategies
  3. OpenAI REST API
  4. Py app sample

Environment

Use a local Python environment (@command:python.createEnvironment).

This could be done as a Git Codespace, except that Tesseract requires its own .exe, which would need a new image.

py -m venv C:\Users\mrochon\source\repos\python 

Vision sample

visionCaption.py uses the Azure Vision 4.0 REST API to create captions for objects found in a picture. It then sorts these captions by 'significance': the area of the object multiplied by (1 + recognition confidence), an arbitrary way of increasing the weight of the confidence level. Below is the list it produced.

Some comments:

  1. Brand extraction does not seem to be supported in 4.0; it requires 3.2 and does not seem very reliable.
  2. It is possible to train the model with your own brands.
  3. Do not use blob URLs with SAS tokens - you will get a misleading error message (wrong API key or API version).
  4. There are two Vision services exposed in the marketplace: Custom Vision and Azure Vision. The former allows model training. Same API.

(The number is the captioned object's area * (1 + recognition confidence))

  • 3540664.4424562454 a white t-shirt with a logo on it
  • 3323911.0958576202 a white shirt with a logo on it
  • 1638713.3676481247 a white t-shirt with a logo on it
  • 84935.92342960835 a close up of a logo
  • 52640.99776518345 a wooden object with a black background
  • 47312.44461965561 a close-up of a sign
  • 22299.354930639267 a close up of a sign
  • 21066.452381253242 a close up of a colorful square
  • 9098.866596221924 a blue square with black lines
  • 8100.673599243164 a close up of an orange square
  • 1749111.6523742676 a can of paint with a white label
  • 1163405.9780507088 a can of paint with a label
  • 158812.61454582214 a close-up of a silver plate
  • 81225.97669053078 a close up of a logo
  • 65803.13216209412 a close up of a sign
  • 46963.74707400799 a blue letter on a white surface
  • 34378.030671179295 a blue shield with white text
  • 32410.553058743477 a close up of a label
  • 10311.295795440674 a blue sign with white letters
  • 8493.588054478168 a letter on a white surface
  • 656463.189125061 a screwdriver with yellow handle
  • 507116.565787375 a screwdriver with a yellow handle
  • 621655.7550430298 a blue machine with a fan
  • 518628.0614397526 a blue fan with a black circle
  • 99585.67106813192 a blue box with metal grate
  • 93088.14066690207 a close-up of a vent