The aim of this project is to build a tool that can:
- Search for answers in one or more uploaded PDFs
- Filter the search on specific page components (e.g. headers, subheaders, tables, paragraphs)
- Filter the search on specific pages
- Show the search results as PDF annotations
- Summarize the answers for the questions in a concise manner
The project will use Azure Form Recognizer to extract the text from the PDFs and then use a search engine to search for the answers. The search engine will be built using Azure Cognitive Search. The summarization will be done using a large language model from Azure OpenAI Service.
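The page and page-component filters listed above could be expressed as an OData filter string passed to Azure Cognitive Search. The following is a minimal sketch, not the project's implementation: the helper name build_filter and the index field names page_number and component are hypothetical, and the component field is assumed to be a filterable string field.

```python
from typing import Iterable, Optional


def build_filter(pages: Optional[Iterable[int]] = None,
                 components: Optional[Iterable[str]] = None,
                 page_field: str = "page_number",     # hypothetical field name
                 component_field: str = "component",  # hypothetical field name
                 ) -> Optional[str]:
    """Return an OData filter string, or None when nothing is selected."""
    clauses = []

    page_list = sorted(set(pages or []))
    if page_list:
        # Numeric fields are compared with `eq`; alternatives are joined with `or`.
        page_clause = " or ".join(f"{page_field} eq {int(p)}" for p in page_list)
        clauses.append(f"({page_clause})")

    component_list = sorted(set(components or []))
    if component_list:
        # search.in matches a string field against a delimited list of values.
        # Single quotes inside a value are escaped by doubling them.
        values = "|".join(c.replace("'", "''") for c in component_list)
        clauses.append(f"search.in({component_field}, '{values}', '|')")

    return " and ".join(clauses) if clauses else None


print(build_filter(pages=[3, 1], components=["table", "paragraph"]))
# (page_number eq 1 or page_number eq 3) and search.in(component, 'paragraph|table', '|')
```

The sketch assumes component names never contain the | delimiter; a different delimiter would be needed otherwise.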
The project is built using Azure Developer CLI (azd). The following is the README from the azd starter project.
A starter blueprint for getting your application up on Azure using Azure Developer CLI (azd). Add your application code and write infrastructure-as-code assets in Bicep to get your application up and running quickly.
The following assets have been provided:
- Infrastructure-as-code (IaC) Bicep files under the infra folder that demonstrate how to provision resources and set up resource tagging for azd.
- A dev container configuration file under the .devcontainer directory that installs infrastructure tooling by default. This can be readily used to create cloud-hosted developer environments such as GitHub Codespaces.
- Continuous deployment workflows for CI providers such as GitHub Actions under the .github directory, and Azure Pipelines under the .azdo directory, that work for most use cases.
- app : Streamlit application host.
- infra : Bicep files for provisioning Azure resources.
- scripts : PowerShell and shell scripts.
- data : Files stored for the application.
- User opens the Streamlit application.
- User sees Streamlit application UI.
- User uploads PDF file(s) to the application.
- File processing is triggered.
- A directory with the same name as the file is created in Azure Blob Storage to store the files.
- The original PDF file is saved to Azure Blob Storage under FileName/original
- The PDF file is divided into pages (JPGs) and saved to Azure Blob Storage under FileName/pages
- Text extraction is triggered.
- Text extraction is done with the Azure Form Recognizer Layout API.
- The original PDF file is sent to Azure Form Recognizer to extract the text.
- Form Recognizer returns the following types of data:
- Text per page
- Paragraphs in the document, along with their bounding-box (x, y) coordinates
- Tables in the document, along with their bounding-box (x, y) coordinates
- Each of the above data is saved to Azure Blob Storage under FileName/extraction.
- Azure Cognitive Search is triggered.
- Three indexes are created in Azure Cognitive Search for each uploaded document:
- Text per page
- Paragraphs in the document, along with their bounding-box (x, y) coordinates
- Tables in the document, along with their bounding-box (x, y) coordinates
- The data is indexed in the above indexes.
- The schema is created for the above indexes.
- After steps 2, 3, and 4 are done, the user is shown the search UI.
- User can search for a keyword in the search bar.
- User can filter the search on the following:
- Page
- Page component (e.g. header, subheader, paragraph, table)
- Tables
- Search and OpenAI summarization are triggered.
- The search is done on the Azure Cognitive Search indexes.
- The search results are shown to the user as PDF annotations.
- The search results are summarized using Azure OpenAI service.
- Search results are shown to the user.
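Showing results as PDF annotations requires mapping the bounding boxes stored at extraction time into PDF page coordinates. The following is a minimal sketch under stated assumptions, not the project's implementation: it assumes each bounding box is a list of (x, y) points in inches measured from the top-left corner of the page, and that the annotation rectangle is wanted in PDF points measured from the bottom-left corner. The helper name polygon_to_pdf_rect is hypothetical.

```python
from typing import List, Sequence, Tuple

POINTS_PER_INCH = 72.0  # default PDF user space: 1 point = 1/72 inch


def polygon_to_pdf_rect(polygon: Sequence[Tuple[float, float]],
                        page_height_in: float) -> List[float]:
    """Convert a top-left-origin polygon in inches to [x0, y0, x1, y1] in PDF points."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x0 = min(xs) * POINTS_PER_INCH
    x1 = max(xs) * POINTS_PER_INCH
    # Flip the y axis: PDF measures y upwards from the bottom edge of the page.
    y0 = (page_height_in - max(ys)) * POINTS_PER_INCH
    y1 = (page_height_in - min(ys)) * POINTS_PER_INCH
    return [x0, y0, x1, y1]


# A paragraph box on a hypothetical 8.5 x 11 inch page.
rect = polygon_to_pdf_rect(
    [(1.0, 1.0), (3.0, 1.0), (3.0, 1.5), (1.0, 1.5)],
    page_height_in=11.0,
)
print(rect)  # [72.0, 684.0, 216.0, 720.0]
```

Some PDF libraries expose a top-left origin in their own API; in that case the y-axis flip should be skipped and only the unit conversion applied.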
- Initialize the service source code projects anywhere under the current directory. Ensure that all source code projects can be built successfully.
- Note: For function services, it is recommended to initialize the project using the provided quickstart tools.
- Once all service source code projects are building correctly, update azure.yaml to reference the source code projects.
- Run azd package to validate that all service source code projects can be built and packaged locally.
Update or add Bicep files to provision the relevant Azure resources. This can be done incrementally, as the list of Azure resources is explored and added.
- A reference library that contains all of the Bicep modules used by the azd templates can be found here.
- All Azure resources available in Bicep format can be found here.
Run azd provision whenever you want to ensure that changes made are applied correctly and work as expected.
Certain changes to Bicep files or deployment manifests are required to tie the application and infrastructure together. For example:
- Set up application settings for the code running in Azure to connect to other Azure resources.
- If you are accessing sensitive resources in Azure, set up managed identities to allow the code running in Azure to securely access the resources.
- If you have secrets, it is recommended to store secrets in Azure Key Vault that then can be retrieved by your application, with the use of managed identities.
- Configure host configuration on your hosting platform to match your application's needs. This may include networking options, security options, or more advanced configuration that helps you take full advantage of Azure capabilities.
For more information, see the additional details below.
When changes are made, use azd to validate and apply your changes in Azure, to ensure that they are working as expected:
- Run azd up to validate both infrastructure and application code changes.
- Run azd deploy to validate application code changes only.
Finally, run azd up to run the end-to-end infrastructure provisioning (azd provision) and deployment (azd deploy) flow. Visit the service endpoints listed to see your application up and running!
The following section examines different concepts that help tie the application and infrastructure together.
It is recommended to have application settings managed in Azure, separating configuration from code. Typically, the service host allows for application settings to be defined.
- For appservice and function, application settings should be defined on the Bicep resource for the targeted host. Reference template example here.
- For aks, application settings are applied using deployment manifests under the <service>/manifests folder. Reference template example here.
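Inside the application, settings defined on the host typically surface as environment variables. The following is a minimal sketch of reading them once at startup and failing early when one is missing. The setting names (AZURE_STORAGE_ACCOUNT, AZURE_SEARCH_SERVICE, AZURE_OPENAI_SERVICE) and the AppSettings class are hypothetical and are not taken from this project's Bicep files.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AppSettings:
    storage_account: str
    search_service: str
    openai_service: str


# Hypothetical setting names; use whatever names the Bicep files actually define.
REQUIRED = {
    "storage_account": "AZURE_STORAGE_ACCOUNT",
    "search_service": "AZURE_SEARCH_SERVICE",
    "openai_service": "AZURE_OPENAI_SERVICE",
}


def load_settings(env: Mapping[str, str] = os.environ) -> AppSettings:
    """Build AppSettings from a mapping, raising if any setting is missing or empty."""
    missing = sorted(name for name in REQUIRED.values() if not env.get(name))
    if missing:
        raise RuntimeError("Missing application settings: " + ", ".join(missing))
    return AppSettings(**{field: env[name] for field, name in REQUIRED.items()})


# Example with an explicit mapping instead of the real process environment.
demo = load_settings({
    "AZURE_STORAGE_ACCOUNT": "examplestorage",
    "AZURE_SEARCH_SERVICE": "examplesearch",
    "AZURE_OPENAI_SERVICE": "exampleopenai",
})
print(demo.search_service)  # examplesearch
```

Calling load_settings() with no argument reads the process environment, which keeps configuration out of the code as recommended above.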
Managed identities allow you to secure communication between services without having to manage any credentials yourself.
Azure Key Vault allows you to store secrets securely. Your application can access these secrets securely through the use of managed identities.
For appservice, the following host configuration options are often modified:
- Language runtime version
- Exposed port from the running container (if running a web service)
- Allowed origins for CORS (Cross-Origin Resource Sharing) protection (if running a web service backend with a frontend)
- The run command that starts up your service
- Azure Developer CLI
- Python 3+
- Important: Python and the pip package manager must be in the path in Windows for the setup scripts to work.
- Important: Ensure you can run python --version from console. On Ubuntu, you might need to run sudo apt install python-is-python3 to link python to python3.
- Node.js
- Git
- PowerShell 7+ (pwsh) - For Windows users only.
- Important: Ensure you can run pwsh.exe from a PowerShell command. If this fails, you likely need to upgrade PowerShell.
- PowerShell for Mac/Linux (pwsh)
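The prerequisites above can be checked before running any setup script. The following is a minimal sketch that looks up each executable on the PATH; the helper names find_tools and missing_tools are hypothetical, and the check only confirms that a tool is present, not that its version is sufficient.

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Executable names for the tools listed above.
REQUIRED_TOOLS = ("azd", "python", "node", "git", "pwsh")


def find_tools(tools: Iterable[str] = REQUIRED_TOOLS,
               which: Callable[[str], Optional[str]] = shutil.which,
               ) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on the PATH."""
    return {tool: which(tool) for tool in tools}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted names of tools that could not be resolved."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_tools()
    for name, path in found.items():
        print(f"{name:8} {path or 'NOT FOUND'}")
```

The which parameter exists only so the lookup can be replaced in tests; by default it uses shutil.which from the standard library.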
- Install the Azure CLI
- Run azd init -t azure-search-openai-demo
- Run azd env refresh -e {environment name}
- Note that they will need the azd environment name, subscription Id, and location to run this command; you can find those values in your ./azure/{env name}/.env file. This will populate their azd environment's .env file with all the settings needed to run the app locally.
- Run pwsh ./scripts/roles.ps1
- This will assign all of the necessary roles to the user so they can run the app locally. If they do not have the necessary permission to create roles in the subscription, then you may need to run this script for them. Just be sure to set the AZURE_PRINCIPAL_ID environment variable in the azd .env file or in the active shell to their Azure Id, which they can get with az account show.
- az ad signed-in-user show
NOTE: Your Azure Account must have Microsoft.Authorization/roleAssignments/write permissions, such as User Access Administrator or Owner.
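When running locally, it can help to confirm which values ended up in the azd .env file, for example whether AZURE_PRINCIPAL_ID is set. The following is a minimal sketch of a parser for simple KEY=VALUE lines; the function name parse_env and the sample values are hypothetical, and the parser does not handle escape sequences or multi-line values.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Sample content with made-up values.
sample = """\
AZURE_ENV_NAME="demo"
AZURE_LOCATION="eastus"
# a comment line
AZURE_PRINCIPAL_ID=
"""
settings = parse_env(sample)
print(settings)
# {'AZURE_ENV_NAME': 'demo', 'AZURE_LOCATION': 'eastus', 'AZURE_PRINCIPAL_ID': ''}
```

An empty AZURE_PRINCIPAL_ID, as in the sample, would indicate that the variable still needs to be set before running the roles script.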