/GPT4oContentExtraction

Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents to Markdown

Primary LanguageJupyter NotebookMIT LicenseMIT

Azure OpenAI GPT-4o Content Extraction

Using Azure OpenAI GPT 4o to extract information such as text, tables and charts from Documents (PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, etc) to Markdown.

There is a lot if information contained within documents such as PDF's, PPT's, and Excel Spreadsheets beyond just text, such as images, tables and charts. The goal of this repo is to show how Azure OpenAI GPT 4o can be used to extract all of this information into a Markdown file to be used for downstream processes such as RAG (Chat on your Data) or Workflows.

Here is an example slide from the included PPT.

Original Slide

When converted to Markdown, notice how the charts are converted to Markdown tables which are easily understandable by Azure OpenAI GPT4. Output Markdown

Requirements

  • Azure OpenAI with GPT 4o enabled
  • Linux (Ubuntu) based Jupyter Notebook
  • (Optional) Azure AI Search - To test the ability to answer questions
  • (Optional) LibreOffice - IF you wish to support file types other than PDF

Processing Pipeline

Processing Pipeline

Geting Started

  1. Ensure you have installed requirements.txt
pip install -r requirements.txt
  1. Install LibreOffice by running libreoffice.ipynb

  2. Configure config.json with your Azure Service settings

  3. Convert the included sample PPT file by running convert-doc-to-markdown.ipynb. This will convert each page to a set of Markdown files.

(Optional Steps)

  1. Create an Azure AI Search Index to use for RAG based Chat over this content by running index-to-azure-ai-search.ipynb

  2. Perform a test RAG query by running test-query.ipynb

Test Query