/search-ml

Repo for ML, Chatbots, Semantic Search Demos

Primary LanguageJupyter Notebook

MongoDB Balance Sheet Analysis

This Jupyter Notebook provides a step-by-step guide to analyze a PDF file containing financial data, specifically a MongoDB balance sheet. The notebook utilizes various Python libraries to extract structured information from the PDF, process it, and perform text-based queries on the data.

Table of Contents

  1. Introduction
  2. Setup
  3. PDF Extraction and Processing
  4. Text Analysis
  5. Questions and Answers

1. Introduction

The objective of this notebook is to extract financial information from a MongoDB balance sheet PDF, convert it into structured data, and then perform text-based queries on the extracted information.

2. Setup

Before running the code, ensure that you have the required Python libraries installed. The following libraries can be installed using the provided code cell:

!pip3 install unstructured unstructured-inference langchain numpy tesseract cohere chromadb

3. PDF Extraction and Processing

In this section, the PDF is analyzed to extract table structures and other relevant elements. The key steps include:

  • Installing necessary libraries.
  • Specifying the PDF file path.
  • Defining extraction parameters such as the strategy and model.
  • Extracting elements from the PDF.
  • Storing the results in a JSON file.

4. Text Analysis

This section focuses on processing the JSON file containing extracted data. The steps involve:

  • Reading the JSON file.
  • Extracting table elements.
  • Writing the extracted elements to a text file for further analysis.

5. Questions and Answers

The final section utilizes the processed text data to answer specific questions related to the MongoDB balance sheet. The steps include:

  • Loading the text data.
  • Splitting the text into manageable chunks.(Right now chunks are large and can be optimized further)
  • Performing text embeddings using Cohere.
  • Building a retrieval-based question-answering model.
  • Asking questions and obtaining answers based on the processed data.

Example Questions:

  1. "What are the current assets of MongoDB?"
  2. "What is the total cash flow from investing activities of MongoDB?"
  3. "How many total Atlas Customers MongoDB has in July 2023?"
  4. "What was the gross margin in July 2023?"

The notebook generates answers to these questions and provides them as output.


Note: Make sure to replace <COHERE_API_KEY> with your actual Cohere API key in the notebook's code where required.

For any questions or issues, please refer to the code comments or documentation of the respective libraries used in this notebook.