Interactive PDF Analysis and Extraction Tool

Overview

This project provides an interactive web-based tool that allows users to upload PDF documents, automatically extract tables and textual content, and interactively chat with the extracted data for insights. It aims to simplify the process of analyzing complex PDF documents such as bills or annual reports by offering automated extraction, visualization, and interactive querying.

Live Interaction

To see the tool in action, check out the live demo on Streamlit. This demo showcases the interactive features of the app, including PDF uploading, data extraction, and chat functionality.

Features

PDF Upload: Easily upload PDF files through a web interface.
Automated Extraction: Extract text and tables from the uploaded PDFs.
Data Processing: Organize extracted data in a structured format for efficient querying.
Interactive Chat: Engage with the extracted data using a natural language chatbot.
Data Visualization: Generate and interact with visual representations of the extracted tables.
User Feedback: Provide feedback to improve the tool's accuracy and usability.

Installation

Prerequisites

Python 3.8+
pip (Python package installer)

Libraries and Tools

streamlit
PyMuPDF
pdfplumber
tabula-py
pandas
matplotlib
seaborn
plotly
transformers (Hugging Face)

Steps to Install

Clone the repository:

https://github.com/Harry262000/PDFTextExtraction.git

Create a virtual enviorment and activate it:

python -m venv venv
source venv/bin/activate

install the required libraries:
```
pip install -r requirements.txt
```

Usage

Run the streamlit app:
```
streamlit run app.py
```
Upload a PDF:
- Use the web interface to upload a PDF document.
View Extracted Data:
- Extracted text and tables will be displayed on the interface.
Interactive Chat:
- Engage with the chatbot to query the extracted data.
Visualize Data:
- View visual representations of the extracted tables.

File Structure

app.py: Main application file for Streamlit.
pdf_extraction.py: Handles PDF extraction logic.
chatbot.py: Manages chatbot interactions.
visualization.py: Contains code for data visualization.
requirements.txt: Lists required libraries and dependencies.

Example Queries

"Show me the revenue for Q1."
"What are the expenses listed in the table on page 3?"
"Summarize the main points from the financial report."

Issues and Suggestions

If you encounter any issues or have suggestions for improvements, please raise an issue in the GitHub Issues section of this repository.

Harry262000/Pdf-chunks-Extractor