This project provides a web-based tool for uploading PDF documents, automatically extracting their tables and text, and chatting with the extracted data to gain insights. It aims to simplify the analysis of complex PDFs such as bills or annual reports through automated extraction, visualization, and interactive querying.
To see the tool in action, check out the live demo on Streamlit. This demo showcases the interactive features of the app, including PDF uploading, data extraction, and chat functionality.
- PDF Upload: Easily upload PDF files through a web interface.
- Automated Extraction: Extract text and tables from the uploaded PDFs.
- Data Processing: Organize extracted data in a structured format for efficient querying.
- Interactive Chat: Engage with the extracted data using a natural language chatbot.
- Data Visualization: Generate and interact with visual representations of the extracted tables.
- User Feedback: Provide feedback to improve the tool's accuracy and usability.
- Python 3.8+
- pip (Python package installer)
- streamlit
- PyMuPDF
- pdfplumber
- tabula-py
- pandas
- matplotlib
- seaborn
- plotly
- transformers (Hugging Face)
- Clone the repository:
git clone https://github.com/Harry262000/PDFTextExtraction.git
- Create a virtual environment and activate it:
python -m venv venv && source venv/bin/activate
- Install the required libraries:
pip install -r requirements.txt
- Run the Streamlit app:
streamlit run app.py
- Upload a PDF:
Use the web interface to upload a PDF document.
- View Extracted Data:
Extracted text and tables will be displayed on the interface.
- Interactive Chat:
Engage with the chatbot to query the extracted data.
- Visualize Data:
View visual representations of the extracted tables.
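The extraction step in the workflow above could be sketched roughly as follows. This is a minimal illustration, not the project's actual `pdf_extraction.py`: it assumes pdfplumber (one of the listed dependencies) does the extraction, and the function names `rows_to_records` and `extract_pdf`, as well as the layout of the returned dictionary, are hypothetical.

```python
from typing import Dict, List, Optional

# pdfplumber returns each table as a list of rows; cells are strings or None.
Row = List[Optional[str]]


def rows_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Turn one extracted table (header row plus data rows) into a list of dicts."""
    if not table:
        return []
    # Treat the first row as the header; empty or missing cells get a placeholder name.
    header = [(cell or "").strip() or f"column_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        # Skip rows that are entirely empty.
        if not any(cells):
            continue
        records.append(dict(zip(header, cells)))
    return records


def extract_pdf(path: str) -> Dict[str, list]:
    """Collect per-page text and tables from a PDF (hypothetical return layout)."""
    import pdfplumber  # imported lazily so rows_to_records works without it installed

    pages_text, tables = [], []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # extract_text() can return None for pages without a text layer.
            pages_text.append({"page": number, "text": page.extract_text() or ""})
            for raw in page.extract_tables():
                tables.append({"page": number, "records": rows_to_records(raw)})
    return {"text": pages_text, "tables": tables}
```

Keeping the page number next to each table makes it possible to answer page-specific questions later. Note that this sketch assumes unique header names; duplicate column names would overwrite each other in the resulting dicts.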
- app.py: Main application file for Streamlit.
- pdf_extraction.py: Handles PDF extraction logic.
- chatbot.py: Manages chatbot interactions.
- visualization.py: Contains code for data visualization.
- requirements.txt: Lists required libraries and dependencies.
- "Show me the revenue for Q1."
- "What are the expenses listed in the table on page 3?"
- "Summarize the main points from the financial report."
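How the chatbot resolves such queries is not described here. As one possible pre-processing step, a page reference in the question (as in the second example) could be parsed to narrow the search to tables from that page. The sketch below is an assumption for illustration only: the helper names `page_from_query` and `tables_for_query` are hypothetical, and each table is assumed to be a dict carrying a `"page"` key.

```python
import re
from typing import Dict, List, Optional

# Matches "page 3", "Page 12" or "pg. 4" and captures the number.
_PAGE_PATTERN = re.compile(r"\b(?:page|pg\.?)\s*(\d+)\b", re.IGNORECASE)


def page_from_query(query: str) -> Optional[int]:
    """Return the page number mentioned in the query, or None if there is none."""
    match = _PAGE_PATTERN.search(query)
    return int(match.group(1)) if match else None


def tables_for_query(query: str, tables: List[Dict]) -> List[Dict]:
    """Keep only tables from the page named in the query; keep all if no page is named."""
    page = page_from_query(query)
    if page is None:
        return list(tables)
    return [table for table in tables if table.get("page") == page]
```

A query without a page reference, such as "Show me the revenue for Q1.", leaves the table list unfiltered, so the downstream model would still see every extracted table.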
- If you encounter any issues or have suggestions for improvements, please raise an issue in the GitHub Issues section of this repository.