DocuParse is a document processing tool that parses and extracts text and tables from PDF, DOCX, XLSX, and PPTX files. It builds on components from the open-source ragflow project, particularly the deepdoc library, which has been adapted and extended to meet the requirements of this tool.
- Multi-format support: Easily handle and extract content from .docx, .xlsx, .pdf, and .pptx files.
- Text and table extraction: Efficiently extract and process text and tables from documents.
- Open-source integration: Utilizes and extends functionality from the ragflow project for advanced document parsing capabilities.
Ensure you have the following installed:
- Python 3.7 or above
- pip (Python package installer)
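The two prerequisites above can be checked from Python itself. The following is a minimal sketch, not part of DocuParse; the helper names `python_is_supported` and `pip_is_available` are made up for illustration.

```python
import importlib.util
import sys

# Minimum interpreter version stated in the prerequisites.
MIN_PYTHON = (3, 7)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def pip_is_available():
    """Return True if the pip package can be imported by this interpreter."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("pip available:", pip_is_available())
```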
It is recommended to use a virtual environment to manage your dependencies and avoid conflicts. You can set up a virtual environment as follows:
# Create a virtual environment
python3 -m venv env
# Activate the virtual environment
# On Windows
env\Scripts\activate
# On Linux/macOS
source env/bin/activate
Before using DocuParse, you'll need to install the required Python packages. Run the following command to install the dependencies listed in requirements.txt:
pip install -r requirements.txt
To integrate DocuParse with the Google Gemini API (the Gemini 1.5 Flash free tier is recommended), follow these steps:
Set your API key as an environment variable. Replace <YOUR_API_KEY> with your actual API key.
# On Linux/macOS
export API_KEY=<YOUR_API_KEY>
# On Windows (Command Prompt)
set API_KEY=<YOUR_API_KEY>
# On Windows (PowerShell)
$env:API_KEY = "<YOUR_API_KEY>"
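On the Python side, a variable set this way can be read with `os.environ`. The sketch below shows one way to do that and fail early when the key is missing. How DocuParse itself reads the key is not shown here, so `load_api_key` is a hypothetical helper, not the project's actual code.

```python
import os


def load_api_key(env=None):
    """Return the value of the API_KEY environment variable.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    Raises RuntimeError if the variable is unset or blank.
    """
    env = os.environ if env is None else env
    key = env.get("API_KEY", "").strip()
    if not key:
        raise RuntimeError("API_KEY is not set; set it before running DocuParse.")
    return key
```

Checking for the key at startup gives a clear error message instead of a failed API request later on.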
To use DocuParse, you can execute the main.py script, which automatically detects the file type and uses the appropriate parser.
You can run the tool from the command line as follows:
python main.py <file_path>
Replace <file_path> with the path to your document.
python main.py /path/to/document.pdf
# python main.py '/Users/user/Downloads/Doctors and Health Orgs Feb 2023.pdf'
This will output the extracted text content from the specified document.
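File-type detection of the kind described above is often done by looking at the file extension. The sketch below illustrates that approach under stated assumptions: the `PARSER_MODULES` mapping and the `select_parser` function are hypothetical, and the real main.py may detect types differently. Only the module names are taken from the project layout.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the parser module that handles it.
PARSER_MODULES = {
    ".pdf": "pdf_parser",
    ".docx": "docx_parser",
    ".xlsx": "excel_parser",
    ".pptx": "pptx_parser",
}


def select_parser(file_path):
    """Return the name of the parser module for file_path, based on its extension."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive: ".PDF" matches ".pdf"
    try:
        return PARSER_MODULES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

For example, `select_parser("/path/to/document.pdf")` returns `"pdf_parser"`, while a `.txt` path raises `ValueError`.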
- main.py: The entry point of the tool, responsible for handling different document formats and initiating the appropriate parser.
- pdf_parser.py: Contains the PDF parsing logic.
- docx_parser.py: Contains the DOCX parsing logic.
- excel_parser.py: Contains the XLSX parsing logic.
- pptx_parser.py: Contains the PPTX parsing logic.
- deepdoc/: Adapted components from the ragflow project for enhanced document parsing.
- The deepdoc and other document processing components are derived from the ragflow open-source project. These components have been modified to fit the needs of DocuParse.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
With DocuParse, you can effortlessly parse and extract valuable information from various document formats, all in a single, streamlined tool.