fetcharoo is a Python library for downloading PDF files from a webpage. It can follow links to a configurable recursion depth and can optionally merge the downloaded PDFs into a single file.
- Download PDF files from a specified webpage.
- Specify recursion depth to control how many levels of links to follow when searching for PDFs.
- Choose to merge downloaded PDFs into a single file or store them as separate files.
- Simple and easy-to-use Python interface.
- Python 3.10 or higher
- Third-party libraries: requests, PyMuPDF
You can install fetcharoo using pip:
pip install fetcharoo
If you are using Poetry to manage your project, you can install fetcharoo as a dependency:
poetry add fetcharoo
If you don't have Poetry installed, you can install it by following the instructions on the official Poetry website.
To get started with fetcharoo, follow these steps:
- Install the library using pip or Poetry (see the Installation section above).
- Import the download_pdfs_from_webpage function from the fetcharoo module.
- Use the function to download PDFs from a webpage, specifying the URL, recursion depth, mode (merge or separate), and output directory.
Here's a basic example:
from fetcharoo import download_pdfs_from_webpage
# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    output_dir='output'
)
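To give a rough idea of what finding PDFs on a webpage involves, here is a minimal sketch that extracts PDF links from an HTML string using only the standard library. It is not fetcharoo's implementation: LinkCollector and find_pdf_links are hypothetical names introduced for illustration, and the check assumes a PDF link is one whose URL path ends in .pdf.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def find_pdf_links(html, base_url):
    """Return links whose URL path ends in .pdf (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [
        link for link in collector.links
        if urlparse(link).path.lower().endswith(".pdf")
    ]


page = '<a href="/docs/report.pdf">Report</a> <a href="about.html">About</a>'
print(find_pdf_links(page, "https://example.com/index.html"))
# prints ['https://example.com/docs/report.pdf']
```

Checking the parsed path rather than the raw URL means a link such as report.pdf?version=2 is still recognized, although PDFs served from URLs without a .pdf suffix would be missed by this sketch.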
fetcharoo provides additional options for customizing the behavior of the library:
- To download PDFs and store them as separate files, set the mode parameter to 'separate':
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='separate',
    output_dir='output'
)
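When PDFs are stored separately, two links can share the same file name. The sketch below shows one possible way to derive unique local names from URLs. How fetcharoo actually names its output files is not stated here, so filenames_for and the download.pdf fallback are assumptions made purely for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filenames_for(urls):
    """Map each URL to a unique local filename (hypothetical helper)."""
    used = set()
    result = {}
    for url in urls:
        # Take the last path segment; fall back to a generic name if empty.
        name = posixpath.basename(unquote(urlparse(url).path)) or "download.pdf"
        stem, ext = posixpath.splitext(name)
        candidate, counter = name, 1
        # Append a numeric suffix until the name is unused.
        while candidate in used:
            candidate = f"{stem}_{counter}{ext}"
            counter += 1
        used.add(candidate)
        result[url] = candidate
    return result


urls = [
    "https://example.com/a/report.pdf",
    "https://example.com/b/report.pdf",
]
for url, name in filenames_for(urls).items():
    print(url, "->", name)
```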
To control the recursion depth, adjust the recursion_depth parameter. For example, to follow links up to two levels deep, set recursion_depth=2.
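The effect of a depth limit can be illustrated with a small breadth-first search over an in-memory link graph. This is a sketch under stated assumptions, not fetcharoo's code: SITE is a made-up site, collect_pdfs is a hypothetical function, and the convention that depth 0 means "search only the starting page" is an assumption about how levels are counted.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "https://example.com/": [
        "https://example.com/a.pdf",
        "https://example.com/docs/",
    ],
    "https://example.com/docs/": [
        "https://example.com/docs/b.pdf",
        "https://example.com/docs/old/",
    ],
    "https://example.com/docs/old/": [
        "https://example.com/docs/old/c.pdf",
    ],
}


def collect_pdfs(start_url, recursion_depth):
    """Collect PDF links, following page links up to recursion_depth levels."""
    found = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        for link in SITE.get(url, []):
            if link.endswith(".pdf"):
                if link not in found:
                    found.append(link)
            elif depth < recursion_depth and link not in seen:
                # Only descend while the depth budget allows it.
                seen.add(link)
                queue.append((link, depth + 1))
    return found


for depth in range(3):
    print(depth, collect_pdfs("https://example.com/", depth))
```

In this toy graph, each extra level of depth reaches one more PDF. The seen set keeps pages that link to each other from being visited twice.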
Contributions to fetcharoo are welcome! If you'd like to contribute, please follow these steps:
- Fork the repository on GitHub.
- Create a branch for your changes.
- Make your changes and commit them to your branch.
- Submit a pull request with your changes.
We appreciate any contributions, whether it's fixing bugs, adding new features, or improving documentation.
If you encounter any issues or have questions about using fetcharoo, please open an issue on the GitHub repository. We'll do our best to assist you.
Please refer to the CHANGELOG.md file for a summary of changes in each release.
fetcharoo was developed by Mark Lifson. I'd like to thank all contributors and users for their support.
This project is licensed under the MIT License. See the LICENSE file for details. The MIT License allows for broad permissions, including use, modification, distribution, and sublicensing of the software.
Added new features to fetcharoo:
- merge_pdfs function to merge multiple PDFs into a single file