I developed this tool during my PhD to speed up bulk keyword searches across the literature. It checks every PDF file in a single directory for the keywords you specify, then writes a CSV file listing the matched keywords alongside the corresponding file names, so the results are easy to sort and analyze.
- Extracts text from PDF files using the pdfplumber library.
- Preprocesses text to remove line breaks, hyphens, and extra spaces.
- Searches for user-specified keywords within the extracted text.
- Saves the results (matching filenames and keywords) to a CSV file.
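The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in pdf_keyword_finder.py: the function names, the CSV column names, the case-insensitive substring matching, and the choice to drop only hyphens that sit at a line break are all assumptions, and the standard csv module stands in for pandas to keep the sketch dependency-free.

```python
import csv
import re
from pathlib import Path
from typing import Dict, List


def extract_pdf_text(pdf_path: Path) -> str:
    """Concatenate the text of every page in one PDF."""
    # Third-party import kept local so the helpers below work without it.
    import pdfplumber

    with pdfplumber.open(str(pdf_path)) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def preprocess(text: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    # "key-\nword" -> "keyword" (assumption: only line-end hyphens are removed,
    # so a compound such as "state-\nof" would also lose its hyphen).
    text = re.sub(r"-\s*\n\s*", "", text)
    # Remaining line breaks and runs of spaces become single spaces.
    return re.sub(r"\s+", " ", text).strip()


def find_keywords(text: str, keywords: List[str]) -> List[str]:
    """Return the keywords that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def search_directory(pdf_directory: str, keywords: List[str]) -> Dict[str, List[str]]:
    """Map each PDF file name to the keywords found in it; skip files with no match."""
    results: Dict[str, List[str]] = {}
    for pdf_path in sorted(Path(pdf_directory).glob("*.pdf")):
        matches = find_keywords(preprocess(extract_pdf_text(pdf_path)), keywords)
        if matches:
            results[pdf_path.name] = matches
    return results


def write_csv(results: Dict[str, List[str]], csv_path: str) -> None:
    """Write one row per matching file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "keywords"])  # hypothetical column names
        for filename, matches in results.items():
            writer.writerow([filename, "; ".join(matches)])
```

Substring matching is the simplest choice but also matches inside longer words ("Data" matches "database"); a word-boundary regular expression would be stricter.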
Ensure you have the following installed on your system:
- Python 3.7 or higher
- pip (Python package installer)
Install the necessary Python libraries using pip:
pip install pdfplumber pandas
The script requires three positional command-line arguments, in this order:
- pdf_directory: Path to the directory containing the PDF files.
- keywords: Comma-separated list of keywords to search for.
- csv_path: Path to save the output CSV file.
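As a rough sketch, these three arguments could be parsed with argparse as shown here. The parse_args helper is hypothetical and the real script may handle its arguments differently; stripping spaces around each keyword is also an assumption.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse pdf_directory, keywords, and csv_path from the command line."""
    parser = argparse.ArgumentParser(description="Search PDF files for keywords.")
    parser.add_argument("pdf_directory", help="Directory containing the PDF files")
    parser.add_argument("keywords", help="Comma-separated keywords, e.g. 'Python,Data'")
    parser.add_argument("csv_path", help="Where to save the output CSV file")
    args = parser.parse_args(argv)
    # Split on commas, trim stray spaces, and drop empty entries.
    args.keywords = [kw.strip() for kw in args.keywords.split(",") if kw.strip()]
    return args
```

Calling parse_args() with no argument reads from sys.argv, while passing a list makes the function easy to exercise in tests.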
To search for the keywords "Python" and "Data" in PDF files located in /path/to/pdfs and save the results to /path/to/output.csv, run:
python pdf_keyword_finder.py "/path/to/pdfs" "Python,Data" "/path/to/output.csv"
- Integrate the OpenAI API to enhance usability and provide more accurate keyword matching by leveraging advanced natural language processing capabilities.
- Improve the tool's compatibility with scanned PDFs by incorporating Optical Character Recognition (OCR) functionality to handle files without extractable text.
This project is licensed under the MIT License. See the LICENSE file for details.
Feel free to star or fork the repository, open issues, or submit pull requests to improve this tool. Suggestions and feedback are VERY welcome!