CRAJobHarvester is a Python-based tool designed to scrape and analyze job listings from the Computing Research Association (CRA) website. It utilizes web scraping and OpenAI's language models to extract and structure job information.
- Scrapes job listings from the CRA website
- Uses OpenAI's GPT models to parse and structure job details
- Saves results in a CSV file
- Avoids duplicate entries
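The duplicate check could be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes duplicates are detected by the CRA Link column of the output CSV, and the helper names `load_seen_links` and `filter_new` are hypothetical.

```python
import csv
import os


def load_seen_links(csv_path, key="CRA Link"):
    """Return the set of values already stored under `key` in the CSV.

    Assumption: the CSV has a header row that contains `key`.
    """
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[key] for row in csv.DictReader(f) if row.get(key)}


def filter_new(listings, seen, key="CRA Link"):
    """Keep only listings whose `key` value has not been seen yet."""
    fresh = []
    for listing in listings:
        link = listing.get(key)
        if link and link not in seen:
            seen.add(link)  # also guards against repeats within one run
            fresh.append(listing)
    return fresh
```

Loading the links once into a set keeps each lookup cheap, so the cost of the check does not grow with every listing processed.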
- Python 3.7 or higher
- Chrome browser
- ChromeDriver
- Clone this repository:
  git clone https://github.com/ZhangZhuoSJTU/CRAJobHarvester.git
  cd CRAJobHarvester
- Install the required Python packages:
  pip install -r requirements.txt
- Download ChromeDriver:
  - Visit the ChromeDriver downloads page
  - Download the version that matches your Chrome browser version
  - Extract the executable and note its path
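To confirm that the downloaded ChromeDriver matches the installed browser, the major version numbers can be compared. The sketch below is not part of the tool; the helper names are hypothetical, and it assumes the version strings look like the output of `chromedriver --version` and the browser's own `--version` flag, with four dot-separated numeric components.

```python
import re


def major_version(version_text):
    """Extract the major version from text such as 'ChromeDriver 114.0.5735.90 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """Return True when browser and driver report the same major version."""
    return major_version(chrome_text) == major_version(driver_text)
```

For example, `versions_match("Google Chrome 114.0.5735.198", "ChromeDriver 114.0.5735.90")` returns `True`, while a browser on major version 115 paired with the same driver would return `False`.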
Run the script with the following command:
python cra_job_crawler.py --csv output.csv --api_key your_openai_api_key --chromedriver /path/to/chromedriver --additional_links 5 --log_level INFO
- --csv: Path to the CSV file for output and duplicate checking (default: cra_job_listings.csv)
- --api_key: Your OpenAI API key
- --model: OpenAI model to use (choices: gpt-3.5-turbo, gpt-4, gpt-4o; default: gpt-3.5-turbo)
- --chromedriver: Path to your ChromeDriver executable (required)
- --additional_links: Number of additional links to process per job listing (default: 3)
- --max_attempts: Maximum number of attempts for parsing job details (default: 3)
- --log_level: Logging level (choices: DEBUG, INFO, WARNING, ERROR, CRITICAL; default: INFO)
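For reference, the options above map naturally onto an `argparse` parser. The following is a sketch reconstructed from the documented flags, defaults, and choices, not a copy of the script's parser; the function name `build_parser`, the help strings, and the integer types are assumptions, and `--api_key` is left optional because the documentation does not say whether it is required.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="Crawl CRA job listings")
    parser.add_argument("--csv", default="cra_job_listings.csv",
                        help="CSV file for output and duplicate checking")
    parser.add_argument("--api_key", help="OpenAI API key")
    parser.add_argument("--model", default="gpt-3.5-turbo",
                        choices=["gpt-3.5-turbo", "gpt-4", "gpt-4o"],
                        help="OpenAI model to use")
    parser.add_argument("--chromedriver", required=True,
                        help="Path to the ChromeDriver executable")
    parser.add_argument("--additional_links", type=int, default=3,
                        help="Additional links to process per job listing")
    parser.add_argument("--max_attempts", type=int, default=3,
                        help="Maximum attempts for parsing job details")
    parser.add_argument("--log_level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="Logging level")
    return parser
```

With this parser, `build_parser().parse_args(["--chromedriver", "/path/to/chromedriver"])` yields the documented defaults for every other option.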
The script generates a CSV file containing the following information for each job listing:
- Company/University
- Department
- Position (Assistant Professor, Associate Professor, etc.)
- Hiring Areas
- Location
- Number of Positions
- Submission Deadline
- Number of Recommendation Letters
- Expiration Date
- CRA Link
- Crawl Time
- Posted Date
- Additional Links
- Additional Comments
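Writing rows with these fields could look like the sketch below. The column headers are taken from the list above, but the exact header strings used by the script are an assumption, as is the helper name `append_rows`.

```python
import csv
import os

# Column names assumed to match the fields listed above.
FIELDS = [
    "Company/University", "Department", "Position", "Hiring Areas",
    "Location", "Number of Positions", "Submission Deadline",
    "Number of Recommendation Letters", "Expiration Date", "CRA Link",
    "Crawl Time", "Posted Date", "Additional Links", "Additional Comments",
]


def append_rows(csv_path, rows):
    """Append job dictionaries to the CSV, writing the header only once."""
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        # restval fills any field a row does not provide with an empty string.
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Note that `csv.DictWriter` raises `ValueError` by default if a row contains a key outside `FIELDS`, which helps catch misspelled field names early.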
The script uses a custom logging setup with colored output for console logs and detailed logs saved to a file. The log file (cra_job_crawler.log) uses a rotating file handler to manage log size.
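A logging setup of this kind can be sketched with the standard library alone. This is an illustration rather than the script's implementation: the ANSI color choices, the format strings, the size limit, the backup count, and the names `ColorFormatter` and `setup_logging` are all assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler

# ANSI color codes per level (assumed palette).
COLORS = {
    logging.DEBUG: "\033[36m",     # cyan
    logging.INFO: "\033[32m",      # green
    logging.WARNING: "\033[33m",   # yellow
    logging.ERROR: "\033[31m",     # red
    logging.CRITICAL: "\033[35m",  # magenta
}
RESET = "\033[0m"


class ColorFormatter(logging.Formatter):
    """Wrap each formatted console message in a level-specific color."""

    def format(self, record):
        message = super().format(record)
        color = COLORS.get(record.levelno, "")
        return f"{color}{message}{RESET}" if color else message


def setup_logging(level="INFO", log_file="cra_job_crawler.log",
                  max_bytes=1_000_000, backup_count=3):
    """Colored console output at `level`, full detail in a rotating file."""
    logger = logging.getLogger("cra_job_crawler")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    console = logging.StreamHandler()
    console.setLevel(level)
    console.setFormatter(ColorFormatter("%(levelname)s: %(message)s"))

    file_handler = RotatingFileHandler(
        log_file, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```

Keeping the file handler at DEBUG while the console follows the chosen level means the log file retains full detail even when console output is kept quiet.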
If you encounter any issues:
- Check that your ChromeDriver version is compatible with your Chrome browser version.
- Ensure your OpenAI API key is correctly set and has sufficient credits.
- Review the log file for detailed error messages.
- Adjust the log level for more detailed output if needed.
This tool has been tested primarily with the gpt-3.5-turbo model.
Contributions to CRAJobHarvester are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for educational and research purposes only. Please respect the CRA website's terms of service and use this tool responsibly.