This repository contains two Python scripts demonstrating how to extract research tags (e.g., hep-th
) from INSPIRE HEP pages using Selenium. Because INSPIRE pages are dynamically rendered (JavaScript-based), traditional HTTP requests and parsing (e.g., requests
+ BeautifulSoup
) will not reliably retrieve the DOM elements. Instead, we use Selenium to render pages and parse the final DOM.
- Google Sheets
- Overview of Scripts
- Requirements
- Installation and Setup
- Usage
- Additional Notes
- License
The Google Sheet linked below presents the 2025 HEP Theory Postdoc Rumor Mill data after being processed through our bulk CSV parsing script. In particular, each entry in the rumor mill (e.g., a candidate, their institution, and acceptance status) has been enhanced with a new column, "Area," which reflects the research tag (such as hep-th, hep-ph, etc.) retrieved from the candidate’s INSPIRE profile.
- 2025 PostDoc Rumor Mill with tags (Updated: 2025-02-19)
-
Bulk CSV Parsing (
bulk_csv_parser.py
)- Reads a CSV file containing one or more INSPIRE links (in a column called
"Inspire link"
). - Uses Selenium to visit each link, extracts the
Area
(i.e., the tag likehep-th
) and adds it as a new column (Area
). - Outputs a new CSV file appending
"_with_area"
to the original file's name.
- Reads a CSV file containing one or more INSPIRE links (in a column called
-
Interactive Tag Parser (
interactive_parser.py
)- Runs in a loop, asking the user to input an INSPIRE link.
- Attempts to extract and print the research tag.
- Exits on certain commands (
quit
,q
, orexit
).
- Python 3.7+ (or a version that supports Selenium well)
- Google Chrome installed
- ChromeDriver (matching your installed Chrome version) in your system PATH or placed in a known location
- Python packages:
-
Clone or download this repository:
git clone https://github.com/Axect/INSPIRE-Tag-Extraction cd INSPIRE-Tag-Extraction
-
Create and activate a virtual environment via uv (recommended):
uv venv source .venv/bin/activate
-
Install required packages:
uv pip sync requirements.txt
Or manually:
uv pip install pandas selenium tqdm
-
Install ChromeDriver:
- Download ChromeDriver that matches your installed Google Chrome version from here.
- Make sure it is placed in your system PATH or in the same directory as the scripts.
File: bulk_csv_parser.py
Description: This script reads an input CSV, visits each "Inspire link" row, extracts the tag, and writes a new CSV with an added Area
column.
- Run the script:
python bulk_csv_parser.py --csv path/to/input.csv
- What it does:
- It will open the CSV specified by
--csv
. - For each row, it extracts the tag from the provided
Inspire link
. - A new file (with
_with_area.csv
appended to the base name) will be created.
- It will open the CSV specified by
Example:
python bulk_csv_parser.py --csv "inspire_authors.csv"
# Creates "inspire_authors_with_area.csv"
File: interactive_parser.py
Description: This script opens a headless Chrome browser and interactively prompts the user to input INSPIRE links one at a time.
- Run the script:
python interactive_parser.py
- What it does:
- A loop prompts you to enter an INSPIRE link.
- Once you enter the link, the script attempts to load the page, extract the
<span class="ant-tag __UnclickableTag__">
text, and print it on the console. - Type
quit
,q
, orexit
to stop.
- If ChromeDriver cannot be found, you may need to specify the path explicitly. For example:
driver = webdriver.Chrome(executable_path="/path/to/chromedriver", options=chrome_options)
- If you encounter any issues with dynamic rendering delays, consider increasing the timeout in
WebDriverWait(driver, timeout=XX)
. - The
tqdm
library is used in the CSV script to show progress bars. If you prefer, you can remove it or adjust the code to use standard print statements.
This project is licensed under the MIT License. Feel free to use and modify the code according to your needs.