INSPIRE Tag Extraction

This repository contains two Python scripts demonstrating how to extract research tags (e.g., hep-th) from INSPIRE HEP pages using Selenium. Because INSPIRE pages are dynamically rendered (JavaScript-based), traditional HTTP requests and parsing (e.g., requests + BeautifulSoup) will not reliably retrieve the DOM elements. Instead, we use Selenium to render pages and parse the final DOM.


Table of Contents


Google Sheets

The Google Sheet linked below presents the 2025 HEP Theory Postdoc Rumor Mill data after being processed through our bulk CSV parsing script. In particular, each entry in the rumor mill (e.g., a candidate, their institution, and acceptance status) has been enhanced with a new column, "Area," which reflects the research tag (such as hep-th, hep-ph, etc.) retrieved from the candidate’s INSPIRE profile.

Overview of Scripts

  1. Bulk CSV Parsing (bulk_csv_parser.py)

    • Reads a CSV file containing one or more INSPIRE links (in a column called "Inspire link").
    • Uses Selenium to visit each link, extracts the Area (i.e., the tag like hep-th) and adds it as a new column (Area).
    • Outputs a new CSV file appending "_with_area" to the original file's name.
  2. Interactive Tag Parser (interactive_parser.py)

    • Runs in a loop, asking the user to input an INSPIRE link.
    • Attempts to extract and print the research tag.
    • Exits on certain commands (quit, q, or exit).

Requirements

  • Python 3.7+ (or a version that supports Selenium well)
  • Google Chrome installed
  • ChromeDriver (matching your installed Chrome version) in your system PATH or placed in a known location
  • Python packages:

Installation and Setup

  1. Clone or download this repository:

    git clone https://github.com/Axect/INSPIRE-Tag-Extraction
    cd INSPIRE-Tag-Extraction
  2. Create and activate a virtual environment via uv (recommended):

    uv venv
    source .venv/bin/activate
  3. Install required packages:

    uv pip sync requirements.txt

    Or manually:

    uv pip install pandas selenium tqdm
  4. Install ChromeDriver:

    • Download ChromeDriver that matches your installed Google Chrome version from here.
    • Make sure it is placed in your system PATH or in the same directory as the scripts.

Usage

Script 1: Bulk CSV Parsing

File: bulk_csv_parser.py

Description: This script reads an input CSV, visits each "Inspire link" row, extracts the tag, and writes a new CSV with an added Area column.

  1. Run the script:
    python bulk_csv_parser.py --csv path/to/input.csv
  2. What it does:
    • It will open the CSV specified by --csv.
    • For each row, it extracts the tag from the provided Inspire link.
    • A new file (with _with_area.csv appended to the base name) will be created.

Example:

python bulk_csv_parser.py --csv "inspire_authors.csv"
# Creates "inspire_authors_with_area.csv"

Script 2: Interactive Tag Parser

File: interactive_parser.py

Description: This script opens a headless Chrome browser and interactively prompts the user to input INSPIRE links one at a time.

  1. Run the script:
    python interactive_parser.py
  2. What it does:
    • A loop prompts you to enter an INSPIRE link.
    • Once you enter the link, the script attempts to load the page, extract the <span class="ant-tag __UnclickableTag__"> text, and print it on the console.
    • Type quit, q, or exit to stop.

Additional Notes

  • If ChromeDriver cannot be found, you may need to specify the path explicitly. For example:
    driver = webdriver.Chrome(executable_path="/path/to/chromedriver", options=chrome_options)
  • If you encounter any issues with dynamic rendering delays, consider increasing the timeout in WebDriverWait(driver, timeout=XX).
  • The tqdm library is used in the CSV script to show progress bars. If you prefer, you can remove it or adjust the code to use standard print statements.

License

This project is licensed under the MIT License. Feel free to use and modify the code according to your needs.