Addgene Webscraper

This repository contains Python scripts for scraping plasmid data from the Addgene website. It includes two main scripts: webscraper.py for collecting general plasmid information, and plasmid_scraper.py for extracting full gene sequences.

Features

webscraper.py

Scrapes multiple pages of plasmid data from Addgene
Extracts various fields including title, catalog number, purpose, depositor, and more
Handles pagination automatically
Saves data to a CSV file for easy analysis
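
The pagination behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in webscraper.py: `fetch_page` and `fake_fetch` are hypothetical stand-ins for the real Selenium/BeautifulSoup page fetch, and treating page 180 (the limit mentioned in the Notes) as the last page scraped is an assumption.

```python
from typing import Callable, Dict, List, Optional

MAX_PAGE = 180  # page limit mentioned in the Notes section (assumed inclusive)


def scrape_all_pages(
    fetch_page: Callable[[int], Optional[List[Dict[str, str]]]],
    max_page: int = MAX_PAGE,
) -> List[Dict[str, str]]:
    """Collect rows page by page until the limit or an empty page is reached.

    fetch_page is a hypothetical callable: given a 1-based page number it
    returns a list of row dicts, or None / [] when there is no such page.
    """
    rows: List[Dict[str, str]] = []
    for page in range(1, max_page + 1):
        page_rows = fetch_page(page)
        if not page_rows:  # no more pages to scrape
            break
        rows.extend(page_rows)
    return rows


def fake_fetch(page: int) -> Optional[List[Dict[str, str]]]:
    # Stand-in for a real page fetch: three pages with one row each.
    if page > 3:
        return None
    return [{"Title": f"plasmid-{page}", "Catalog Number": str(1000 + page)}]


if __name__ == "__main__":
    print(len(scrape_all_pages(fake_fetch)))
```

Passing the fetch step in as a callable keeps the stopping logic separate from the browser automation, so it can be checked without opening Chrome.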

plasmid_scraper.py

Loads plasmid catalog numbers from a CSV file
Scrapes full gene sequences for each plasmid
Supports batch processing (currently set to the first 100 plasmids)
Saves extracted sequences to a new CSV file
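
A rough sketch of the loading and batch-limit steps, using only the standard library rather than Pandas. The function name `load_catalog_numbers` and the column name "Catalog Number" are assumptions for illustration; the actual limit lives in the script's extract_seq() function.

```python
import csv
from itertools import islice
from typing import List

BATCH_LIMIT = 100  # the README states that the first 100 plasmids are processed


def load_catalog_numbers(
    csv_path: str,
    column: str = "Catalog Number",  # assumed column name
    limit: int = BATCH_LIMIT,
) -> List[str]:
    """Return up to `limit` non-empty catalog numbers from the given CSV column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        values = (
            row[column].strip()
            for row in reader
            if (row.get(column) or "").strip()  # skip missing or blank cells
        )
        # Materialise the list while the file is still open.
        return list(islice(values, limit))
```

Calling `load_catalog_numbers("plasmid_data.csv")` would then give the batch of catalog numbers to scrape sequences for.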

Requirements

Python 3.x
BeautifulSoup4
Selenium
Chrome WebDriver
Pandas

Installation

Clone this repository:

git clone https://github.com/yourusername/addgene-webscraper.git
cd addgene-webscraper

Install the required Python packages:

pip install beautifulsoup4 selenium pandas

Download and install the Chrome WebDriver that matches your Chrome browser version.
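
One quick way to confirm the driver is reachable before running the scripts is to look for it on PATH. This is only a sketch: it assumes the executable is named chromedriver and was placed on PATH, which may not match your setup, and `find_chromedriver` is a hypothetical helper, not part of this repository.

```python
import shutil
from typing import Optional


def find_chromedriver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    print(path if path else "chromedriver not found on PATH")
```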

Usage

webscraper.py

Run the script:

python webscraper.py

The script will start scraping data from Addgene, beginning with the first page of empty backbone plasmids.
Once complete, the data will be saved to plasmid_data.csv in the same directory.

plasmid_scraper.py

Ensure you have a CSV file named plasmid_data.csv with a column containing plasmid catalog numbers.
Update the file path in the main() function to specify where you want to save the output.
Run the script:

python plasmid_scraper.py

The script will extract full gene sequences for the specified plasmids and save the results to the specified CSV file.

Output

webscraper.py

Generates plasmid_data.csv with columns:

Title, Catalog Number, Purpose, Depositor, Article Name, Article Link, Type, Use, Expression, Promoter, Availability, Mutation, Tags
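
For reference, a file with this header could be written as in the sketch below. It uses the standard-library csv module instead of Pandas, and `save_rows` is a hypothetical helper rather than a function from webscraper.py; missing fields are written as empty strings, which is an assumption about how gaps should be handled.

```python
import csv
from typing import Dict, Iterable

# Column order as listed in the Output section.
COLUMNS = [
    "Title", "Catalog Number", "Purpose", "Depositor", "Article Name",
    "Article Link", "Type", "Use", "Expression", "Promoter",
    "Availability", "Mutation", "Tags",
]


def save_rows(rows: Iterable[Dict[str, str]], csv_path: str = "plasmid_data.csv") -> None:
    """Write row dicts to a CSV file with the fixed column order above."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=COLUMNS,
            restval="",             # fill fields a row does not provide
            extrasaction="ignore",  # drop keys that are not known columns
        )
        writer.writeheader()
        writer.writerows(rows)
```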

plasmid_scraper.py

Generates a CSV file (path specified in the script) with columns:

Catalog Number, Full Gene Seq

Notes

webscraper.py is set to stop after reaching page 180 or when there are no more pages to scrape.
plasmid_scraper.py is currently set to process only the first 100 plasmids. Adjust this limit in the extract_seq() function if needed.
Web scraping may be subject to Addgene's terms of service. Ensure you have permission to scrape their website.
Use responsibly and consider implementing delays between requests to avoid overloading the server.
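
A delay between requests could look something like the sketch below. The 2 to 5 second range is an arbitrary assumption, not a value recommended by Addgene, and `fetch_politely` and its `fetch` argument are hypothetical names standing in for whatever function actually loads a page.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 2.0,  # assumed lower bound in seconds
    max_delay: float = 5.0,  # assumed upper bound in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, result) pairs, pausing a random interval between requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        yield url, fetch(url)
```

Randomising the pause avoids hitting the server at a fixed rhythm, and the injectable `sleep` makes it possible to exercise the loop without actually waiting.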

Contributing

Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.

License

MIT
