Scrapes the exhibitor list of CAMX 2022 (The Composites and Advanced Materials Expo).
The goal is a structured overview of the exhibitors that can be used to analyze where the companies operate and in which fields.
The webdriver runs headless, which means the browser window is not visible. A few useful preferences are set:
- "javascript.enabled", True: JavaScript is enabled, which is required here.
- "permissions.default.image", 2: No images are loaded.
- "plugin.state.flash", 0: Flash is deactivated.
- "toolkit.telemetry.unified", False: Telemetry is deactivated.
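As a minimal sketch of how these preferences could be applied, assuming Selenium 4 with Firefox: the names FIREFOX_PREFS and build_headless_options are hypothetical and not taken from the script itself. The Selenium import is kept inside the function so the preference table can be inspected without Selenium installed.

```python
# Preferences listed above, collected in one place.
FIREFOX_PREFS = {
    "javascript.enabled": True,          # JavaScript is required by the site
    "permissions.default.image": 2,      # 2 = do not load images
    "plugin.state.flash": 0,             # 0 = Flash deactivated
    "toolkit.telemetry.unified": False,  # telemetry deactivated
}


def build_headless_options():
    """Return Firefox options for a headless run (hypothetical helper)."""
    # Imported lazily; assumes the Selenium 4 package layout.
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    for name, value in FIREFOX_PREFS.items():
        options.set_preference(name, value)
    return options


# Possible usage (needs Firefox and geckodriver on the machine):
# from selenium import webdriver
# driver = webdriver.Firefox(options=build_headless_options())
# driver.quit()
```

Keeping the preferences in a dictionary makes it easy to switch a single setting on or off while debugging, for example re-enabling images.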
The following data is extracted and saved to a CSV spreadsheet:
- Company name
- Homepage URL
- Address
- Phone and fax numbers
- Product categories (the sub-fields in which they operate)
- Description (a company's profile description)
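The fields listed above could be written out roughly as follows. This is only a sketch using Python's standard csv module: the column names, the write_exhibitors helper, the "; " separator for categories, and the sample record are all assumptions made for illustration, not the script's actual output format.

```python
import csv
import io

# Hypothetical column names, one per field in the list above.
FIELDNAMES = [
    "company_name",
    "homepage_url",
    "address",
    "phone",
    "fax",
    "product_categories",
    "description",
]


def write_exhibitors(rows, fileobj):
    """Write exhibitor dictionaries to an open text file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample record, not real exhibitor data.
sample = [
    {
        "company_name": "Example Composites Inc.",
        "homepage_url": "https://example.com",
        "address": "1 Example Street, Example City",
        "phone": "000-000-0000",
        "fax": "",
        # Several categories joined into one cell (an assumption).
        "product_categories": "; ".join(["Resins", "Carbon Fiber"]),
        "description": "Supplier of composite materials, among others.",
    }
]

buffer = io.StringIO()
write_exhibitors(sample, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, open the target with open(path, "w", newline="", encoding="utf-8") and pass the file object to the same function. The csv module quotes cells containing commas, so free-text descriptions survive a round trip.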
Once the development setup is in place (see below), just run the script.
The required external libraries are:
- Selenium: https://github.com/SeleniumHQ/selenium
- Geckodriver: https://github.com/mozilla/geckodriver
Selenium:
pip install selenium
Geckodriver: Download the latest release and put it into your development folder (e.g. C:/Users/yourUsername/Anaconda3). Make sure this folder is listed in your PATH environment variable.
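To check whether the Geckodriver setup worked, a quick test along these lines may help. It only uses the standard library; find_geckodriver is a hypothetical helper name and not part of the script.

```python
import shutil


def find_geckodriver(name="geckodriver"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_geckodriver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print(f"geckodriver found at {path}")
```

If the executable is not found, open a new terminal after changing the environment variable, since running shells usually keep the old PATH.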
Author: Jonas Dossmann
Distributed under the AGPL-3.0 license.