A dynamic/static web scraping bot using Selenium 4 and Beautiful Soup 4
I built this mini-project because I needed a tool to scrape web data for my own use. I hope this simple tool can help with your bigger projects.
To run the code, you will need to install a few separate libraries, which are listed under installation below.
I also describe the flow of the code and give an example of how to modify it to extract information from raw HTML.
- ChromeDriver compatible with your Chrome version (or the driver for your preferred browser). Remember the path to the .exe; we will use it later.
- Selenium (syntax differs between versions; this project uses Selenium 4)
pip install -U selenium
- Beautiful Soup 4
pip install beautifulsoup4
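Because Selenium's syntax differs between major versions, it may help to confirm what is installed before running the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_major` is hypothetical and not part of this repository.

```python
from importlib import metadata


def installed_major(package):
    """Return the major version of an installed package, or None if it is missing."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    # "4.15.2" -> 4; assumes the version string starts with an integer
    return int(version.split(".")[0])


for name in ("selenium", "beautifulsoup4"):
    print(name, installed_major(name))
```

A result of `None` means the package still needs to be installed with the pip commands above.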
- Download this repository's .ipynb file and run it
- Set website to the link of the site you want to scrape
- Set path to the location of your ChromeDriver.exe
website = 'your website'
path = 'path to your chromedriver.exe file'
- OPTIONAL: one optional module scrolls the website automatically. I set it up as an infinite scroll because https://hcmut.edu.vn/danh-sach-tin-tuc only loads new data as you scroll (stop it by interrupting the cell)
- Extract information from the website:
- Check the HTML structure of your desired website
- Check the tag and class of the fields you are interested in
- Put them in
soup.find_all('yourtag', class_ = 'yourclass')  # this will return a list
- Loop through the list (use find to get a single element from each item)
- There are 2 ways to save data:
- Save all data to csv file
- Save to separate txt files (remember to first create a folder to contain all the txt files)
- Modify further to suit your needs
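The fetch-and-scroll steps above might be sketched as follows, assuming the Selenium 4 API (`Service` and `webdriver.Chrome(service=...)`). The function names, the `pause` and `max_rounds` parameters, and the rule of stopping once the page height stops growing are illustrative assumptions; the notebook itself scrolls indefinitely until the cell is interrupted.

```python
import time


def open_page(website, path):
    """Start Chrome through the driver at `path` and load `website`."""
    # Imported lazily so the scrolling helper below can be used
    # (and tested) without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(path))
    driver.get(website)
    return driver


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `max_rounds` is a safety
    cap (an assumption of this sketch) so the loop always terminates.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the site time to fetch new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return rounds
```

A typical call would be `driver = open_page(website, path)`, then `scroll_until_stable(driver)`, after which `driver.page_source` can be passed to `BeautifulSoup(driver.page_source, 'html.parser')` and the browser closed with `driver.quit()`.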
This is the HTML structure of the website I scraped:
To extract the header content:
headers = soup.find_all('h3', class_ = 'heading')
headerContent = []
for header in headers:
    headerContent.append(header.find('p').text.lower())
print(headerContent)
Result:
'555 bộ bàn ghế học tập được trường đh bách khoa trao tặng cho ubnd huyện châu thành'
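Once `headerContent` holds the scraped strings, the two save options mentioned in the usage steps could look like the sketch below. The function names, the `header` column name, and the numbered file naming are assumptions for illustration; the notebook may organize its output differently. Files are written as UTF-8 so that Vietnamese text is preserved.

```python
import csv
from pathlib import Path


def save_to_csv(rows, csv_path):
    """Write every scraped string as one row of a single CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["header"])  # hypothetical column name
        for text in rows:
            writer.writerow([text])


def save_to_txt_files(rows, folder):
    """Write each scraped string to its own numbered .txt file inside `folder`."""
    out_dir = Path(folder)
    # Creates the folder if it does not exist yet, so it need not be made by hand.
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, text in enumerate(rows, start=1):
        (out_dir / f"{index:03d}.txt").write_text(text, encoding="utf-8")
```

For example, `save_to_csv(headerContent, 'headers.csv')` would produce one file with a row per header, while `save_to_txt_files(headerContent, 'scraped_txt')` would produce `001.txt`, `002.txt`, and so on inside a `scraped_txt` folder (both file names are placeholders).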
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
Distributed under the MIT License. See LICENSE.txt for more information.
Oanh Tran - oanh.tranotsc1123@hcmut.edu.vn
Project Link: https://github.com/Oztobuzz/Web-Scraping-Bot