
WEB SCRAPING BOT

A dynamic/static web scraping bot using Selenium 4 and Beautiful Soup 4

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Contact
  7. Acknowledgments

About The Project

I built this mini-project because I needed a tool to scrape web data for my own use. I hope this simple tool can help you with your bigger projects.

(back to top)

Built With

(back to top)

Getting Started

To run the code, you will need to install a few libraries; they are listed below under Installation.

I also describe the flow of the code and give an example of how to modify it to extract information from raw HTML.

Installation

  1. ChromeDriver compatible with your Chrome version (or the driver for your preferred browser). Remember the path to the .exe file; we will use it later.
  2. Selenium (different versions have different syntax; in this project I use Selenium 4)
    pip install -U selenium
  3. Beautiful Soup 4
    pip install beautifulsoup4
  4. Download this repository's .ipynb file and run it
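
A minimal setup sketch for step 1 and 2, assuming Selenium 4 syntax (a Service object instead of the old executable_path argument); the path and URL below are placeholders, and the notebook may wire this up slightly differently:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

path = 'path/to/your/chromedriver.exe'        # the .exe path you noted in step 1
website = 'https://hcmut.edu.vn/danh-sach-tin-tuc'

service = Service(executable_path=path)       # Selenium 4 way of passing the driver path
driver = webdriver.Chrome(service=service)
driver.get(website)

# hand the rendered page over to Beautiful Soup for parsing
soup = BeautifulSoup(driver.page_source, 'html.parser')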

(back to top)

Usage

  • Pass the website link to website
  • Pass the path of your chromedriver.exe file to path
    website = 'your website'
    path = 'path to your chromedriver.exe file'
  • OPTIONAL: there is an optional cell that auto-scrolls the website. I set it up as an infinite scroll because https://hcmut.edu.vn/danh-sach-tin-tuc requires scrolling to fetch new data (stop it by interrupting the cell); a sketch of the scroll loop follows this list.
  • Extract information from website:
    1. Check the HTML structure of your desired website
    2. Check the tag and class of the fields you are interested in
    3. Put them in
    soup.find_all('yourtag', class_='yourclass')  # this returns a list
    4. Loop through the list (use find to get an element)
  • There are two ways to save the data (a sketch of both follows the Example below):
    1. Save all data to one csv file
    2. Save to separate txt files (remember to first create a folder to hold them)
  • Make further modifications to suit your needs
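
The optional auto-scroll cell mentioned above roughly amounts to an infinite scroll loop like the sketch below; the 2-second pause is an assumption, so tune it to how fast the site loads new items:

import time

# Keep scrolling to the bottom so the page keeps fetching more items;
# interrupt the cell when you have collected enough data.
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)   # give the site a moment to load the new content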

Example

This is my scraped website's HTML structure

[exampleHTML screenshot]

In order to get the header content out:

headers = soup.find_all('h3', class_='heading')   # every <h3 class="heading"> block
headerContent = []
for header in headers:
    headerContent.append(header.find('p').text.lower())   # the <p> inside each header
print(headerContent)

Result:

'555 bộ bàn ghế học tập được trường đh bách khoa trao tặng cho ubnd huyện châu thành'
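 
As a concrete illustration of the two saving options from the Usage section, here is a sketch; the file and folder names (headers.csv, headers/) are made up for the example and are not necessarily what the notebook uses:

import csv
import os

# Option 1: save everything to one csv file
with open('headers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['header'])
    for content in headerContent:
        writer.writerow([content])

# Option 2: one txt file per item (the folder is created here if it does not exist yet)
os.makedirs('headers', exist_ok=True)
for i, content in enumerate(headerContent):
    with open(os.path.join('headers', f'header_{i}.txt'), 'w', encoding='utf-8') as f:
        f.write(content)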

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Oanh Tran - oanh.tranotsc1123@hcmut.edu.vn

Project Link: https://github.com/Oztobuzz/Web-Scraping-Bot

(back to top)

Acknowledgments

(back to top)