- Introduction
- Beautiful Soup
- Basic Terms in Web Scraping
- Types of Parser
- Making a Soup from HTML File
- Making a Soup from Any Website HTML
- Analysis of HTML Tags
- Navigable Strings
- Navigating Through Tag Names
- Navigating Through Child Tags
- Navigating Down the HTML Parse Tree
- Navigating Up the HTML Parse Tree
- Navigating Sideways Through Siblings
The Beautiful Soup Introduction repository provides a comprehensive guide to web scraping fundamentals using Beautiful Soup, a Python library for pulling data out of HTML and XML files. This guide covers essential concepts, basic terms in web scraping, types of parsers, and practical examples using Beautiful Soup.
Beautiful Soup is a Python library designed for extracting data from HTML, XML, and other markup languages. It simplifies the process of web scraping by providing Pythonic idioms for iterating, searching, and modifying the parse tree.
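As a quick taste of these idioms, here is a minimal, self-contained sketch; the HTML snippet, class names, and replacement text are made up purely for illustration:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet used only for illustration
html = """
<html>
  <body>
    <h1>Example</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Searching: find_all returns every matching tag
for p in soup.find_all('p'):
    print(p.get_text())

# Modifying: change an attribute and the tag's text
intro = soup.find('p', class_='intro')
intro['class'] = ['lead']
intro.string = 'Rewritten paragraph.'
print(soup.prettify())
```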
- Crawler: A web bot that visits web pages and collects the links (URLs) it finds.
- Scraper: A bot that visits the web pages of a given set of URLs and retrieves the relevant data from each page.
- Parser: A component that processes or analyzes the retrieved data offline to build a proper data structure, such as a parse tree (see the sketch after this list).
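To make the crawler/scraper/parser distinction concrete, here is a minimal sketch; the URL is a placeholder and error handling is left out for brevity:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL, used purely for illustration
url = 'https://example.com'

page = requests.get(url)                           # scraping: fetch the page
soup = BeautifulSoup(page.content, 'html.parser')  # parsing: build the tree

# Crawling: accumulate the links found on the page
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(links)
```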
- html.parser: Python's built-in parser; needs no extra dependencies.
- html5lib: The most lenient; the better choice when the HTML is broken.
- lxml: The fastest; requires the lxml package (the sketch after this list shows how a parser is selected).
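The parser is selected by the second argument to the BeautifulSoup constructor. A minimal sketch, assuming lxml and html5lib have been installed with pip:

```python
from bs4 import BeautifulSoup

html = '<p>Hello<p>World'  # deliberately sloppy HTML

# Same input parsed with three different parsers; only the second argument changes
print(BeautifulSoup(html, 'html.parser').prettify())
print(BeautifulSoup(html, 'lxml').prettify())      # pip install lxml
print(BeautifulSoup(html, 'html5lib').prettify())  # pip install html5lib
```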
To create a soup (parse the HTML tree) using Beautiful Soup:
```python
from bs4 import BeautifulSoup

def read_file():
    file = open('intro_to_soup_html.html')
    data = file.read()
    file.close()
    return data

html_file = read_file()
soup = BeautifulSoup(html_file, 'lxml')
```
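The same file read can be written with a context manager, which closes the file automatically even if an exception occurs. A small sketch, assuming the same intro_to_soup_html.html file:

```python
from bs4 import BeautifulSoup

# 'with' closes the file for us, even on errors
with open('intro_to_soup_html.html') as file:
    soup = BeautifulSoup(file.read(), 'lxml')
```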
Install the required library (the snippet below also needs requests, beautifulsoup4, and lxml, if they are not already installed):

```bash
pip install fake_useragent
```
Use Beautiful Soup to parse HTML from a website:
```python
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent': ua.chrome}

google_page = requests.get('https://www.google.com', headers=header)
soup = BeautifulSoup(google_page.content, 'lxml')
print(soup.prettify())
```
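Before relying on the parsed content, it is worth confirming that the request actually succeeded. A short follow-up sketch, continuing from the snippet above:

```python
# Raises an exception for 4xx/5xx responses
google_page.raise_for_status()

print(google_page.status_code)  # 200 on success
print(soup.title.string)        # the page's <title> text
```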
Accessing and modifying HTML tags:
```python
meta = soup.meta  # first <meta> tag in the document
div = soup.div    # first <div> tag

# Accessing attributes
print(meta.get("charset"))
print(meta["charset"])

# Modifying attributes at runtime
body = soup.body
body['style'] = 'some style'
```
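All of a tag's attributes are also available as a dictionary through .attrs, and an attribute can be removed with del. A short sketch, continuing from the snippet above:

```python
# .attrs exposes every attribute of the tag as a dict
print(body.attrs)  # e.g. {'style': 'some style'}

# Attributes can be deleted like dictionary keys
del body['style']
print(body.attrs)
```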
Accessing and modifying navigable strings:
```python
title = soup.title

# Accessing the string inside a tag
print(title.string)

# Replacing a navigable string
title.string.replace_with("title has been changed")
```
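To read the text of a whole subtree rather than a single tag, Beautiful Soup also provides .get_text() and the .stripped_strings generator. A brief sketch:

```python
body = soup.body

# All text inside <body>, concatenated into one string
print(body.get_text())

# The same text, one whitespace-stripped string at a time
for text in body.stripped_strings:
    print(text)
```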
Accessing tags directly by their tag names:

```python
title = soup.title
p = soup.p

print(title)
print(p)
```
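Attribute-style access such as soup.p returns only the first matching tag. To collect every match, find_all() can be used, as in this short sketch:

```python
# find_all returns a list of every <p> tag in the document
for p in soup.find_all('p'):
    print(p)
```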
Navigating through child tags using .contents and .children:

```python
head = soup.head
body = soup.body

# Using .contents (a list of the tag's direct children)
print(head.contents)
print(body.contents)

# Using .children (a generator over the same direct children)
for child in body.children:
    print(child if child is not None else '')
```
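A brief usage note, continuing from the snippet above: .contents supports list operations such as len() and indexing, while .children has to be consumed like any other iterator:

```python
print(len(body.contents))   # number of direct children of <body>
print(body.contents[0])     # the first direct child
print(list(body.children))  # materialize the generator into a list
```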
Moving down the HTML parse tree means descending from a tag into the tags nested inside it: .contents and .children (shown above) step one level down, while .descendants walks the entire subtree, as sketched below. (The .parent attribute goes the opposite way and is covered in the next section.)
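A minimal sketch of .descendants, which recursively yields every tag and string nested inside an element, not just its direct children:

```python
body = soup.body

# .descendants walks the whole subtree under <body>, depth-first
for descendant in body.descendants:
    print(descendant)
```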
Moving up the HTML parse tree using .parent:

```python
title = soup.title
p = soup.p
html = soup.html

# .parent is the tag that directly encloses this one
parent = title.parent
print(parent)
print(parent.name)        # e.g. 'head', since <title> lives inside <head>
print(title.parent.name)  # the same lookup in a single step

print(html.parent)  # returns None, as <html> is at the top of the hierarchy
```
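The .parents generator goes a step further and walks all the way up from a tag to the top of the document. A short sketch:

```python
# Walk upwards from <title>: typically prints 'head', 'html', '[document]'
for ancestor in soup.title.parents:
    print(ancestor.name)
```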
Moving sideways through siblings using .next_sibling and .previous_sibling:

```python
body = soup.body
p = soup.body.p

# Moving forwards through the following siblings
# (.next_sibling is often a whitespace string, hence the double call)
print(p.next_sibling.next_sibling)

# Moving backwards through the preceding siblings
print(body.previous_sibling.previous_sibling)
```
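To iterate over all of a tag's siblings rather than stepping one at a time, the .next_siblings and .previous_siblings generators can be used. A short sketch, continuing from the snippet above:

```python
# Everything that follows <p> at the same level of the tree
for sibling in p.next_siblings:
    print(repr(sibling))
```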
Feel free to explore these examples and enhance your understanding of Beautiful Soup for web scraping. Happy coding!
Get Started:
- Clone this repository:

  ```bash
  git clone https://github.com/KIRAN-KUMAR-K3/Beautiful_Soup_Intro.git
  ```
- Explore the examples and dive into the world of web scraping with Beautiful Soup.