How to Parse HTML with PyQuery: Python Tutorial

In this article, we will learn how to write a web scraper in Python using the PyQuery Library. We will first explore the basics and after that, we will also do a comparison of PyQuery with Beautifulsoup. So, let’s get started.

What is PyQuery?

PyQuery is a Python library that allows you to manipulate and extract data from HTML and XML documents. It provides a jQuery-like syntax and API, making it easy to work with web content in Python.

Like jQuery, PyQuery allows you to select elements from an HTML or XML document using CSS selectors, and then manipulate or extract data from those elements. You can use PyQuery to parse and manipulate HTML and XML documents, scrape web pages, and extract data from web APIs.

How to install PyQuery

To install PyQuery, you will need to have Python installed on your machine. If you don't have Python installed, you can download and install it from the official Python website.

Once you have Python installed, you can download and install the PyQuery library using pip. To do this, open a terminal or command prompt and type the following command:

python -m pip install pyquery

This will install pyquery with all the necessary dependencies. If you get any errors, check out the official pyquery documentation

Parsing DOM

Let’s write our first scraper using PyQuery. We will use the requests module to fetch the website and parse it using PyQuery Module. Let’s go ahead and import the necessary libraries:

import requests
from pyquery import PyQuery as pq
 
Now, let's fetch the website: https://example.com and grab the title using pyquery.

r = requests.get("https://example.com")
doc = pq(r.content)
print(doc("title").text())

Now, if we run this code, it will print the title of the website. Notice, we are using the get method to grab the website content. And, then using the PyQuery class we parsed the whole content and stored it in the doc object. We then use the CSS selector to parse and display the title text using the title tag as a CSS selector.

Extract Multiple Elements Using CSS Selector

Next, we will extract multiple elements using the CSS Selector. We will use the https://books.toscrape.com website. PyQuery has built-in support to extract HTML from URL let’s also leverage it for the next example:

from pyquery import PyQuery as pq
doc = pq(url="https://books.toscrape.com")
for link in doc("h3>a"):
   print(link.text, link.attrib["href"])

We are using CSS Selector to grab all the links inside the H3 tags. And, then using a for loop we are printing the text and URL of those links. Depending on the number of elements the CSS selector will return one or more elements. To access the element properties we use the attrib object. The syntax is the same as the python dictionary. So, we simply pass the “href” as a key and it returns the URL of the element.

Removing Elements

Sometimes we might need to remove unwanted elements from the DOM. PyQuery has a method called remove() which can be used for this purpose. Let’s say we want to get rid of all the icons from the above example. We can do it by adding a few lines of code like below:

from pyquery import PyQuery as pq
doc = pq(url="https://books.toscrape.com")
doc("i").remove()
print(doc)

Once, we run this code it will remove all the icons from the doc.

PyQuery vs BeautifulSoup

Both PyQuery and Beautiful Soup are great Python libraries for working with HTML and XML documents with tools for parsing, traversing, and manipulating HTML and XML documents, as well as extracting data from web pages and APIs.

One key difference between PyQuery and Beautiful Soup is the syntax and API that they use. PyQuery is designed to have a syntax and API similar to jQuery, a popular JavaScript library for working with HTML and DOM elements. If you are familiar with jQuery, you should be able to pick up PyQuery quickly. Beautiful Soup, on the other hand, has a different syntax and API that is more similar to the ElementTree library in Python's standard library. If you are familiar with ElementTree, you may find Beautiful Soup easier to use. Also, Beautifulsoup supports HTML sanitization which is handy if you are trying to scrape a website with broken HTML. Beautifulsoup is more feature riched when it comes to built-in functions, however, being lightweight PyQuery can do things much faster than beautifulsoup.

Ultimately, the choice between PyQuery and Beautiful Soup will depend on your specific needs and preferences either one can be a good choice for working with HTML and XML documents in Python.

	PyQuery	BeautifulSoup
Syntax and API	JQuery Like	ElementTree like
Performance	Fast	Good
Support Multiple Parsers	Yes	Yes
Unicode Support	Yes	Yes
HTML Sanitization	No	Yes
Multiple Language support	No	Yes

Conclusion

In conclusion, PyQuery is an easy-to-use Python library for working with HTML and XML documents. Its jQuery-like syntax and API make it easy to parse, traverse, and manipulate HTML and XML documents, and extract data.

While PyQuery is a powerful tool, it is not the only option available for working with HTML and XML documents in Python. Beautiful Soup is another popular library that offers a different syntax and API and is suitable for different use cases. Ultimately, the choice between PyQuery and Beautiful Soup will depend on your specific needs and preferences.

In this article, we have introduced PyQuery and its capabilities, provided examples of how to use it, and compared it to Beautiful Soup. We hope that this information has been useful and that you are now better equipped to work with HTML and XML documents using PyQuery in your own projects.

oxylabs/parse-html-pyquery