PyParsy
PyParsy is an HTML parsing library that uses YAML definition files. The idea is to use the YAML file as a statement of intent: you describe what you want as a result and let PyParsy do the heavy lifting for you. The main difference from similar libraries (e.g. selectorlib) is that it supports multiple versions of selectors for a single field. This way you do not need to create a new YAML definition file for every change on a website.
The YAML files contain:
- The desired structure of the output
- XPath/CSS/Regex selectors for the element extraction
- Return type definition
- Optional children of the field
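The idea of keeping several selector versions for a single field can be illustrated without PyParsy itself. The following is a minimal sketch using only the standard library `re` module; the helper `first_match`, the sample HTML snippets and the patterns are hypothetical and are not part of PyParsy's API or its YAML syntax.

```python
import re
from typing import Optional, Sequence


def first_match(html: str, patterns: Sequence[str]) -> Optional[str]:
    """Try each pattern in order and return group 1 of the first one that matches."""
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1)
    return None


# Two hypothetical versions of the same page layout.
OLD_LAYOUT = '<h1 class="title">Best Sellers</h1>'
NEW_LAYOUT = '<h1 class="card-title">Best Sellers</h1>'

# One list of selector versions covers both layouts.
TITLE_PATTERNS = [
    r'<h1 class="title">(.*?)</h1>',
    r'<h1 class="card-title">(.*?)</h1>',
]

print(first_match(OLD_LAYOUT, TITLE_PATTERNS))
print(first_match(NEW_LAYOUT, TITLE_PATTERNS))
```

Because the patterns are tried in order, a new layout only requires appending another selector version rather than maintaining a second definition.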
Features
- YAML File definitions
- YAML File validation
- Intent instead of coding
- Support for XPath, CSS and Regex selectors
- Different output formats e.g. JSON, YAML, XML
- Somewhat opinionated
- 99% test coverage
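How PyParsy exposes its output formats is not shown in this README, so the following is only a sketch of what serializing a parsed result to JSON and XML can look like with the standard library. The `to_xml` helper, the `item` and `result` tag names and the sample `result` dictionary are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag: str, value) -> ET.Element:
    """Recursively convert dicts, lists and scalars into an XML element."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        # List entries have no name of their own, so a generic tag is used.
        for item in value:
            element.append(to_xml("item", item))
    else:
        element.text = str(value)
    return element


# Hypothetical parse result with the same shape as the examples in this README.
result = {"title": "Best Sellers", "page": 1, "products": [{"price": 19.12}]}

json_output = json.dumps(result, indent=4)
xml_output = ET.tostring(to_xml("result", result), encoding="unicode")

print(json_output)
print(xml_output)
```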
Installation
Using pip:
pip install pyparsy
Running Tests
To run the tests, use the following command:
poetry run pytest
YAML Structure
<field_name>:
  selector: <selector_definition>
  selector_type: <selector_type[XPATH, CSS, REGEX]>
  multiple: <true/false>
  return_type: <return_type[STRING, INTEGER, FLOAT, MAP]>
  children: <list of definitions>
- <field_name>: The field name is the top level of the YAML
- selector: The selector expression
- selector_type: The type of the selector expression, one of XPATH, CSS, REGEX
- multiple: [Optional] true - get all matching results as a list, false - get the first matching result
- return_type: The desired return type, one of STRING, INTEGER, FLOAT or MAP
- children: [Optional] used for return_type: MAP
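The rules above can be checked mechanically. PyParsy itself relies on the schema package for validation; the sketch below is a simplified standard-library stand-in, not the library's implementation. The function `validate_field` is hypothetical, and it assumes that a MAP field must define at least one child.

```python
SELECTOR_TYPES = {"XPATH", "CSS", "REGEX"}
RETURN_TYPES = {"STRING", "INTEGER", "FLOAT", "MAP"}


def validate_field(name: str, definition: dict) -> list:
    """Return a list of human-readable problems found in one field definition."""
    errors = []
    if "selector" not in definition:
        errors.append(f"{name}: missing 'selector'")
    if definition.get("selector_type") not in SELECTOR_TYPES:
        errors.append(f"{name}: 'selector_type' must be one of XPATH, CSS, REGEX")
    if definition.get("return_type") not in RETURN_TYPES:
        errors.append(f"{name}: 'return_type' must be one of STRING, INTEGER, FLOAT, MAP")
    # 'multiple' is optional; when present it has to be a boolean.
    if not isinstance(definition.get("multiple", False), bool):
        errors.append(f"{name}: 'multiple' must be true or false")
    if definition.get("return_type") == "MAP":
        children = definition.get("children")
        # Assumption: a MAP without children is treated as an error.
        if not isinstance(children, dict) or not children:
            errors.append(f"{name}: MAP fields need 'children'")
        else:
            for child_name, child in children.items():
                errors.extend(validate_field(f"{name}.{child_name}", child))
    return errors


good = {"selector": "//h1/text()", "selector_type": "XPATH", "return_type": "STRING"}
bad = {"selector": "//h1/text()", "selector_type": "JSONPATH", "return_type": "STRING"}

print(validate_field("title", good))
print(validate_field("title", bad))
```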
Examples
As an example, consider the Amazon bestseller page. First, we write the YAML definition file:
title:
  selector: //div[contains(@class, "_card-title_")]/h1/text()
  selector_type: XPATH
  return_type: STRING
page:
  selector: //ul[contains(@class, "a-pagination")]/li[@class="a-selected"]/a/text()
  selector_type: XPATH
  return_type: INTEGER
products:
  selector: //div[@id="gridItemRoot"]
  selector_type: XPATH
  multiple: true
  return_type: MAP
  children:
    image:
      selector: //img[contains(@class, "a-dynamic-image")]/@src
      selector_type: XPATH
      return_type: STRING
    title:
      selector: //a[@class="a-link-normal"]/span/div/text()
      selector_type: XPATH
      return_type: STRING
    price:
      selector: //span[contains(@class, "a-color-price")]/span/text()
      selector_type: XPATH
      return_type: FLOAT
    asin:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/@id
      selector_type: XPATH
      return_type: STRING
    reviews_count:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/div/div/a/span/text()
      selector_type: XPATH
      return_type: INTEGER
Then we can use this definition in code:
import json
from pathlib import Path

import httpx
from pyparsy import Parsy


def main():
    parser = Parsy.from_file(Path('tests/assets/amazon_bestseller_de.yaml'))
    response = httpx.get("https://www.amazon.de/-/en/gp/bestsellers/ce-de/ref=zg_bs_nav_0")
    result = parser.parse(response.text)
    print(json.dumps(dict(result), indent=4))


if __name__ == "__main__":
    main()
This will result in:
{
"title": "Best Sellers in Electronics & Photo",
"page": 1,
"products": [
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/81ZnAYiX5sL._AC_UL300_SR300,200_.jpg",
"title": "Amazon Basics High Power 1.5V AA Alkaline Batteries, Pack of 48 (Appearance May Vary)",
"price": 19.12,
"asin": "B00MNV8E0C",
"reviews_count": 526202
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71C3lbbeLsL._AC_UL300_SR300,200_.jpg",
"title": "All-new Echo Dot (5th generation, 2022 release) smart speaker with Alexa | Charcoal",
"price": 59.99,
"asin": "B09B8X9RGM",
"reviews_count": 760
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/811OG1FsNFL._AC_UL300_SR300,200_.jpg",
"title": "Fire TV Stick with Alexa Voice Remote (includes TV controls) | HD streaming device",
"price": 39.99,
"asin": "B08C1KN5J2",
"reviews_count": 92504
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/81FGpGF5kaL._AC_UL300_SR300,200_.jpg",
"title": "Amazon Basics AA Industrial Alkaline Batteries, Pack of 40",
"price": 11.78,
"asin": "B07MLFBJG3",
"reviews_count": 72375
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61ymYQD3gaL._AC_UL300_SR300,200_.jpg",
"title": "Fire TV Stick 4K with Alexa Voice Remote (includes TV controls)",
"price": 59.99,
"asin": "B08XW4FDJV",
"reviews_count": 46503
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61UV1sshWKL._AC_UL300_SR300,200_.jpg",
"title": "Varta Lithium Button Cell Battery",
"price": 3.29,
"asin": "B00TYEL11K",
"reviews_count": 62993
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61bLsZejhPL._AC_UL300_SR300,200_.jpg",
"title": "Instax Fujifilm Mini Instant Film, White, 2 x 10 Sheets (20 Sheets)",
"price": 15.95,
"asin": "B0000C73CQ",
"reviews_count": 197326
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71PTROtCLRL._AC_UL300_SR300,200_.jpg",
"title": "2032 20 40 Cell Battery Silver",
"price": 8.99,
"asin": "B07CSZ575S",
"reviews_count": 16096
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/51CDcTTd3-S._AC_UL300_SR300,200_.jpg",
"title": "Apple AirTag, pack of 4",
"price": 119.0,
"asin": "B0935JRJ59",
"reviews_count": 47525
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71yf6yTNWSL._AC_UL300_SR300,200_.jpg",
"title": "All-new Echo Dot (5th generation, 2022 release) smart speaker with clock and Alexa | Cloud Blue",
"price": 69.99,
"asin": "B09B8RVKGW",
"reviews_count": 665
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71EFiZtPjML._AC_UL300_SR300,200_.jpg",
"title": "Duracell Plus C Baby Alkaline Batteries 1.5 V LR14 MN1400 Pack of 4",
"price": 7.13,
"asin": "B093C9FN7W",
"reviews_count": 42466
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/51Z0FcUPmgL._AC_UL300_SR300,200_.jpg",
"title": "ooono traffic alarm: Warns about speed cameras and hazards in road traffic in real time, automatically active after connection to smartphone via Bluetooth, data from Blitzer.de",
"price": 49.95,
"asin": "B07Q619ZKS",
"reviews_count": 26587
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71g8a2BcgRL._AC_UL300_SR300,200_.jpg",
"title": "Fire TV Stick 4K Max streaming device, Wi-Fi 6, Alexa Voice Remote (includes TV controls)",
"price": 64.99,
"asin": "B08MT4MY9J",
"reviews_count": 30523
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/81Tt3+NBcSL._AC_UL300_SR300,200_.jpg",
"title": "KabelDirekt - 2m - 4K HDMI Cable (4K @120Hz & 4K @60Hz - Spectacular Ultra HD Experience - High Speed with Ethernet - HDMI 2.0/1.4, Blu-ray/PS4/PS5/Xbox Series X/Switch - Black",
"price": 7.99,
"asin": "B004BEMD5Q",
"reviews_count": 125724
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61a3VAbtpQL._AC_UL300_SR300,200_.jpg",
"title": "Soundcore Life P2 Bluetooth Headphones, Wireless Earbuds with CVC 8.0 Noise Isolation for a Crystal Clear Sound Profile, 40-hour Battery Life, IPX7 Water Protection Class, for Work and Travel",
"price": 23.99,
"asin": "B07SJR6HL3",
"reviews_count": 115557
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/41qfJN7dLhL._AC_UL300_SR300,200_.jpg",
"title": "Fire TV Stick Lite mit Alexa-Sprachfernbedienung Lite (ohne TV-Steuerungstasten) | HD-Streamingger\u00e4t",
"price": 29.99,
"asin": "B091G3WT74",
"reviews_count": 5601
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/41hX+2Es+vL._AC_UL300_SR300,200_.jpg",
"title": "Echo Dot (3rd Gen) - Smart speaker with Alexa - Charcoal Fabric",
"price": 49.99,
"asin": "B07PHPXHQS",
"reviews_count": 312374
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71nnAxdMtkL._AC_UL300_SR300,200_.jpg",
"title": "Pack of 40 AG13 LR44 1.5 V Alkaline Button Cell Batteries, Mercury-Free (357 / 357A / L1154 / A76 / GPA76)",
"price": 6.99,
"asin": "B079HZ6RQR",
"reviews_count": 10692
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/418AP8pw3KL._AC_UL300_SR300,200_.jpg",
"title": "EarPods with Lightning Connector",
"price": 16.9,
"asin": "B01M1EEPOB",
"reviews_count": 22486
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/715i0StnSlS._AC_UL300_SR300,200_.jpg",
"title": "Amazon Basics High Capacity AA Rechargeable 2400mAh Batteries Pre-Charged Pack of 12",
"price": 21.94,
"asin": "B07NWT6YLD",
"reviews_count": 146981
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61iYFNhtwHL._AC_UL300_SR300,200_.jpg",
"title": "NEW'C tempered glass foil, protective foil for iPhone 11, iPhone XR, 2pcs., free from scratches, fingerprints and oil, 9H hardness, 0.33 mm ultra clear, screen protective foil for iPhone 11, iPhone XR",
"price": 5.99,
"asin": "B07NC8PWDM",
"reviews_count": 76098
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/81Pd4ogDITL._AC_UL300_SR300,200_.jpg",
"title": "LiCB CR2032 3V Lithium Button Cell Batteries CR 2032 Pack of 10",
"price": 6.99,
"asin": "B07P7V9SP7",
"reviews_count": 10781
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61MRw0Bun4L._AC_UL300_SR300,200_.jpg",
"title": "Varta Ready2Use Rechargeable Battery, Pre-Charged AAA Micro Ni-Mh Battery, Pack of 4, 1000 mAh, Rechargeable without Memory Effect, Ready to Use",
"price": 10.22,
"asin": "B000IGW3JC",
"reviews_count": 47147
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61pBvlYVPxL._AC_UL300_SR300,200_.jpg",
"title": "Amazon Basics - high-speed cable, Ultra HD HDMI 2.0, supports 3D formats, with audio return channel, 1.8 m",
"price": 6.99,
"asin": "B014I8SSD0",
"reviews_count": 469788
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71AwNMpA29L._AC_UL300_SR300,200_.jpg",
"title": "Instax Mini 11 Camera",
"price": 79.0,
"asin": "B084S3Y6L1",
"reviews_count": 14441
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/51hsq3bombL._AC_UL300_SR300,200_.jpg",
"title": "Soundcore by Anker Life P2 Mini Bluetooth Headphones, In-Ear Headphones with 10 mm Audio Driver, Intense Bass, EQ, Bluetooth 5.2, 32 Hours Battery, Charging with USB-C, Minimalist Design (Night Black)",
"price": 39.99,
"asin": "B099DP3617",
"reviews_count": 14245
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/61pUgAx+pPL._AC_UL300_SR300,200_.jpg",
"title": "NEW'C Pack of 3 Tempered Protective Glass for iPhone 14, 13, 13 Pro (6.1 inches), Free from Scratches, 9H Hardness, HD Screen Protector, 0.33 mm Ultra Clear, Ultra Resistant",
"price": 5.99,
"asin": "B09F3P3DQD",
"reviews_count": 10136
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71fzcZQlbqS._AC_UL300_SR300,200_.jpg",
"title": "Echo Show 5 | 2nd generation (2021 release), smart display with Alexa and 2 MP camera | Charcoal",
"price": 84.99,
"asin": "B08KH2MTSS",
"reviews_count": 22944
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/81Jz6OogtbL._AC_UL300_SR300,200_.jpg",
"title": "Misxi hard case with glass screen protector compatible with Apple Watch Series 6 / SE / Series 5 / Series 4 44 mm, pack of 2",
"price": 10.99,
"asin": "B07ZRMCRG7",
"reviews_count": 28111
},
{
"image": "https://images-eu.ssl-images-amazon.com/images/I/71gG2vN8FFS._AC_UL300_SR300,200_.jpg",
"title": "Duracell Plus AAA Micro Alkaline Batteries 1.5 V LR03 MN2400 Pack of 12",
"price": 6.99,
"asin": "B093LT2N4Q",
"reviews_count": 20307
}
]
}
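In the output above, price and reviews_count are numbers rather than strings, which is the effect of the FLOAT and INTEGER return types. How PyParsy performs this conversion internally is not documented here; the sketch below only illustrates the general idea. The `coerce` function is hypothetical, and it assumes English-formatted numbers (comma as thousands separator, dot as decimal point).

```python
import re


def coerce(text: str, return_type: str):
    """Convert extracted text to the requested return type."""
    text = text.strip()
    if return_type == "STRING":
        return text
    if return_type == "INTEGER":
        # Drop everything except digits and a minus sign, e.g. "526,202" -> "526202".
        return int(re.sub(r"[^\d-]", "", text))
    if return_type == "FLOAT":
        # Remove thousands separators, then take the first decimal number found.
        match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        if match is None:
            raise ValueError(f"no number found in {text!r}")
        return float(match.group(0))
    raise ValueError(f"unsupported return type: {return_type}")


print(coerce(" Best Sellers ", "STRING"))
print(coerce("526,202", "INTEGER"))
print(coerce("€19.12", "FLOAT"))
```

Sites that use a comma as the decimal separator would need a different normalization step before the conversion.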
For the sake of the example, let's store the definition file as amazon_bestseller.yaml.
Then we can use the PyParsy library in our code:
from pathlib import Path

import httpx
from pyparsy import Parsy


def main():
    html = httpx.get("https://www.amazon.com/gp/bestsellers/hi/?ie=UTF8&ref_=sv_hg_1")
    parser = Parsy.from_file(Path("amazon_bestseller.yaml"))
    result = parser.parse(html.text)
    print(result)


if __name__ == "__main__":
    main()
For more examples please see the tests for the library.
Documentation
Documentation (hopefully some day)
Acknowledgements
- selectorlib - It is the main inspiration for this project
- Scrapy - One of the best crawling libraries for Python
- parsel - Scrapy's parsing library; it is heavily used in this project and can be considered the main dependency
- schema - Used for validating the YAML file schema
Contributing
Contributions are very much welcome. Just open a pull request and include enough tests to cover your changes.