A beginner-friendly WEB SCRAPING library intended to improve your skills at referencing Python data structures
FUNCTION:
parse(uri, *args, **kwargs)
uri - target URL or local file PATH to request, e.g. "http://quotes.toscrape.com", "local.html"
*args - REGEX string used to parse the HTML if you want to replace the built-in one, e.g. r'your_regex'
SUPPOSED to be the ONLY positional argument after uri
**kwargs - keyword arguments wrapping the urllib.request.Request arguments,
e.g. (method="POST", headers={'accept': '*', 'user-agent': 'tiny scraper'})
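As a rough illustration of what the keyword arguments above correspond to, the sketch below builds a urllib.request.Request directly from the same values. It is not Tiny Scraper's own code: build_request is a hypothetical helper, and the idea that parse() hands **kwargs to Request unchanged is an assumption. No network request is made.

```python
from urllib.request import Request


def build_request(uri, **kwargs):
    """Hypothetical helper: pass keyword arguments straight to Request."""
    # Assumption: parse() forwards **kwargs unchanged; the real code may differ.
    return Request(uri, **kwargs)


req = build_request(
    'http://quotes.toscrape.com',
    method='POST',
    headers={'accept': '*', 'user-agent': 'tiny scraper'},
)

print(req.get_method())    # HTTP method the request would use
print(req.header_items())  # urllib capitalizes header names when storing them
```

Any keyword that urllib.request.Request does not accept would raise a TypeError in this sketch, so it is worth checking the Request signature before adding new keywords.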
RETURNS:
Dictionary with the following keys: ['response'], ['text'], ['tags'] where
response - the response object returned by urllib.request.urlopen()
text - response text in "utf-8" encoding
tags - list of dictionaries with the following keys: ['tag'], ['attrs'], ['text'] where
tag - tag name
attrs - dictionary of tag attributes, e.g. {'class': 'text', 'id': 'card'}
text - textual node of a tag
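To make the nesting described above concrete, here is a small hand-written sample shaped like the documented return value. The data is invented for illustration and is not real output of parse(); it also assumes that an attribute missing from the HTML is simply absent from the attrs dictionary.

```python
# Hand-written sample shaped like the documented return value (not real output).
content = {
    'response': None,  # placeholder for the urlopen() response object
    'text': '<span class="text">Hello</span><a href="/about">(about)</a>',
    'tags': [
        {'tag': 'span', 'attrs': {'class': 'text'}, 'text': 'Hello'},
        {'tag': 'a', 'attrs': {'href': '/about'}, 'text': '(about)'},
    ],
}

first = content['tags'][0]      # first dictionary in the list of tags
print(first['tag'])             # tag name
print(first['attrs']['class'])  # value of one attribute
print(first['text'])            # text node of the tag

# Assumption: a missing attribute is not a key in attrs, so .get() avoids a KeyError.
second = content['tags'][1]
print(second['attrs'].get('class'))
```

Every lookup is a plain dictionary or list reference, which is the skill the library is meant to exercise.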
# import Tiny Scraper, assuming it's located in the same directory
import json  # used below to pretty-print the list of tags
from ts import parse
# make HTTP request
content = parse('http://quotes.toscrape.com')
# get response object
print('\n\nResponse object:\n', content['response'])
# get response text
print('\n\nHTML document:\n', content['text'])
# print all tag elements
print('\n\nAll tag elements:\n')
print(json.dumps(content['tags'], indent=2))
# print all tag names
print('\n\nAll tag names:\n')
for item in content['tags']:
    print(item['tag'])
# print all tag attributes
print('\n\nAll tag attributes:\n')
for item in content['tags']:
    print(item['attrs'])
# print all tag textual nodes
print('\n\nAll tag text:\n')
for item in content['tags']:
    print(item['text'])
# print all quotes
print('\n\nExtracted quotes:\n')
quotes = [
    item['text']
    for item in content['tags']
    # .get() avoids a KeyError for <span> tags that have no class attribute
    if item['tag'] == 'span' and
    item['attrs'].get('class') == 'text'
]
print(*quotes, sep='\n')
# print all authors
print('\n\nExtracted authors:\n')
authors = [
    item['text']
    for item in content['tags']
    # .get() avoids a KeyError for <small> tags that have no class attribute
    if item['tag'] == 'small' and
    item['attrs'].get('class') == 'author'
]
print(*authors, sep='\n')
# print all author links
print('\n\nAuthor detail URLs:\n')
links = [
    item['attrs']['href']
    for item in content['tags']
    if item['tag'] == 'a' and
    item['text'] == '(about)'
]
print(*links, sep='\n')
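# print all quote tags (the topic labels under each quote) with their links
print('\n\nAll tags:\n')
tags = [
    (item['text'], item['attrs']['href'])
    for item in content['tags']
    # .get() avoids a KeyError for <a> tags that have no class attribute
    if item['tag'] == 'a' and
    item['attrs'].get('class') == 'tag'
]
for text, href in tags:
    print(text, href)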
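The filters above all follow the same pattern: match a tag name, then match one or more attribute values. A minimal sketch of factoring that pattern into a function is shown below. find_tags is a hypothetical helper, not part of Tiny Scraper, and the sample list is hand-written so the block runs without a network request.

```python
def find_tags(tags, name, attrs=None):
    """Hypothetical helper: return tag dictionaries matching a name and attributes."""
    wanted = attrs or {}
    return [
        item for item in tags
        if item['tag'] == name
        # .get() treats a missing attribute as "no match" instead of raising KeyError
        and all(item['attrs'].get(key) == value for key, value in wanted.items())
    ]


# Hand-written sample in the same shape as content['tags'] (not real output).
sample_tags = [
    {'tag': 'span', 'attrs': {'class': 'text'}, 'text': 'A quote'},
    {'tag': 'small', 'attrs': {'class': 'author'}, 'text': 'An author'},
    {'tag': 'a', 'attrs': {'href': '/author/x'}, 'text': '(about)'},
    {'tag': 'a', 'attrs': {'class': 'tag', 'href': '/tag/life/'}, 'text': 'life'},
]

quotes = [item['text'] for item in find_tags(sample_tags, 'span', {'class': 'text'})]
tag_links = [item['attrs']['href'] for item in find_tags(sample_tags, 'a', {'class': 'tag'})]
all_links = find_tags(sample_tags, 'a')

print(quotes)
print(tag_links)
print(len(all_links))
```

Passing the attributes as a dictionary sidesteps the fact that class is a reserved word in Python and cannot be used as a keyword argument name.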