# pycrawl

A simple crawling utility for Python.
This library enables site crawling and data extraction with XPath and CSS selectors. It can also submit forms, including text fields, file uploads, and checkboxes.
## Requirements

- Python 3
## Methods

Name | Description |
---|---|
`send` | Set the value you want to submit to a form field. |
`submit` | Submit the form. |
`css` | Get nodes by CSS selector. |
`xpath` | Get nodes by XPath. |
`attr` | Get a node's attribute. |
`inner_text` | Get a node's inner text. |
`outer_text` | Get a node's outer text. |
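To make the selector methods concrete before the pycrawl examples below, here is a standalone sketch of node selection, attribute access, and inner text using only the standard library's `xml.etree.ElementTree`. This illustrates the concepts, not pycrawl's implementation:

```python
import xml.etree.ElementTree as ET

# a small well-formed document to select against
html = """
<html>
  <body>
    <div id="top">
      <a href="/first">First link</a>
      <a href="/second">Second link</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(html)

# XPath-style lookup (ElementTree supports a limited XPath subset)
links = root.findall('.//div/a')
print(links[1].get('href'))  # attribute access, like attr('href') => /second
print(links[0].text)         # inner text, like inner_text() => First link
```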
## Usage

```python
import pycrawl

url = 'http://www.example.com/'
doc = pycrawl.PyCrawl(
    url,
    user_agent='<user agent>',
    timeout=<timeout sec>,
    encoding=<encoding>,
)
```
```python
# access another url
doc.get('another url')

# get current url
doc.url

# get current site's html
doc.html

# get <table> tags as dict
doc.table

# search for nodes by css selector
#   tag   : css('name')
#   class : css('.name')
#   id    : css('#name')
doc.css('div')
doc.css('.main-text')
doc.css('#tadjs')

# search for nodes by xpath
doc.xpath('//*[@id="top"]/div[1]')

# other examples
doc.css('div').css('a')[2].attr('href')  # => string object
doc.css('p').inner_text()                # => string object
# indexing with "[0]" is optional when accessing the first node
```
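The `doc.table` helper above returns `<table>` contents as a dict. The exact shape is not documented here, but a stdlib sketch of one plausible mapping (first column to second column, purely an assumption for illustration) looks like this:

```python
import xml.etree.ElementTree as ET

html = """
<table>
  <tr><td>name</td><td>pycrawl</td></tr>
  <tr><td>language</td><td>Python</td></tr>
</table>
"""

root = ET.fromstring(html)

# map each row's first cell to its second cell
table = {}
for row in root.findall('.//tr'):
    cells = row.findall('td')
    table[cells[0].text] = cells[1].text

print(table)  # {'name': 'pycrawl', 'language': 'Python'}
```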
- Specify an attribute that identifies the target node.
- Set one of `value` (int or str), `selected` (bool), or `file_name` (str).
- Call `submit()` with an attribute that identifies the target form.
```python
# login
doc.send(id='id attribute', value='value to send')
doc.send(id='id attribute', value='value to send')
doc.submit(id='id attribute')  # submit the form

# post a file
doc.send(id='id attribute', file_name='target file name')

# checkbox
doc.send(id='id attribute', selected=True)   # check
doc.send(id='id attribute', selected=False)  # uncheck

# examples specifying other attributes
doc.send(name='name attribute', value='hello')
doc.send(class_='class attribute', value=100)
# when targeting by class attribute, write "class_=" ("class" is a reserved word)
```
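Conceptually, `submit()` gathers the values set by earlier `send()` calls and posts them as a form body. A minimal sketch of that encoding step using the standard library's `urllib.parse` (the field names and values are hypothetical, and this is an illustration rather than pycrawl's internals):

```python
from urllib.parse import urlencode

# fields collected by successive send() calls (hypothetical names/values)
fields = {
    'username': 'alice',
    'password': 'secret',
    'remember': 'on',  # a checked checkbox
}

# the application/x-www-form-urlencoded body a form submit would POST
body = urlencode(fields)
print(body)  # username=alice&password=secret&remember=on
```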
## Installation

```sh
$ pip install pycrawl
```
## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/averak/pycrawl.