Important notice: the pycurl backend has been dropped. The only network transport now is urllib3.
The project is in a slow refactoring stage. It is possible that no new features will be added.
Things that are planned (no time estimates):
- Refactoring the source code while keeping most of the external API unchanged
- Fixing bugs
- Annotating source code with type hints
- Improving the quality of the source code to comply with pylint and other linters
- Moving some features into external packages or moving external dependencies inside Grab
- Fixing memory leaks
- Improving test coverage
- Adding more platforms and Python versions to the test matrix
- Releasing new versions on PyPI
$ pip install -U grab
See details about installing Grab on different platforms here: http://docs.grablib.org/en/latest/usage/installation.html
Full documentation is available at grab.readthedocs.io
Grab is a Python web scraping framework. It provides a number of helpful methods to perform network requests, scrape websites, and process the scraped content:
- Automatic cookies (session) support
- HTTPS/SOCKS proxy support with or without authentication (see the sketch after this list)
- Keep-Alive support
- IDN support
- Tools to work with web forms
- Easy multipart file uploading
- Flexible customization of HTTP requests
- Automatic charset detection
- Powerful API to extract data from the DOM tree of HTML documents with XPath queries
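For instance, proxy support, charset detection, and XPath queries come together in a few lines. This is only a sketch: the proxy address and credentials are placeholders, and the setup() option names used here (proxy, proxy_type, proxy_userpwd) should be verified against the current documentation:

from grab import Grab

g = Grab()
# Route all requests through an authenticated SOCKS proxy
# (placeholder address and credentials; plain HTTP proxies are configured the same way)
g.setup(proxy='127.0.0.1:9050', proxy_type='socks5', proxy_userpwd='user:password')
g.go('https://example.com/')
# The body is decoded using the detected charset and the DOM tree is queried with XPath
print(g.doc.select('//title').text())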
Grab provides an interface called Spider to develop multithreaded website scrapers:
- Rules and conventions to organize crawling logic
- Multiple parallel network requests
- Automatic processing of network errors (failed tasks go back to a task queue)
- You can create network requests and parse responses with the Grab API (see above)
- Different backends for the task queue (in-memory, Redis, MongoDB); see the sketch after the Spider example below
- Tools to debug and collect statistics
Grab example:

import logging

from grab import Grab

logging.basicConfig(level=logging.DEBUG)

g = Grab()
# Log in to GitHub (replace the asterisks with real credentials)
g.go('https://github.com/login')
g.doc.set_input('login', '****')
g.doc.set_input('password', '****')
g.doc.submit()
g.doc.save('/tmp/x.html')
# Check that the logout button is present, i.e. the login succeeded
g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()
# Find the profile URL and open the list of repositories
home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
repo_url = home_url + '?tab=repositories'
g.go(repo_url)
# Print the name and absolute URL of each repository
for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
    print('%s: %s' % (elem.text(),
                      g.make_url_absolute(elem.attr('href'))))
Grab::Spider example:

import logging

from grab.spider import Spider, Task

logging.basicConfig(level=logging.DEBUG)


class ExampleSpider(Spider):
    def task_generator(self):
        # Generate one search task per language
        for lang in 'python', 'ruby', 'perl':
            url = 'https://www.google.com/search?q=%s' % lang
            yield Task('search', url=url, lang=lang)

    def task_search(self, grab, task):
        # Print the URL displayed in the first search result
        print('%s: %s' % (task.lang,
                          grab.doc('//div[@class="s"]//cite').text()))


bot = ExampleSpider(thread_number=2)
bot.run()
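The example above keeps pending tasks in the default in-memory queue. Switching to another backend is a small change; the sketch below assumes that 'redis' is an accepted backend name for setup_queue(), so check the Spider documentation for the exact arguments:

bot = ExampleSpider(thread_number=2)
# Assumed backend name; connection options for the chosen backend
# can be passed as extra keyword arguments (see the Spider docs)
bot.setup_queue(backend='redis')
bot.run()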
Telegram English chat: https://t.me/grablab
Telegram Russian chat: https://t.me/grablab_ru