/os-pywf

Trying PyWorkflow

Primary LanguagePythonMIT LicenseMIT

os-pywf

Build Status codecov PyPI - Python Version PyPI

Workflow and it's Python binding PyWorkflow are great async frameworks.

This project is trying to explore the power of Workflow, provide command line tools and high level APIs for real world development.

Install

pip install os-pywf

Commands

os-pywf command can be used after installation. You can get help information with --help option. Global settings of Workflow can be specified, ENVs is not supported yet.

The subcommands with planning tag will be developed later, can not be used right now.

$ os-pywf --help
Usage: os-pywf [OPTIONS] COMMAND [ARGS]...

  Command line tool for os-pywf.

Options:
  --version                       Show the version and exit.
  Workflow:                       Workflow global settings.
    --compute-threads INTEGER     Number of compute threads.  [default: 4]
    --handler-threads INTEGER     Number of handler threads.  [default: 4]
    --poller-threads INTEGER      Number of poller threads.  [default: 4]
    --dns-threads INTEGER         Number of dns threads.  [default: 4]
    --dns-ttl-default INTEGER     Default seconds of dns ttl.  [default:
                                  43200]

    --dns-ttl-min INTEGER         Min seconds of dns ttl.  [default: 180]
    --max-connections INTEGER     Max number of connections.  [default: 200]
    --connection-timeout INTEGER  Connect timeout(ms).  [default: 10000]
    --response-timeout INTEGER    Response timeout(ms).  [default: 10000]
    --ssl-connect-timeout INTEGER
                                  SSL connect timeout(ms).  [default: 10000]
  --help                          Show this message and exit.

Commands:
  curl    HTTP client inspired by curl (beta).
  mysql   MySQL client (planning).
  proxy   HTTP proxy (planning).
  redis   Redis client (planning).
  run     Run runnable objects of pywf (planning).
  spider  Web spider (planning).
  web     Web server (planning).

curl

This subcommand is inspired by curl. It works as curl and provides more useful features especially invoke Python function as response callback, which make it flexible and easy to extend.

$ os-pywf curl --help
Usage: os-pywf curl [OPTIONS] [URLS]...

  HTTP client inspired by curl (beta).

Options:
  Curl options:                   Options same as curl.
    -0, --http1.0                 Use HTTP 1.0
    -A, --user-agent TEXT         User-Agent to send to server.  [default: os-
                                  pywf/0.0.1]

    -b, --cookie TEXT             String or file to read cookies from.
    -c, --cookie-jar FILENAME     Write cookies to this file after operation.
    -d, --data TEXT               HTTP POST data.
    --data-urlencode TEXT         HTTP POST data url encoded.
    -e, --referer TEXT            Referer URL.
    -F, --form TEXT               Specify HTTP multipart POST data.
    -H, --header TEXT             Custom header to pass to server.
    -L, --location                Follow redirects.
    --max-filesize INTEGER        Maximum data size (in bytes) to download.
    --max-redirs INTEGER          Maximum number of redirects allowed.
                                  [default: 30]

    -u, --user TEXT               Specify the user name and password to use
                                  for  server  authentication.

    --no-keepalive                Disable keepalive.
    --retry INTEGER               Maximum retries when request fail.
                                  [default: 0]

    --retry-delay FLOAT           Time between two retries(s).  [default: 0]
    -x, --proxy TEXT              Specify proxy.
    -X, --request [DELETE|GET|HEAD|OPTIONS|PATCH|POST|PUT]
                                  Request method. [default: GET]
  Additional options:             Additional options.
    --send-timeout FLOAT          Send request timeout(s).  [default: -1]
    --receive-timeout FLOAT       Receive response timeout(s).  [default: -1]
    --startup TEXT                Function invoked when startup.  [default:
                                  os_pywf.commands.curl.startup]

    --cleanup TEXT                Function invoked when cleanup.  [default:
                                  os_pywf.commands.curl.cleanup]

    --callback TEXT               Function invoked when response received.
                                  [default: os_pywf.commands.curl.callback]

    --errback TEXT                Function invoked when request fail (callback
                                  will be invoked when no errback).

    --parallel                    Send requests parallelly.
    --log-level [CRITICAL|ERROR|WARNING|INFO|DEBUG]
                                  Log level.  [default: INFO]
    --debug                       Enable debug mode.
  --help                          Show this message and exit.

Example:

# app.py
def callback(task, request, response):
    print(request, response)
os-pywf curl http://www.example.com/ --callback app.callback

Features:

  • Same options as curl, command line can be used by curl directly
  • Support HTTP version 1.0/1.1
  • Auto manipulate cookies. Cookies can be specified by command line or read from file. Cookies can be saved to file
  • Support post urlencode data
  • Support upload files as multipart form
  • Support redirect. Response history can be accessed with response.history
  • Support retry and retry interval. The program can be quickly canceled when retrying
  • All requests can be send parallelly (async not multithread)
  • Custom startup/cleanup/callback/errback function as plugins
  • Callback with request and response parameters of the most famous Requests library
  • Support auto decompress response data (v0.0.2)
  • Support set proxy for http (not https) request (v0.0.3)
  • Generate requests from callback and download continuously (v0.0.4)

Issues/Not support:

  • Configure proxy
  • Use your own cert
  • Ctrl+C to quit program slowly when downloading slow response

The command provides two types of options, curl options and additional options. Run os-pywf curl --help to get the full help information.

curl options are same as the options of curl. Usage can be found on man page of curl and help descriptions.

additional options enhance curl and provide additional features.

  • --send-timeout, send request timeout (second), default (-1) is no timeout

  • --receive-timeout, receive response timeout (second), default (-1) behavior depends on some other settings such as response timeout

  • --startup, a function invoked when startup, before download pages. The function have only one parameter which is the series or the parallel of Workflow

  • --cleanup, a function invoked when cleanup, after all downloads finish. The function have only one parameter same as startup function

    # app.py
    def startup(runner):
        pass
    
    def cleanup(runner):
        pass
    
    os-pywf curl http://www.example.com/ --startup app.startup --cleanup app.cleanup
    
  • --callback, --errback functions invoked when response received or fail, see more details

  • --parallel, requests will be send parallelly. Attention, the framework is asynchronous, all callback/errback invoked in one thread. Block operations in callback/errback will block the whole world

APIs

os_pywf.http.client

This module provides hight level HTTP client APIs. Inspired by the most famous Python HTTP library Requests, the APIs are nearly the same.

All of the request APIs do not send request and block wait response, they all return HttpTask object for Workflow and invoke callback function when response downloaded.

We wrap the PyWorkflow HttpTask and provide more convenient callback with request and response as additional parameters, they all typical instance of Requests library as you know.

import pywf
from os_pywf.http import client

def callback(task, request, response):
    print(request, response)

task = client.get("http://www.example.com/", callback=callback)
task.start()
pywf.wait_finish()

We provide more useful features which PyWorkflow not support directly:

  • session with cookies persistence
  • redirect responses history
  • retry interval and quick cancel
  • authentication
  • post urlencode data and multipart files upload
  • auto decompress response data (v0.0.2)
  • set proxy for http (not https) request (v0.0.3)

You can use Session to configure same settings of a group tasks, it also auto manipulate cookies and provide cancel function to cancel all tasks create by the same session. You can create Session as normal class or as a context manager:

import pywf
from os_pywf.http import client
from os_pywf.utils import create_series_work

def callback(task, request, response):
    print(request, response)
    
series = create_series_work()
headers = {"User-Agent": "os-pywf/beta"}
with client.Session(headers=headers, callback=callback) as session:
    for url in ["http://www.example.com/", "http://httpbin.org/"]:
        task = session.get(url)
        series.push_back(task)
series.start()
pywf.wait_finish()

Session can be canceled, when canceled the tasks created by the session which not started will be destroyed, running task will still run until finish but callback will not invoked.

...
# register cancel for Ctrl+C 
with client.Session() as session:
    def _cancel(signum, frame):    
        session.cancel()
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, _cancel)
...

callback/errback

For callback async type of Workflow, we provide two functions as request/session parameters for framework: callback and errback

We wrap PyWorkflow with most famous Python HTTP library Requests and provide more powerful callback and errback

  • callback, invoked when response received, three parameters: task, request, response

    def callback(task, request, response):
        pass
    
    • task, the PyWorkflow HttpTask object
    • request, requests.PreparedRequest object, it is the original request even though there are retries and redirects
    • response, requests.Response object, it is the final response when there are retries and redirects. You can get all the response when redirect occur. If not set errback function, the response will be os_pywf.exceptions.Failure object when transaction fail (all HTTP response treat as success)
  • errback, invoked when transaction fail. It can be ignored, all of the response and fail will invoke callback function, three parameters: task, request, failure

    def errback(task, request, failure):
        pass
    
    • task, the PyWorkflow HttpTask object
    • request, same as the parameter of callback
    • Failure, os_pywf.exceptions.Failure object, it has two properties: exception and value. The value property maybe None or requests.Response depends on the fail situation
  • both callback and errback can have return value (from v0.0.4) for framework to schedule. There are several types object can be returned

    • str,must be URL,it will be wrapped with session as HttpTask and add to the head of the series
    • requests.Request, it will be wrapped with session as HttpTask and add to the head of the series
    • requests.PreparedRequest, it will be wrapped without session as HttpTask and add to the head of the series
    • pywf.SubTask, it will be add to the head of the series
    • list, the elements will be treated as above object add to the head of the series from last to first
    • tuple, first element treated as above object, second element will add to the tail of the series

os_pywf.utils

  • create_series_work, wrap the create_series_work of PyWorkflow, you can pass arbitrary tasks to create series.

  • create_timer_task, wrap the create_timer_task of PyWorkflow. It split the wait time into small time pieces, so it can be canceled as soon as possible.

    You can pass a threading.Event object as cancel parameter.

os_pywf.exceptions

  • Failure, failure for usually for errback, two properties: exception and value. The real value object depend on fail situation
  • WFException, exception about task fail, two properties: state and code. state come from task.get_state(), code come from task.get_error(). You can get human readable error string by use built-in str function.

Unit Tests

sh scripts/test.sh

License

MIT licensed.