Command line client for Scrapyd server

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Scrapyd-client

Scrapyd-client is a client for Scrapyd. It provides:

Command line tools:

  • scrapyd-deploy, to deploy your project to a Scrapyd server
  • scrapyd-client, to interact with your project once deployed

Python client:

  • ScrapydClient, to interact with Scrapyd from your Python code

It is configured using the Scrapy configuration file.

scrapyd-deploy

Deploying your project to a Scrapyd server involves:

  1. Eggifying your project.
  2. Uploading the egg to the Scrapyd server through the addversion.json webservice.

The scrapyd-deploy tool automates the process of building the egg and pushing it to the target Scrapyd server.
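As a rough illustration of the second step, the sketch below assembles the URL and form fields for an addversion.json request. The helper name addversion_request is hypothetical, and this is not scrapyd-deploy's own code. The field names project, version and egg follow Scrapyd's API documentation; the upload itself (a multipart POST with the egg attached as the egg file field) is left out.

```python
from urllib.parse import urljoin


def addversion_request(target_url, project, version):
    """Return the endpoint URL and form fields for an addversion.json call.

    Hypothetical helper for illustration; the egg file itself would be
    attached separately as the multipart file field named "egg".
    """
    # Ensure a trailing slash so urljoin appends to the target URL instead
    # of replacing its last path segment.
    base = target_url.rstrip("/") + "/"
    url = urljoin(base, "addversion.json")
    fields = {"project": project, "version": version}
    return url, fields


url, fields = addversion_request("http://localhost:6800", "myproject", "1287453519")
print(url)
print(fields)
```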

Deploying a project

  1. Change (cd) to the root of your project (the directory containing the scrapy.cfg file)

  2. Eggify your project and upload it to the target:

    scrapyd-deploy <target> -p <project>

If you don't have a setup.py file in the root of your project, one will be created. If you have one, it must set the entry_points keyword argument in the setup() function call, for example:

from setuptools import setup, find_packages

setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    entry_points = {'scrapy': ['settings = projectname.settings']},
)

If the command is successful, you should see a JSON response, like:

Deploying myproject-1287453519 to http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "spiders": ["spider1", "spider2"]}
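If you script your deployments, the JSON body in the output above can be checked programmatically. A minimal sketch, assuming you have already captured the response body as a string (the variable names are illustrative):

```python
import json

# Response body copied from the example output above.
body = '{"status": "ok", "spiders": ["spider1", "spider2"]}'

response = json.loads(body)
deployed = response["status"] == "ok"
spiders = response.get("spiders", [])
print(deployed, spiders)
```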

To save yourself from having to specify the target and project, you can configure your defaults in the Scrapy configuration file.

Versioning

By default, scrapyd-deploy uses the current timestamp for generating the project version. You can pass a custom version using --version:

scrapyd-deploy <target> -p <project> --version <version>

See Scrapyd's documentation on how it determines the latest version.
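The default version in the earlier example output (1287453519) looks like a Unix timestamp in whole seconds. A minimal sketch of producing such a value yourself, for instance to pass to --version from a script (an illustration, not scrapyd-deploy's own code):

```python
import time

# Whole seconds since the Unix epoch, rendered as a string.
version = str(int(time.time()))
print(version)
```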

If you use Mercurial or Git, you can use HG or GIT respectively as the argument supplied to --version to use the current revision as the version. You can save yourself having to specify the version parameter by adding it to your target's entry in scrapy.cfg:

[deploy]
...
version = HG

Note: The version keyword argument in the setup() function call in the setup.py file has no meaning to Scrapyd.

Include dependencies

  1. Create a requirements.txt file at the root of your project, alongside the scrapy.cfg file

  2. Use the --include-dependencies option when building or deploying your project:

    scrapyd-deploy --include-dependencies

Alternatively, you can install the dependencies directly on the Scrapyd server.

Include data files

  1. Create a setup.py file at the root of your project, alongside the scrapy.cfg file, if you don't have one:

    scrapyd-deploy --build-egg=/dev/null

  2. Set the package_data and include_package_data keyword arguments in the setup() function call in the setup.py file. For example:

    from setuptools import setup, find_packages
    
    setup(
        name         = 'project',
        version      = '1.0',
        packages     = find_packages(),
        entry_points = {'scrapy': ['settings = projectname.settings']},
        package_data = {'projectname': ['path/to/*.json']},
        include_package_data = True,
    )

Local settings

You may want to keep certain settings local and not have them deployed to Scrapyd.

  1. Create a local_settings.py file at the root of your project, alongside the scrapy.cfg file

  2. Add the following to your project's settings file:

    try:
        from local_settings import *
    except ImportError:
        pass

scrapyd-deploy doesn't deploy anything outside of the project module, so the local_settings.py file won't be deployed.

Troubleshooting

  • Problem: A settings file for local development is being included in the egg.

    Solution: See Local settings. Alternatively, exclude the module from the egg. If using scrapyd-client's default setup.py file, change the find_packages() call:

    setup(
        name         = 'project',
        version      = '1.0',
        packages     = find_packages(),
        entry_points = {'scrapy': ['settings = projectname.settings']},
    )

    to:

    setup(
        name         = 'project',
        version      = '1.0',
        packages     = find_packages(exclude=["myproject.devsettings"]),
        entry_points = {'scrapy': ['settings = projectname.settings']},
    )

  • Problem: Code using __file__ breaks when run in Scrapyd.

    Solution: Use pkgutil.get_data instead. For example, change:

    path = os.path.dirname(os.path.realpath(__file__))  # BAD
    open(os.path.join(path, "tools", "json", "test.json"), "rb").read()

    to:

    import pkgutil
    pkgutil.get_data("projectname", "tools/json/test.json")

  • Be careful when writing to disk in your project, because Scrapyd most likely runs as a different user, which may not have write access to certain directories. If you can, avoid writing to disk, and always use tempfile for temporary files.

  • If you use a proxy, use the HTTP_PROXY, HTTPS_PROXY, NO_PROXY and/or ALL_PROXY environment variables, as documented by the requests package.
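The advice above on temporary files can be sketched as follows. This is a minimal illustration using the standard tempfile module; the file contents and variable names are made up for the example.

```python
import json
import os
import tempfile

# Create the scratch file in the platform's temporary directory, which the
# user running the process can normally write to.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"example": True}, tmp)
    tmp_path = tmp.name

# Read the data back once the file has been closed.
with open(tmp_path) as f:
    data = json.load(f)

# Clean up explicitly, because delete=False was used above.
os.remove(tmp_path)
print(data)
```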

scrapyd-client

For a reference on each subcommand, run scrapyd-client <subcommand> --help.

Where filtering with wildcards is possible, patterns are matched with fnmatch. The --project option can be omitted if a project is found in a scrapy.cfg file.
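To get a feel for how fnmatch patterns behave, here is a small sketch using Python's standard fnmatch module. The spider names are invented for the example, and the snippet only demonstrates the pattern syntax, not scrapyd-client's internal filtering code.

```python
from fnmatch import fnmatch

# Hypothetical spider names.
spiders = ["news_daily", "prices_daily", "archive_weekly"]

# "*" matches any run of characters, so this keeps names ending in "_daily".
daily = [name for name in spiders if fnmatch(name, "*_daily")]
print(daily)
```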

deploy

This is a wrapper around scrapyd-deploy.

targets

Lists all targets:

scrapyd-client targets

projects

Lists all projects of a Scrapyd instance:

# lists all projects on the default target
scrapyd-client projects
# lists all projects from a custom URL
scrapyd-client -t http://scrapyd.example.net projects

schedule

Schedules one or more spiders to be executed:

# schedules any spider
scrapyd-client schedule
# schedules all spiders from the 'knowledge' project
scrapyd-client schedule -p knowledge \*
# schedules any spider from any project whose name ends with '_daily'
scrapyd-client schedule -p \* \*_daily
# schedules spider1 in project1 specifying settings
scrapyd-client schedule -p project1 spider1 --arg 'setting=DOWNLOADER_MIDDLEWARES={"my.middleware.MyDownloader": 610}'

spiders

Lists spiders of one or more projects:

# lists all spiders
scrapyd-client spiders
# lists all spiders from the 'knowledge' project
scrapyd-client spiders -p knowledge

ScrapydClient

Interact with Scrapyd from your Python code.

from scrapyd_client import ScrapydClient
client = ScrapydClient()

for project in client.projects():
    print(client.jobs(project=project))

Scrapy configuration file

Targets

You can define a Scrapyd target in your project's scrapy.cfg file. Example:

[deploy]
url = http://scrapyd.example.com/api/scrapyd
username = scrapy
password = secret
project = projectname

You can now deploy your project without the <target> argument or -p <project> option:

scrapyd-deploy

If you have multiple targets, add the target name in the section name. Example:

[deploy:targetname]
url = http://scrapyd.example.com/api/scrapyd

[deploy:another]
url = http://other.example.com/api/scrapyd
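Because scrapy.cfg is an INI-style file, the target sections above can be inspected with Python's standard configparser. The following is a minimal sketch, not scrapyd-client's own parsing code; the label "default" for the unnamed [deploy] section is an assumption made for this example.

```python
from configparser import ConfigParser

# Same sections as the multi-target example above.
CFG = """
[deploy:targetname]
url = http://scrapyd.example.com/api/scrapyd

[deploy:another]
url = http://other.example.com/api/scrapyd
"""

parser = ConfigParser()
parser.read_string(CFG)

targets = {}
for section in parser.sections():
    if section == "deploy":
        name = "default"  # label chosen for this sketch (assumption)
    elif section.startswith("deploy:"):
        name = section.split(":", 1)[1]
    else:
        continue  # not a deploy target
    targets[name] = dict(parser[section])

print(targets)
```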

If you deploy from a continuous deployment (CD) pipeline, you do not need to commit your secrets to your repository. You can use environment variable expansion instead:

[deploy]
url = $SCRAPYD_URL
username = $SCRAPYD_USERNAME
password = $SCRAPYD_PASSWORD

or using this syntax:

[deploy]
url = ${SCRAPYD_URL}
username = ${SCRAPYD_USERNAME}
password = ${SCRAPYD_PASSWORD}
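Both spellings behave like shell-style variable expansion. As an illustration, Python's standard os.path.expandvars accepts the same two forms. Whether scrapyd-client uses this exact function is not stated here, so treat the sketch as an analogy; the environment values are placeholders.

```python
import os

# Hypothetical values standing in for secrets set by a CI/CD system.
os.environ["SCRAPYD_URL"] = "http://scrapyd.example.com/api/scrapyd"
os.environ["SCRAPYD_USERNAME"] = "scrapy"

# os.path.expandvars understands both the $NAME and ${NAME} forms.
url = os.path.expandvars("$SCRAPYD_URL")
username = os.path.expandvars("${SCRAPYD_USERNAME}")
print(url, username)
```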

To deploy to one target, run:

scrapyd-deploy targetname -p <project>

To deploy to all targets, use the -a option:

scrapyd-deploy -a -p <project>

While your target needs to be defined with its URL in scrapy.cfg, you can use netrc for username and password, like so:

machine scrapyd.example.com
    login scrapy
    password secret
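To check how such an entry is read, you can parse it with Python's standard netrc module. The sketch below writes the entry to a temporary file so that it is self-contained; in practice the credentials usually live in ~/.netrc. It illustrates the file format only and is not scrapyd-client's own lookup code.

```python
import netrc
import os
import tempfile

NETRC_TEXT = """machine scrapyd.example.com
    login scrapy
    password secret
"""

# Write the example entry to a throwaway file.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write(NETRC_TEXT)
    netrc_path = tmp.name

# authenticators() returns a (login, account, password) tuple for the host.
auth = netrc.netrc(netrc_path).authenticators("scrapyd.example.com")
os.remove(netrc_path)

login, _account, password = auth
print(login, password)
```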