bellisk/ckanext-harvest

sudo apt-get update
sudo apt-get install redis-server

ckan.harvest.mq.type = redis

sudo apt-get update
sudo apt-get install rabbitmq-server

ckan.harvest.mq.type = amqp

$ . /usr/lib/ckan/default/bin/activate

(pyenv) $ pip install -e git+https://github.com/ckan/ckanext-harvest.git#egg=ckanext-harvest

(pyenv) $ cd /usr/lib/ckan/default/src/ckanext-harvest/
(pyenv) $ pip install -r requirements.txt

ckan.plugins = harvest ckan_harvester

ckan.harvest.mq.type = redis

sudo service apache2 restart

http://localhost/harvest

ckan.harvest.log_scope = 0

ckan.harvest.log_timeframe = 10

ckan.harvest.log_level = info

$ curl {ckan_url}/api/3/action/harvest_log_list

$ curl {ckan_url}/api/3/action/harvest_log_list?level=info

{
  "help": "http://127.0.0.1:5000/api/3/action/help_show?name=harvest_log_list",
  "success": true,
  "result": [
    {"content": "Sent job aa987717-2316-4e47-b0f2-cbddfb4c4dfc to the gather queue", "level": "INFO", "created": "2016-06-03 10:59:40.961657"},
    {"content": "Sent job aa987717-2316-4e47-b0f2-cbddfb4c4dfc to the gather queue", "level": "INFO", "created": "2016-06-03 10:59:40.951548"}
  ]
}
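The response above can also be consumed from a script. The following is a minimal sketch, assuming the response shape shown above (a `success` flag and a `result` list whose entries carry `content`, `level` and `created`). The helper name `parse_harvest_log` is hypothetical and not part of ckanext-harvest.

```python
import json
from datetime import datetime


def parse_harvest_log(response_text, level=None):
    """Return log entries from a harvest_log_list response, newest first.

    Hypothetical helper: assumes the response shape shown above.
    """
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("harvest_log_list call did not succeed")

    entries = payload.get("result", [])
    if level is not None:
        # Levels appear upper-cased in the response (e.g. "INFO").
        entries = [e for e in entries if e["level"].upper() == level.upper()]

    # "created" is formatted like "2016-06-03 10:59:40.961657".
    return sorted(
        entries,
        key=lambda e: datetime.strptime(e["created"], "%Y-%m-%d %H:%M:%S.%f"),
        reverse=True,
    )
```

The text passed in would typically be the body returned by the `curl` calls above; filtering locally by `level` mirrors the `?level=info` query parameter.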

harvester source {name} {url} {type} [{title}] [{active}] [{owner_org}] [{frequency}] [{config}]
  - create new harvest source

harvester source {source-id/name}
  - shows a harvest source

harvester rmsource {source-id/name}
  - remove (deactivate) a harvester source, whilst leaving any related
    datasets, jobs and objects

harvester clearsource {source-id/name}
  - clears all datasets, jobs and objects related to a harvest source,
    but keeps the source itself

harvester clearsource-history [{source-id}] [-k]
  - clears all jobs and objects related to a harvest source, but keeps the
    source itself. The datasets imported from the harvest source will
    **NOT** be deleted.
    If a source id is given, only the history of that harvest source is
    cleared. If no source id is given, the history of all harvest sources
    (maximum is 1000) is cleared.

    To keep the currently active jobs use the -k option.

harvester sources [all]
  - lists harvest sources
    If 'all' is given, inactive sources are also shown

harvester job {source-id/name}
  - create new harvest job

harvester jobs
  - lists harvest jobs

harvester job-abort {source-id/name}
  - marks a job as "Aborted" so that the source can be restarted afresh.
    It also ensures that the status of the job's harvest objects is marked
    as finished. You should ensure that neither the job nor its objects are
    currently in the gather/fetch queues.

harvester run
  - starts any harvest jobs that have been created by putting them onto
    the gather queue. Also checks running jobs - if finished it
    changes their status to Finished.

harvester run-test {source-id/name}
  - runs a harvest - for testing only.
    This does all the stages of the harvest (creates job, gather, fetch,
    import) without involving the web UI or the queue backends. This is
    useful for testing a harvester without having to fire up
    gather/fetch_consumer processes, as is done in production.

harvester run-test {source-id/name} force-import=guid1,guid2...
  - forces an import of particular datasets. Useful for targeting a
    dataset for dev purposes or when forcing imports on other environments.

harvester gather-consumer
  - starts the consumer for the gathering queue

harvester fetch-consumer
  - starts the consumer for the fetching queue

harvester purge-queues
  - removes all jobs from fetch and gather queue
    WARNING: if using Redis, this command purges all data in the current
    Redis database

harvester clean-harvest-log
  - Clean-up mechanism for the harvest log table.
    You can configure the time frame through the configuration
    parameter 'ckan.harvest.log_timeframe'. The default time frame is 30 days

harvester [-j] [-o] [--segments={segments}] import [{source-id}]
  - perform the import stage with the last fetched objects, for a certain
    source or a single harvest object. Please note that no objects will
    be fetched from the remote server. It will only affect the objects
    already present in the database.

    To import a particular harvest source, specify its id as an argument.
    To import a particular harvest object use the -o option.
    To import a particular package use the -p option.

    You will need to specify the -j flag in cases where the datasets are
    not yet created (e.g. first harvest, or all previous harvests have
    failed)

    The --segments flag takes a string of hex digits that represent which of
    the 16 harvest object segments to import, e.g. 15af will run segments 1, 5, a and f

harvester job-all
  - create new harvest jobs for all active sources.

harvester reindex
  - reindexes the harvest source datasets
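The `--segments` value of the `import` command above is a string of hex digits, one per segment. The following is a minimal sketch of how such a string could be interpreted. The helper names are hypothetical, and the rule that an object's segment is the first hex digit of the MD5 hash of its guid is an assumption made for illustration, not something stated in this document.

```python
import hashlib

HEX_DIGITS = set("0123456789abcdef")


def parse_segments(segments):
    """Turn a --segments string such as '15af' into a set of hex digits."""
    digits = set(segments.lower())
    invalid = digits - HEX_DIGITS
    if invalid:
        raise ValueError("not hex digits: %s" % "".join(sorted(invalid)))
    return digits


def in_segments(guid, segments):
    """Tell whether an object with this guid falls into the given segments.

    Assumption: the segment is the first hex digit of md5(guid).
    """
    first_digit = hashlib.md5(guid.encode("utf-8")).hexdigest()[0]
    return first_digit in parse_segments(segments)
```

Splitting the 16 segments across several invocations (for example `01234567` and `89abcdef`) would let two import runs cover disjoint sets of objects under this assumption.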

ckan.plugins = harvest ckan_harvester

{
 "api_version": 1,
 "default_tags": [{"name": "geo"}, {"name": "namibia"}],
 "default_groups": ["science", "spend-data"],
 "default_extras": {"encoding":"utf8", "harvest_url": "{harvest_source_url}/dataset/{dataset_id}"},
 "override_extras": true,
 "organizations_filter_include": [],
 "organizations_filter_exclude": ["remote-organization"],
 "user":"harvester-user",
 "api_key":"<REMOTE_API_KEY>",
 "read_only": true,
 "remote_groups": "only_local",
 "remote_orgs": "create"
}
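Because the source configuration is entered as a JSON string, a quick check before saving it can catch type mistakes early. The following is a minimal sketch, assuming the value types seen in the example above. `check_harvest_config` and `EXPECTED_TYPES` are hypothetical names, and the rules are inferred from the example rather than taken from the extension's own validation.

```python
import json

# Expected Python types per key, inferred from the example above
# (an assumption, not the extension's own validation rules).
EXPECTED_TYPES = {
    "api_version": int,
    "default_tags": list,
    "default_groups": list,
    "default_extras": dict,
    "override_extras": bool,
    "organizations_filter_include": list,
    "organizations_filter_exclude": list,
    "user": str,
    "api_key": str,
    "read_only": bool,
    "remote_groups": str,
    "remote_orgs": str,
}


def check_harvest_config(config_text):
    """Parse a source config string and return a list of problems found."""
    try:
        config = json.loads(config_text)
    except ValueError as e:
        return ["not valid JSON: %s" % e]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key, value in config.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is not None and not isinstance(value, expected):
            problems.append("%s should be of type %s" % (key, expected.__name__))

    # Tags in the example are objects with a "name" key.
    tags = config.get("default_tags")
    if isinstance(tags, list):
        if not all(isinstance(t, dict) and "name" in t for t in tags):
            problems.append("each default tag should be an object with a 'name'")
    return problems
```

An empty list means no problems were found; unknown keys are ignored so that harvester-specific options still pass.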

from ckanext.harvest.harvesters.ckanharvester import CKANHarvester

class MySiteCKANHarvester(CKANHarvester):

    def modify_package_dict(self, package_dict, harvest_object):

        # Set a default custom field

        package_dict['remote_harvest'] = True

        # Add tags
        package_dict['tags'].append({'name': 'sdi'})

        return package_dict

# setup.py

entry_points='''
    [ckan.plugins]
    my_site=ckanext.my_site.plugin:MySitePlugin
    my_site_ckan_harvester=ckanext.my_site.harvesters:MySiteCKANHarvester
'''

# ini file
ckan.plugins = ... my_site my_site_ckan_harvester

from ckan.plugins.core import SingletonPlugin, implements
from ckanext.harvest.interfaces import IHarvester

class MyHarvester(SingletonPlugin):
    '''
    A Test Harvester
    '''
    implements(IHarvester)

def info(self):
    '''
    Harvesting implementations must provide this method, which will return
    a dictionary containing different descriptors of the harvester. The
    returned dictionary should contain:

    * name: machine-readable name. This will be the value stored in the
      database, and the one used by ckanext-harvest to call the appropriate
      harvester.
    * title: human-readable name. This will appear in the form's select box
      in the WUI.
    * description: a small description of what the harvester does. This
      will appear on the form as a guidance to the user.

    A complete example may be::

        {
            'name': 'csw',
            'title': 'CSW Server',
            'description': 'A server that implements OGC's Catalog Service
                            for the Web (CSW) standard'
        }

    :returns: A dictionary with the harvester descriptors
    '''

def validate_config(self, config):
    '''

    [optional]

    Harvesters can provide this method to validate the configuration
    entered in the form. It should return a single string, which will be
    stored in the database.  Exceptions raised will be shown in the form's
    error messages.

    :param config: Config string coming from the form
    :returns: A string with the validated configuration options
    '''

def get_original_url(self, harvest_object_id):
    '''

    [optional]

    This optional but highly recommended method allows harvesters to return
    the URL to the original remote document, given a Harvest Object id.
    Note that by getting the harvest object you have access to its guid as
    well as the object source, which has the URL.
    This URL will be used on error reports to help publishers link to the
    original document that has the errors. If this method is not provided
    or no URL is returned, only a link to the local copy of the remote
    document will be shown.

    Examples:
        * For a CKAN record: http://{ckan-instance}/api/rest/{guid}
        * For a WAF record: http://{waf-root}/{file-name}
        * For a CSW record: http://{csw-server}/?Request=GetElementById&Id={guid}&...

    :param harvest_object_id: HarvestObject id
    :returns: A string with the URL to the original document
    '''

def gather_stage(self, harvest_job):
    '''
    The gather stage will receive a HarvestJob object and will be
    responsible for:
        - gathering all the necessary objects to fetch at a later
          stage (e.g. for a CSW server, perform a GetRecords request)
        - creating the necessary HarvestObjects in the database, specifying
          the guid and a reference to its job. The HarvestObjects need a
          reference date with the last modified date for the resource, this
          may need to be set in a different stage depending on the type of
          source.
        - creating and storing any suitable HarvestGatherErrors that may
          occur.
        - returning a list with all the ids of the created HarvestObjects.
        - to abort the harvest, create a HarvestGatherError and raise an
          exception. Any created HarvestObjects will be deleted.

    :param harvest_job: HarvestJob object
    :returns: A list of HarvestObject ids
    '''

def fetch_stage(self, harvest_object):
    '''
    The fetch stage will receive a HarvestObject object and will be
    responsible for:
        - getting the contents of the remote object (e.g. for a CSW server,
          perform a GetRecordById request).
        - saving the content in the provided HarvestObject.
        - creating and storing any suitable HarvestObjectErrors that may
          occur.
        - returning True if everything is ok (ie the object should now be
          imported), "unchanged" if the object didn't need harvesting after
          all (ie no error, but don't continue to import stage) or False if
          there were errors.

    :param harvest_object: HarvestObject object
    :returns: True if successful, 'unchanged' if nothing to import after
              all, False if not successful
    '''

def import_stage(self, harvest_object):
    '''
    The import stage will receive a HarvestObject object and will be
    responsible for:
        - performing any necessary action with the fetched object (e.g.
          create, update or delete a CKAN package).
          Note: if this stage creates or updates a package, a reference
          to the package should be added to the HarvestObject.
        - setting the HarvestObject.package (if there is one)
        - setting the HarvestObject.current for this harvest:
           - True if successfully created/updated
           - False if successfully deleted
        - setting HarvestObject.current to False for previous harvest
          objects of this harvest source if the action was successful.
        - creating and storing any suitable HarvestObjectErrors that may
          occur.
        - creating the HarvestObject - Package relation (if necessary)
        - returning True if the action was done, "unchanged" if the object
          didn't need harvesting after all or False if there were errors.

    NB You can run this stage repeatedly using the 'harvester import' command.

    :param harvest_object: HarvestObject object
    :returns: True if the action was done, "unchanged" if the object didn't
              need harvesting after all or False if there were errors.
    '''
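The return values described in the docstrings above (True, "unchanged" or False) decide whether an object moves on to the next stage. The following is a toy sketch of that control flow, with no CKAN dependencies. `run_pipeline` and `InMemoryHarvester` are hypothetical names, and plain dicts stand in for HarvestJob and HarvestObject; the real extension drives the stages through its queues and stores objects in the database.

```python
def run_pipeline(harvester, job):
    """Drive gather -> fetch -> import for one job and tally the outcomes.

    Toy driver for illustration only.
    """
    counts = {"imported": 0, "unchanged": 0, "errored": 0}
    for obj in harvester.gather_stage(job):
        result = harvester.fetch_stage(obj)
        if result is True:
            # Only objects that fetched successfully move on to import.
            result = harvester.import_stage(obj)
        if result is True:
            counts["imported"] += 1
        elif result == "unchanged":
            counts["unchanged"] += 1
        else:
            counts["errored"] += 1
    return counts


class InMemoryHarvester(object):
    """Hypothetical harvester that reads from a dict instead of a remote server."""

    def __init__(self, remote, local):
        self.remote = remote  # guid -> content, or None for a broken record
        self.local = local    # guid -> content already imported

    def gather_stage(self, job):
        # Simplification: return plain dicts rather than HarvestObject ids.
        return [{"guid": guid, "content": None} for guid in sorted(self.remote)]

    def fetch_stage(self, obj):
        content = self.remote[obj["guid"]]
        if content is None:
            return False
        if self.local.get(obj["guid"]) == content:
            return "unchanged"
        obj["content"] = content
        return True

    def import_stage(self, obj):
        self.local[obj["guid"]] = obj["content"]
        return True
```

A real implementation would also record HarvestGatherErrors and HarvestObjectErrors where this sketch simply returns False.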

sudo apt-get update
sudo apt-get install supervisor

ps aux | grep supervisord

root      9224  0.0  0.3  56420 12204 ?        Ss   15:52   0:00 /usr/bin/python /usr/bin/supervisord

; ===============================
; ckan harvester
; ===============================

[program:ckan_gather_consumer]

command=/usr/lib/ckan/default/bin/ckan --config=/etc/ckan/default/ckan.ini harvester gather-consumer

; user that owns virtual environment.
user=ckan

numprocs=1
stdout_logfile=/var/log/ckan/std/gather_consumer.log
stderr_logfile=/var/log/ckan/std/gather_consumer.log
autostart=true
autorestart=true
startsecs=10

[program:ckan_fetch_consumer]

command=/usr/lib/ckan/default/bin/ckan --config=/etc/ckan/default/ckan.ini harvester fetch-consumer

; user that owns virtual environment.
user=ckan

numprocs=1
stdout_logfile=/var/log/ckan/std/fetch_consumer.log
stderr_logfile=/var/log/ckan/std/fetch_consumer.log
autostart=true
autorestart=true
startsecs=10

sudo supervisorctl reread
sudo supervisorctl add ckan_gather_consumer
sudo supervisorctl add ckan_fetch_consumer
sudo supervisorctl start ckan_gather_consumer
sudo supervisorctl start ckan_fetch_consumer

sudo supervisorctl status

ckan_fetch_consumer              RUNNING    pid 6983, uptime 0:22:06
ckan_gather_consumer             RUNNING    pid 6968, uptime 0:22:45

sudo service supervisor start; sudo service supervisor stop

`socket.error: [Errno 111] Connection refused`
RabbitMQ is not running. Start it with:

  sudo service rabbitmq-server start

sudo crontab -e -u ckan

from ckan.plugins import toolkit


@toolkit.chained_action
def harvest_get_notifications_recipients(up_func, context, data_dict):
    """By default, the harvester plugin notifies only the admin users of
    the related organization about harvest jobs.
    Use this function to add custom recipients.

    Returns a list of dicts with name and email, like
        {'name': 'John', 'email': 'john@source.com'}
    """

    recipients = up_func(context, data_dict)
    new_recipients = []

    # your custom logic to add new_recipients here
    # new_recipients.append({'name': 'Harvester Admin', 'email': 'admin@harvester-team.com'})
    # recipients += new_recipients
    return recipients
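
When adding custom recipients in the chained action above, the same address may already be in the default list. The following is a minimal sketch of a merge step that skips duplicates. `merge_recipients` is a hypothetical helper, and comparing emails case-insensitively is an assumption.

```python
def merge_recipients(recipients, new_recipients):
    """Return recipients plus new_recipients, skipping duplicate emails.

    Hypothetical helper; emails are compared case-insensitively.
    """
    merged = list(recipients)
    seen = {r["email"].lower() for r in merged}
    for recipient in new_recipients:
        email = recipient["email"].lower()
        if email not in seen:
            seen.add(email)
            merged.append(recipient)
    return merged
```

Inside the chained action this could replace the `recipients += new_recipients` line, for example `recipients = merge_recipients(recipients, new_recipients)`.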

cd ckanext-harvest
pytest --ckan-ini=test.ini ckanext/harvest/tests

ckanext-harvest - Remote harvesting extension

Installation

Configuration

Database logger configuration (optional)

Dataset name generation configuration (optional)

Send error mails when harvesting fails (optional)

Set a timeout for a harvest job (optional)

Avoid overwriting certain fields (optional)

Command line interface

Authorization

The CKAN harvester

The harvesting interface

Running the harvest jobs

harvester run-test

harvester run

Setting up the harvesters on a production server

Extensible actions

Recipients on harvest jobs notifications

Tests

Harvest API

Releases

Community

Contributing

License