data-services

A place to add Data Services scripts from POs. Data services are scripts that process incoming data on a per-pipeline basis in the data ingestion pipelines.

Licensing

This project is licensed under the terms of the GNU GPLv3 license.

Folder structure

The suggested naming convention for the different POs' scripts, as agreed with the developers, is: [FACILITY_NAME]/[SUB-FACILITY_NAME]_[script_name]

Example: FAIMMS/faimms_data_rss_channels_process

Injected Environment Variables

During the deployment of data services (see chef recipe), various environment variables are made available to cronjobs. Scripts are not required to use them, but doing so makes scripts more relocatable and robust.

The environment variables are:

Name	Default	Purpose
$ARCHIVE_DIR	/mnt/ebs/archive	Archive
$ARCHIVE_IMOS_DIR	/mnt/ebs/archive	Archive
$INCOMING_DIR	/mnt/ebs/incoming	Incoming
$ERROR_DIR	/mnt/ebs/error	Dir. to store incoming files that cause pipeline errors
$WIP_DIR	/mnt/ebs/wip	Work In Progress tmp dir
$DATA_SERVICES_DIR	/mnt/ebs/data-services	Where this git repo is deployed
$DATA_SERVICES_TMP_DIR	/mnt/ebs/tmp	Temp dir for data services work (not on root partition like /tmp)
$EMAIL_ALIASES	/etc/incoming-aliases	List of configured aliases
$PYTHONPATH	$DATA_SERVICES_DIR/lib/python	Location of data-services python scripts/modules
$LOG_DIR	/mnt/ebs/log/data-services	Designated log dir
$HARVESTER_TRIGGER	sudo -u talend /mnt/ebs/talend/bin/talend-trigger -c /mnt/ebs/talend/etc/trigger.conf	Command to trigger talend
$S3CMD	s3cmd --config=/mnt/ebs/data-services/s3cfg	Default parameters for the s3cmd utility
$S3_BUCKET		Location of the S3 bucket for this environment
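
As a sketch of how these variables keep a script relocatable, the following stages a file in $WIP_DIR and moves it to $ERROR_DIR if processing fails. The function name process_incoming_file and its calling convention are hypothetical and not part of this repository; the fallback values are the defaults from the table above.

```shell
# Fall back to the documented defaults when the variables are not injected.
: "${WIP_DIR:=/mnt/ebs/wip}"
: "${ERROR_DIR:=/mnt/ebs/error}"

# Hypothetical helper (not part of this repo): stage a file in $WIP_DIR,
# run the given processing command on it, and move the file to $ERROR_DIR
# if that command fails.
# Usage: process_incoming_file FILE COMMAND [ARGS...]
process_incoming_file() {
    local src="$1"
    shift
    local work
    work="$WIP_DIR/$(basename "$src")"

    mkdir -p "$WIP_DIR" "$ERROR_DIR" || return 1
    mv "$src" "$work" || return 1

    # The staged path is appended as the last argument of the command.
    if "$@" "$work"; then
        return 0
    fi

    mv "$work" "$ERROR_DIR/"
    return 1
}
```

Because every path is derived from the environment, the same function works unchanged against a mocked directory layout.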

It may be necessary to source additional environment variables that are defined elsewhere. For example, the location of the schema definitions which are defined in the pipeline databags can be sourced from /etc/profile.d/pipeline.sh.
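
One way to source such a file defensively is sketched below. The helper name source_if_readable is an assumption made for illustration; only the path /etc/profile.d/pipeline.sh comes from this document.

```shell
# Hypothetical helper: source an environment file only if it is readable,
# so the script can decide what to do on hosts where the file is absent.
source_if_readable() {
    local env_file="$1"
    if [ -r "$env_file" ]; then
        . "$env_file"
    else
        echo "warning: $env_file not readable, skipping" >&2
        return 1
    fi
}

# Example call (path taken from the paragraph above):
#   source_if_readable /etc/profile.d/pipeline.sh || exit 1
```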

Mocking Environment

To mock your environment for testing, create a script (for example env.sh) with the following contents:

export ARCHIVE_DIR='/tmp/archive'
export INCOMING_DIR='/tmp/incoming'
export WIP_DIR='/tmp/wip'
export DATA_SERVICES_DIR="$PWD"
export LOG_DIR='/tmp/log'

# Quote the variables so paths containing spaces are handled correctly.
mkdir -p "$ARCHIVE_DIR" "$INCOMING_DIR" "$WIP_DIR" "$LOG_DIR"

Then to test your script with the mocked environment you can run:

$ (source env.sh && YOUR_SCRIPT.sh)
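
When a script runs under the mocked environment above, a fail-fast check can make a forgotten variable obvious. The following is a minimal sketch; require_env is a hypothetical helper and not part of this repository.

```shell
# Hypothetical helper: return non-zero if any named variable is missing
# from the environment, reporting each missing name on stderr.
require_env() {
    local name
    local missing=0
    for name in "$@"; do
        # printenv only sees exported variables, which is what a child
        # script launched after "source env.sh" would inherit.
        if ! printenv "$name" >/dev/null; then
            echo "error: $name is not exported" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A script could then start with a line such as `require_env ARCHIVE_DIR INCOMING_DIR WIP_DIR LOG_DIR || exit 1`.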

Configuration

Cronjobs

Cronjobs for data-services scripts are managed via chef databags under chef-private/data_bags/cronjobs.

Cronjobs are prefixed with po_ to differentiate them from other, non-pipeline-related tasks.

The cronjob must first source any necessary environment variables and then run your command or script, for example:

0 21 * * * projectofficer source /etc/profile && $DATA_SERVICES_DIR/yourscript.py
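
A script launched this way can use the injected $LOG_DIR to keep its own log. The sketch below assumes a hypothetical wrapper named run_logged and a one-file-per-job naming scheme; neither is defined by this repository.

```shell
# Hypothetical wrapper: run a command and append its output, with start
# and end markers, to a per-job log file under $LOG_DIR.
# Usage: run_logged JOB_NAME COMMAND [ARGS...]
run_logged() {
    local job_name="$1"
    shift
    # Fall back to the documented default if $LOG_DIR is not injected.
    local log_dir="${LOG_DIR:-/mnt/ebs/log/data-services}"
    local log_file="$log_dir/$job_name.log"
    local status=0

    mkdir -p "$log_dir" || return 1
    {
        echo "$(date '+%Y-%m-%dT%H:%M:%S') start $job_name"
        "$@" || status=$?
        echo "$(date '+%Y-%m-%dT%H:%M:%S') end $job_name (exit $status)"
    } >> "$log_file" 2>&1
    return "$status"
}
```

The wrapper returns the exit status of the wrapped command, so cron still sees a failure when the command fails.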

Example data bag (chef-private/data_bags/cronjobs/po_NRMN.json):

{
  "job_name": "po_NRMN",
  "shell": "/bin/bash",
  "minute": "0",
  "hour": "21",
  "user": "projectofficer",
  "command": "source /etc/profile; $DATA_SERVICES_DIR/NRMN/extract.sh",
  "mailto": "benedicte.pasquer@utas.edu.au",
  "monitored": true
}

The following attributes can be used:

Key	Type	Description	Default
['job_name']	String	The ID/name of the cronjob (mandatory)
['shell']	String	The shell to use for the script/command (mandatory)
['user']	String	User that will run the script/command (mandatory)
['command']	String	Command or script to be run (must be valid bash, and its path must be resolvable)
['mailto']	String	Recipient of the cronjob command output report	root@localhost
['monitored']	Boolean	Determines whether Nagios will monitor the job or not
['minute']	String	minute to run job on (see crontab syntax below)	*
['hour']	String	hour to run job on (see crontab syntax below)	*
['day']	String	day to run job on (see crontab syntax below)	*
['month']	String	month to run job on (see crontab syntax below)	*
['weekday']	String	weekday to run job on (see crontab syntax below)	*

Crontab syntax:

# m h  dom mon dow   command
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed
0 22  * * *  $username  script.path/script.sh
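
To illustrate how the data bag attributes line up with the crontab fields, including the `*` default for omitted schedule keys, here is a minimal sketch. The function cron_line is hypothetical; how the chef recipe actually renders the entry is not shown in this document.

```shell
# Hypothetical helper: print a crontab entry from schedule fields, user and
# command. Empty schedule fields fall back to "*", mirroring the defaults
# in the attribute table.
# Usage: cron_line MINUTE HOUR DAY MONTH WEEKDAY USER COMMAND
cron_line() {
    printf '%s %s %s %s %s %s %s\n' \
        "${1:-*}" "${2:-*}" "${3:-*}" "${4:-*}" "${5:-*}" "$6" "$7"
}

# Values from the po_NRMN data bag: minute and hour set, the rest omitted.
# Single quotes keep $DATA_SERVICES_DIR unexpanded, as in the data bag.
cron_line 0 21 '' '' '' projectofficer \
    'source /etc/profile; $DATA_SERVICES_DIR/NRMN/extract.sh'
```

For the po_NRMN values this prints `0 21 * * * projectofficer source /etc/profile; $DATA_SERVICES_DIR/NRMN/extract.sh`.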

Cronjobs are only installed once they are listed in the node attributes of the chef-managed node, for example:

  "cronjobs": [
    "po_NRMN",
    "po_someother_job",
    "..."
  ]
