Skip to content

Scripts which are used to process incoming data in the data ingestion pipeline

License

Notifications You must be signed in to change notification settings

B-Stepin/data-services

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

data-services

data-services

A place to add Data Services scripts from PO's. Data services are scripts which are used to process incoming data on a per pipeline basis in the data ingestion pipelines.

Licensing

This project is licensed under the terms of the GNU GPLv3 license.

Folder stucture

The suggested naming convention we agreed on with the developers, regarding the different PO's scripts was : [FACILITY_NAME]/[SUB-FACILITY_NAME]_[script_name]

example : FAIMMS/faimms_data_rss_channels_process

Injected Environment Variables

During the deployment of data services (see chef recipe), various environment variables are made available for cronjobs (they may or may not be used). Using them will result in more relocatable and robust scripts.

The environment variables are:

Name Default Purpose
$ARCHIVE_DIR /mnt/ebs/archive Archive
$ARCHIVE_IMOS_DIR /mnt/ebs/archive Archive
$INCOMING_DIR /mnt/ebs/incoming Incoming
$ERROR_DIR /mnt/ebs/error Dir. to store incoming files that cause pipeline errors
$WIP_DIR /mnt/ebs/wip Work In Progress tmp dir
$DATA_SERVICES_DIR /mnt/ebs/data-services Where this git repo is deployed
$DATA_SERVICES_TMP_DIR /mnt/ebs/tmp Temp dir for data services work (not on root partition like /tmp)
$EMAIL_ALIASES /etc/incoming-aliases List of configured aliases
$PYTHONPATH $DATA_SERVICES_DIR/lib/python Location of data-services python scripts/modules
$LOG_DIR /mnt/ebs/log/data-services Designated log dir
$HARVESTER_TRIGGER sudo -u talend /mnt/ebs/talend/bin/talend-trigger -c /mnt/ebs/talend/etc/trigger.conf Command to trigger talend
$S3CMD s3cmd --config=/mnt/ebs/data-services/s3cfg Default parameters for the s3cmd utility
$S3_BUCKET Location of the S3 bucket for this environment

It may be necessary to source additional environment variables that are defined elsewhere. For example, the location of the schema definitions which are defined in the pipeline databags can be sourced from /etc/profile.d/pipeline.sh.

Mocking Environment

In order to mock your environment so you can test things, you can have a script called env.sh for example with the contents of:

export ARCHIVE_DIR='/tmp/archive'
export INCOMING_DIR='/tmp/incoming'
export WIP_DIR='/tmp/wip'
export DATA_SERVICES_DIR="$PWD"
export LOG_DIR='/tmp/log'

mkdir -p $ARCHIVE_DIR $INCOMING_DIR $WIP_DIR $LOG_DIR

Then to test your script with the mocked environment you can run:

$ (source env.sh && YOUR_SCRIPT.sh)

Configuration

Cronjobs

Cronjobs for data-services scripts are managed via chef databags under chef-private/data_bags/cronjobs

Cronjobs are prefixed with po_ in order to differentiate them from other non pipeline-related tasks.

The cronjob must source any necessary environment variables first, followed by your command or script e.g.:

0 21 * * * projectofficer source /etc/profile && $DATA_SERVICES_DIR/yourscript.py

Example data_bag. chef-private/data_bags/cronjobs/po_NRMN.json

{
  "job_name": "po_NRMN",
  "shell": "/bin/bash",
  "minute": "0",
  "hour": "21",
  "user": "projectofficer",
  "command": "source /etc/profile; $DATA_SERVICES_DIR/NRMN/extract.sh",
  "mailto": "[email protected]",
  "monitored": true
}

The following attributes can be used:

Key Type Description Default
['job_name'] String The ID/name of the cronjob (mandatory)
['shell'] String The shell to use for the script/command (mandatory)
['user'] String User that will run the script/command (mandatory)
['command'] String Command or script to be run (must be valid bash and must be able to resolve path)
['mailto'] String User to send report of cronjob command output to root@localhost
['monitored'] Boolean Determines whether Nagios will monitor the job or not
['minute'] String minute to run job on (see crontab syntax below) *
['hour'] String hour to run job on (see crontab syntax below) *
['day'] String day to run job on (see crontab syntax below) *
['month'] String month to run job on (see crontab syntax below) *
['weekday'] String weekday to run job on (see crontab syntax below) *

Crontab syntax:

# m h  dom mon dow   command
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed
0 22  * * *  $username  script.path/script.sh

Your cronjobs need to be defined in the node attributes of the chef-managed node before they will be installed. e.g.:

  "cronjobs": [
    "po_NRMM",
    "po_someother_job",
    "..."
  ]

About

Scripts which are used to process incoming data in the data ingestion pipeline

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 84.7%
  • Rich Text Format 11.0%
  • R 1.8%
  • MATLAB 1.5%
  • Shell 0.5%
  • HTML 0.2%
  • Other 0.3%