Configuration
The package BatchJobs tries to find a configuration at three different possible locations:
1. the package installation directory,
2. your user's home directory, or
3. the working directory of your current R session.
For 2 and 3 the file must be called .BatchJobs.R.
The config file deployed with the package (1) is called BatchJobs_global_config.R and resides in the etc subfolder of your package installation directory.
Editing it allows a system administrator to set up a basic global configuration for all users.
If more than one configuration file is found, all are used, but settings in a more specific file override those made in a less specific one (3 is more specific than 2, which in turn is more specific than 1).
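The precedence rule can be pictured as successive list merges. The following is a minimal sketch in base R, not the package's actual loading code; the three lists and their values are hypothetical stand-ins for the settings found in each location.

```r
# Hypothetical settings as they might be found in the three locations.
global_conf  <- list(debug = FALSE, staged.queries = FALSE, mail.error = "none")
home_conf    <- list(staged.queries = TRUE, mail.error = "all")
working_conf <- list(mail.error = "first")

# Merge from least to most specific; later lists override earlier ones.
effective <- Reduce(modifyList, list(global_conf, home_conf, working_conf))

str(effective)
```

Settings that appear only in a less specific file (here debug) survive the merge, while mail.error ends up with the value from the working directory file.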
The default settings are meant for interactive usage (using makeClusterFunctionsInteractive) and do not send any status emails. This should allow you to try out BatchJobs locally without any prior configuration or setup. While the configuration file is a standard R file and can potentially contain any valid R code, we encourage you to only set the configuration variables. The configuration file(s) may include the following configuration variables (you may leave options out, in which case the respective defaults apply):
cluster.functions = makeClusterFunctionsInteractive()
mail.start = "first+last"
mail.done = "first+last"
mail.error = "all"
mail.from = "<[email protected]>"
mail.to = "<[email protected]>"
mail.control = list(smtpServer="my.mail.server.com")
staged.queries = FALSE
The variable cluster.functions determines your batch system. Please see Cluster-Functions and SSH-Cluster for possible implementations and detailed information on how to set them up.
mail.start, mail.done and mail.error control the sending of status mails when jobs start, terminate successfully, or terminate due to an exception.
They can be set to:
- 'none' = do not mail for any job
- 'all' = mail for all jobs
- 'first' = mail for first job
- 'last' = mail for last job
- 'first+last' = mail for first and last job
mail.from and mail.to are the sender and recipient addresses for status mails. The sender address does not necessarily have to exist. Enclose them in <> brackets as in the above example.
mail.control is a control structure for sendmailR.
Please consult your local system administrator to find a suitable local mail exchange which will handle all mail delivery.
The option staged.queries enables a mechanism where communication with the database is restricted to the master process.
If you are relying on a network file system (i.e. you are on a batch system or use the ad-hoc SSH-cluster) you should set this to TRUE to avoid file system locks.
Which, if any, resources must be specified for allocation is highly dependent on the cluster functions you use and the local setup of your cluster.
Please consult your local administrator to see which resources you must request. Having said that, the default.resources variable can be used to define default resource limits used for all jobs.
A possible configuration entry might look like the following:
default.resources = list(queue="my_queue", walltime=3600)
Additional resource specifications that you provide during job submission will override these default values. We encourage you to use conservative values as the defaults. This avoids mishaps where you waste computational resources because of a broken, long-running job.
Furthermore, you can set the option max.concurrent.jobs if your scheduler is configured with a hard per-user job limit.
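Put together, the resource-related part of a .BatchJobs.R file might look like the fragment below. This is only a sketch: the queue name, the walltime and the job limit of 100 are placeholder values, and which resource names are meaningful depends on your cluster functions and local setup.

```r
# .BatchJobs.R (fragment); all values are placeholders for illustration.

# Conservative defaults applied to every job unless overridden at submission.
default.resources = list(queue = "my_queue", walltime = 3600)

# Assumed per-user limit enforced by the scheduler.
max.concurrent.jobs = 100
```

With such defaults in place, a single long job could request more time at submission, for example via submitJobs(reg, resources = list(walltime = 7200)), assuming reg is an existing registry.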
After you have loaded the BatchJobs package in R, you can inspect the current configuration by calling the getConfig function. Here is the output after a fresh installation of BatchJobs with no configuration file:
> getConfig()
BatchJobs configuration:
cluster functions: Interactive
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources:
debug: FALSE
raise.warnings: FALSE
staged.queries: FALSE
max.concurrent.jobs: Inf
You may load a specific configuration file using the loadConfig function. It raises an error if the configuration file does not exist. Furthermore you may find setConfig useful.
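A short session using these three functions could look as follows. It is a sketch that assumes BatchJobs is installed; the file path is hypothetical, and passing named settings directly to setConfig reflects the function's documented interface as far as we are aware, so check ?setConfig for your installed version.

```r
library(BatchJobs)

# Load settings from an explicitly named file (hypothetical path);
# this raises an error if the file does not exist.
loadConfig("~/cluster-setup/.BatchJobs.R")

# Override a single setting for the current session only.
setConfig(debug = TRUE)

# Inspect the resulting configuration.
getConfig()
```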
If you run many short-lived jobs or your cluster is very large, the database may become a bottleneck.
In that case, you may wish to set staged.queries to TRUE.
This will stage all queries on the slaves to the file system for later execution by one of the head nodes.
If set, the slaves never write to the database.
Instead they write out all update queries to local files.
The next time the master queries the database for status information, it will automatically read the staged queries from the shared file system and execute them.
In this way the database is always in a synchronized state, at least from your (the user's) perspective.
The only disadvantage is that you will experience a small delay on the head node when you call a database querying function for the first time after jobs have been running for a while.
On the upside, there will be no contention for a write lock on the database by the slaves, ensuring fast execution.
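The idea behind staged queries can be illustrated with a toy model in base R. This is not how BatchJobs implements it internally; the data frame standing in for the database, the directory, and the helper names stage_update and sync_staged are all hypothetical.

```r
# Toy model of staged queries; illustration only, not BatchJobs internals.
stage_dir <- file.path(tempdir(), "staged-queries-demo")
unlink(stage_dir, recursive = TRUE)
dir.create(stage_dir)

# The "database" is a plain data frame that only the master modifies.
db <- data.frame(job.id = 1:3, status = "submitted", stringsAsFactors = FALSE)

# Worker side: write the intended update to a file instead of the database.
stage_update <- function(job.id, status) {
  path <- file.path(stage_dir, sprintf("update-%03d.rds", job.id))
  saveRDS(list(job.id = job.id, status = status), path)
}

# Master side: apply all staged updates, then remove the staged files.
sync_staged <- function(db) {
  for (path in sort(list.files(stage_dir, full.names = TRUE))) {
    upd <- readRDS(path)
    db$status[db$job.id == upd$job.id] <- upd$status
    file.remove(path)
  }
  db
}

stage_update(1, "done")
stage_update(3, "error")

# The master synchronizes before answering a status query.
db <- sync_staged(db)
print(db)
```

Because only the master ever touches the database, the workers never compete for a write lock; the cost is the extra synchronization step before the first status query.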
Version 1.3 of the package will include the option fs.timeout, which makes the package wait for created files up to fs.timeout seconds before throwing an exception.
If you experience disappearing jobs or get a lot of "file not found" errors, please try to set this option to at least 10 seconds.
Note that this feature is disabled by default (fs.timeout = NA).
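To make the behaviour concrete, the following base R sketch shows what waiting for a file with a timeout amounts to. The helper wait_for_file is hypothetical and written only for illustration; it is not the package's implementation.

```r
# Hypothetical helper: poll for a file until it appears or the timeout expires.
wait_for_file <- function(path, timeout = NA, sleep = 0.1) {
  if (is.na(timeout)) {
    # Timeout disabled: check once and fail right away if the file is missing.
    if (!file.exists(path)) stop("file not found: ", path)
    return(invisible(TRUE))
  }
  deadline <- Sys.time() + timeout
  while (!file.exists(path)) {
    if (Sys.time() > deadline) {
      stop("file not found after ", timeout, " seconds: ", path)
    }
    Sys.sleep(sleep)
  }
  invisible(TRUE)
}

# Demo with a file that already exists, so the call returns immediately.
result_file <- tempfile(fileext = ".RData")
file.create(result_file)
wait_for_file(result_file, timeout = 10)
```

On a network file system a freshly written file may become visible on other nodes only after a short delay, which is why a timeout of several seconds can make spurious "file not found" errors disappear.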
If you encounter problems on your batch system and you suspect this is due to a bug in how the package operates with the OS batch commands, you can set
debug = TRUE
in your configuration file and run a simple test. This will display all generated OS commands in R and their resulting output. Provide us with this output so we can fix the bug.
If your jobs fail and you only get a warning instead of an error from the code, you can set raise.warnings in your configuration file to TRUE, which will make sure that all warnings raised by code during job execution are treated as errors.
Technically this is equivalent to options(warn = 2) on the slave.
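The effect of options(warn = 2) can be tried locally in plain R, independent of BatchJobs. In the sketch below the function risky is a made-up example of job code that emits a warning.

```r
# Made-up job code that warns but otherwise completes.
risky <- function() {
  warning("suspicious input")
  "finished"
}

# Default behaviour: the warning does not stop the function.
res_default <- suppressWarnings(risky())

# With warn = 2 the same warning is converted into an error.
old <- options(warn = 2)
res_strict <- tryCatch(risky(), error = function(e) conditionMessage(e))
options(old)  # restore the previous setting

print(res_default)
print(res_strict)
```

In the second call the function never reaches its return value; the captured error message states that it was converted from a warning.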