425 add documentation for job failurepolicy #428

Merged
merged 4 commits into from
Dec 13, 2024
2 changes: 1 addition & 1 deletion examples/radix-example-oauth-proxy/api/Dockerfile
@@ -1,4 +1,4 @@
FROM node:20.8.0-alpine3.17
FROM node:20-alpine3.21

# Create app directory
WORKDIR /app
74 changes: 7 additions & 67 deletions public-site/docs/guides/jobs/configure-jobs.md
@@ -30,6 +30,12 @@ spec:
schedulerPort: 9000
timeLimitSeconds: 100
backoffLimit: 5
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [42]
notifications:
webhook: http://api:8080/monitor-batch-status
resources:
@@ -47,70 +53,4 @@ spec:
jobStatuses:
- Failed
batchStatus: Failed
```

## Options
Jobs and components share many of the same configuration options, with a few exceptions.

A job does not have `publicPort`, `ingressConfiguration`, `replicas`, `horizontalScaling` or `alwaysPullImageOnDeploy`:

- `publicPort` and `ingressConfiguration` control exposure of a component to the Internet. Jobs cannot be exposed to the Internet, so these options are not applicable.
- `replicas` and `horizontalScaling` control how many containers of a Docker image a component should run. A job always has exactly one replica.
- `alwaysPullImageOnDeploy` is used by Radix to restart components that use static Docker image tags, pulling the newest image if the SHA has changed. Jobs always check the SHA of the cached image against the SHA of the source image and pull the newest image when they differ.

Jobs have extra configuration options, including `schedulerPort`, `payload` and `timeLimitSeconds`:

- `schedulerPort` (required) defines the port of the job-scheduler's endpoint.
- `payload` (optional) defines the directory in the job container where the payload received by the job-scheduler is mounted.
- `resources` (optional) defines CPU and memory requested for a job.
- `node` (optional) defines the GPU node requested for a job.
- `timeLimitSeconds` (optional) defines the maximum running time for a job.
- `backoffLimit` (optional) defines the number of times a job will be restarted if its container exits in error.
- `notifications.webhook` (optional) the endpoint of a Radix application component or job component where Radix batch events are posted when any of the job component's running jobs or batches change state.
- `batchStatusRules` (optional) rules that define batch statuses based on the statuses of their jobs. See [batchStatusRules](/radix-config/index.md#batchstatusrules) for more information.

### schedulerPort

In the [`radixconfig.yaml`](/radix-config/index.md#schedulerport) example above, two jobs are defined: `compute` and `etl`.

`compute` has `schedulerPort` set to 8000, and Radix will create a job-scheduler service named compute that listens for HTTP requests on port 8000. The URL for the compute job-scheduler is `http://compute:8000`

The job-scheduler for the `etl` job listens for HTTP requests on port 9000, and the URL is `http://etl:9000`

### payload

Arguments required by a job are sent in the request body to the job-scheduler as a JSON document with an element named `payload`.
The content of the payload is then mounted in the job container as a file named `payload` in the directory specified in `payload.path` in [`radixconfig.yaml`](/radix-config/index.md#payload).
The data type of the `payload` value is string, and it can therefore contain any type of data (text, json, binary) as long as you encode it as a string, e.g. base64, when sending it to the job-scheduler, and decoding it when reading it from the mounted file inside the job container. The max size of the payload is 1MB.

The compute job in the example above has `payload.path` set to `/compute/args`. Any payload sent to the compute job-scheduler will be available inside the job container in the file `/compute/args/payload`.

### resources

The resource requirements for a job can be sent in the request body to the job manager as a JSON document with an element named `resources`.
The content of `resources` will be used to set the resource definition for the job, see [`radixconfig.yaml`](/radix-config/index.md#resources-common).
The `resources` element is of type `ResourceRequirements` and requires this specific format.

The etl job in the example above has `resources` configured.

[More details](/guides/resource-request/index.md) about `resources` and about [default resources](/guides/resource-request/index.md#default-resources).

### node

The node requirement for a job can be sent in the request body to the job manager as a JSON document with an element named `node`.
The content of `node` will be used to set the node definition for the job, see [`radixconfig.yaml`](/radix-config/index.md#node).
The `node` element is of type `RadixNode` and requires this specific format.

The etl job in the example above has `node` configured.

### timeLimitSeconds

The maximum running time for a job can be sent in the request body to the job manager as a JSON document with an element named `timeLimitSeconds`.

The etl job in the example above has `timeLimitSeconds` configured in its [`radixconfig.yaml`](/radix-config/index.md#timelimitseconds). If a new job is sent to the job manager without an element `timeLimitSeconds`, it will default to the value specified in radixconfig.yaml. If no value is specified in radixconfig.yaml, it will default to 43200 (12 hours).

### backoffLimit

The maximum number of restarts if the job fails can be sent in the request body to the job manager as a JSON document with an element named `backoffLimit`.

The etl job in the example above has `backoffLimit` configured in its [`radixconfig.yaml`](/radix-config/index.md#backofflimit). If a new job is sent to the job manager without an element `backoffLimit`, it will default to the value specified in radixconfig.yaml.
```
13 changes: 12 additions & 1 deletion public-site/docs/guides/jobs/job-manager-and-job-api.md
@@ -34,6 +34,17 @@ The Job Manager exposes the following methods for managing jobs:
"imageTagName": "1.0.0",
"timeLimitSeconds": 120,
"backoffLimit": 10,
"failurePolicy": {
"rules": [
{
"action": "FailJob",
"onExitCodes": {
"operator": "In",
"values": [42]
}
}
]
},
"resources": {
"limits": {
"memory": "32Mi",
@@ -51,7 +62,7 @@ The Job Manager exposes the following methods for managing jobs:
}
```

`payload`, `jobId`, `imageTagName`, `timeLimitSeconds`, `backoffLimit`, `resources` and `node` are all optional fields and any of them can be omitted in the request.
`payload`, `jobId`, `imageTagName`, `timeLimitSeconds`, `backoffLimit`, `failurePolicy`, `resources` and `node` are all optional fields and any of them can be omitted in the request.

The `imageTagName` field allows altering the image tag of a specific job. In order to use it, `{imageTagName}` needs to be set as described in [`radixconfig.yaml`](/radix-config/index.md#imagetagname).
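
For illustration, a request body in the same shape as the example above could be posted to the job-scheduler from another component in the application. This is a minimal Python sketch, assuming the job-creation endpoint is `POST /api/v1/jobs` on the compute job-scheduler at `http://compute:8000`; the payload value is hypothetical.

```python
import json
import urllib.request

# Hypothetical job creation request body, in the same shape as the example above.
# The failurePolicy marks the job as Failed as soon as the container exits with code 42.
job_request = {
    "payload": "{\"x\": 3, \"y\": 2}",
    "timeLimitSeconds": 120,
    "backoffLimit": 10,
    "failurePolicy": {
        "rules": [
            {
                "action": "FailJob",
                "onExitCodes": {"operator": "In", "values": [42]},
            }
        ]
    },
}

# The compute job-scheduler listens on the schedulerPort defined in radixconfig.yaml.
req = urllib.request.Request(
    "http://compute:8000/api/v1/jobs",
    data=json.dumps(job_request).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as response:
    print(json.loads(response.read()))  # e.g. the name and status of the created job
```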

76 changes: 71 additions & 5 deletions public-site/docs/radix-config/index.md
@@ -1257,6 +1257,8 @@ spec:

The port number that the [job-scheduler](/guides/jobs/job-manager-and-job-api.md) will listen to for HTTP requests to manage jobs. `schedulerPort` is a **required** field.

In the example above, the URL for the compute job-scheduler is `http://compute:8000`
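
For illustration, another component in the application can reach this job-scheduler over plain HTTP. A minimal Python sketch, assuming the Job Manager exposes a job-listing endpoint at `GET /api/v1/jobs` (see the [job manager guide](/guides/jobs/job-manager-and-job-api.md)); the response fields shown are illustrative.

```python
import json
import urllib.request

# List the jobs managed by the compute job-scheduler.
# The host name "compute" and port 8000 come from the job name and schedulerPort above.
with urllib.request.urlopen("http://compute:8000/api/v1/jobs") as response:
    jobs = json.loads(response.read())

for job in jobs:
    print(job.get("name"), job.get("status"))
```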

### `notifications`

```yaml
@@ -1349,7 +1351,9 @@ spec:
```

Job specific arguments must be sent in the request body to the [job-scheduler](/guides/jobs/job-manager-and-job-api.md) as a JSON document with an element named `payload` and a value of type string.
The content of the payload is then mounted into the job container as a file named `payload` in the directory specified in the `payload.path`.

The data type of the `payload` value is string, and it can therefore contain any type of data (text, JSON, binary) as long as you encode it as a string, e.g. base64, when sending it to the job-scheduler and decode it when reading it from the mounted file inside the job container. The content of the payload is then mounted into the job container as a file named `payload` in the directory specified in `payload.path`. The max size of the payload is 1MB.

In the example above, a payload sent to the job-scheduler will be mounted as file `/compute/args/payload`
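
Since the payload field is string-typed, binary or structured data must be encoded by the caller and decoded inside the job. Below is a minimal Python sketch of both sides, assuming the job-creation endpoint is `POST /api/v1/jobs` on `http://compute:8000` and the `payload.path` of `/compute/args` from the example above; the raw bytes are hypothetical.

```python
import base64
import json
import urllib.request

# Caller side: base64-encode arbitrary bytes so they fit in the string-typed payload field.
raw_args = b"\x00\x01binary-or-json-arguments"
body = {"payload": base64.b64encode(raw_args).decode("ascii")}

req = urllib.request.Request(
    "http://compute:8000/api/v1/jobs",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)

# Job container side: read the mounted payload file and decode it back to bytes.
with open("/compute/args/payload", "rb") as f:
    decoded_args = base64.b64decode(f.read())
```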

### `resources`
@@ -1421,9 +1425,7 @@ spec:
timeLimitSeconds: 120
```

The maximum number of seconds a job can run. If the job's running time exceeds the limit, it will be automatically stopped with status `Failed`. The default value is `43200` seconds, 12 hours.

`timeLimitSeconds` applies to the total duration of the job, and takes precedence over `backoffLimit`. Once `timeLimitSeconds` has been reached, the job will be stopped with status `Failed` even if `backoffLimit` has not been reached.
The maximum number of seconds a job can run, with a default value of `43200` seconds (12 hours). If the job's running time exceeds the limit, a SIGTERM signal is sent to allow the job to shut down gracefully within a 30-second time limit, after which it is forcefully terminated.
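
To make use of the grace period, the job code can trap SIGTERM and finish up before the 30 seconds run out. A minimal, hypothetical Python sketch of such a handler:

```python
import signal
import sys
import time

stopping = False

def handle_sigterm(signum, frame):
    # Radix sends SIGTERM when timeLimitSeconds is exceeded;
    # flag the main loop so it can flush state and exit cleanly.
    global stopping
    stopping = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not stopping:
    # Placeholder for one unit of the job's real work.
    time.sleep(1)

# Persist partial results, close connections, etc., within the 30-second grace period.
sys.exit(0)
```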

### `backoffLimit`

@@ -1436,6 +1438,43 @@ spec:

Defines the number of times a job will be restarted if its container exits in error. Once the `backoffLimit` has been reached the job will be marked as `Failed`. The default value is `0`.

### `failurePolicy`

```yaml
spec:
jobs:
- name: compute
backoffLimit: 5
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [1]
- action: Ignore
onExitCodes:
operator: In
values: [143]
environmentConfig:
- environment: prod
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [42]
```

`failurePolicy` defines how job container failures should be handled based on the exit code. When a job container exits with a non-zero exit code, the exit code is evaluated against the `rules` in the order they are defined. Once a rule matches the exit code, the remaining rules are ignored and the defined `action` is performed. When no rule matches the exit code, the default handling is applied.

Possible values for `action` are:
- `FailJob`: indicates that the job should be marked as `Failed`, even if [`backoffLimit`](#backofflimit) has not been reached.
- `Ignore`: indicates that the counter towards [`backoffLimit`](#backofflimit) should not be incremented.
- `Count`: indicates that the job should be handled the default way. The counter towards [`backoffLimit`](#backofflimit) is incremented.


`failurePolicy` can be configured on the job level, or in `environmentConfig` for a specific environment. Configuration in `environmentConfig` will override all rules defined on the job level.
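
A job can take advantage of these rules by mapping error conditions to distinct exit codes. The Python sketch below is hypothetical and matches the job-level rules in the example above, where exit code 1 triggers `FailJob`; the exception type and work function are placeholders.

```python
import sys

class InvalidInputError(Exception):
    """Raised when the job's input can never succeed, so retrying is pointless."""

def run_work() -> None:
    # Placeholder for the job's real work; replace with actual logic.
    raise InvalidInputError("missing required argument")

def main() -> int:
    try:
        run_work()
    except InvalidInputError:
        # Exit code 1 matches the FailJob rule above: the job is marked Failed
        # immediately, without consuming restarts from backoffLimit.
        return 1
    except Exception:
        # Any other failure exits with code 2, which no rule matches,
        # so the default handling applies and the backoffLimit counter is incremented.
        return 2
    return 0

if __name__ == "__main__":
    sys.exit(main())
```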

### `volumeMounts`

```yaml
Expand Down Expand Up @@ -1484,7 +1523,7 @@ spec:

See [notifications](#notifications) for a component for more information.

### `batchStatusRules`
#### `batchStatusRules`

```yaml
spec:
@@ -1624,6 +1663,33 @@ spec:

See [backoffLimit](#backofflimit) for more information.

#### `failurePolicy`

```yaml
spec:
jobs:
- name: compute
environmentConfig:
- environment: prod
backoffLimit: 5
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [42]
- action: Count
onExitCodes:
operator: In
values: [1, 2, 3]
- action: Ignore
onExitCodes:
operator: In
values: [143]
```

See [failurePolicy](#failurepolicy) for more information.

#### `readOnlyFileSystem`

```yaml