425 add documentation for job failurepolicy #428

Merged
merged 4 commits into from
Dec 13, 2024
2 changes: 1 addition & 1 deletion examples/radix-example-oauth-proxy/api/Dockerfile
@@ -1,4 +1,4 @@
FROM node:20.8.0-alpine3.17
FROM node:20-alpine3.21

# Create app directory
WORKDIR /app
74 changes: 7 additions & 67 deletions public-site/docs/guides/jobs/configure-jobs.md
@@ -30,6 +30,12 @@ spec:
schedulerPort: 9000
timeLimitSeconds: 100
backoffLimit: 5
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [42]
notifications:
webhook: http://api:8080/monitor-batch-status
resources:
@@ -47,70 +53,4 @@ spec:
jobStatuses:
- Failed
batchStatus: Failed
```

## Options
Jobs and components share many of the same configuration options, with a few exceptions.

A job does not have `publicPort`, `ingressConfiguration`, `replicas`, `horizontalScaling` or `alwaysPullImageOnDeploy`:

- `publicPort` and `ingressConfiguration` control exposure of a component to the Internet. Jobs cannot be exposed to the Internet, so these options are not applicable.
- `replicas` and `horizontalScaling` control how many containers of a Docker image a component should run. A job always has exactly one replica.
- `alwaysPullImageOnDeploy` is used by Radix to restart components that use static Docker image tags, pulling the newest image if the SHA has changed. Jobs always check the SHA of the cached image against the SHA of the source image and pull the newest image when they differ.

Jobs have extra configuration options, including `schedulerPort`, `payload` and `timeLimitSeconds`:

- `schedulerPort` (required) defines the port of the job-scheduler's endpoint.
- `payload` (optional) defines the directory in the job container where the payload received by the job-scheduler is mounted.
- `resources` (optional) defines CPU and memory requested for a job.
- `node` (optional) defines the GPU node requested for a job.
- `timeLimitSeconds` (optional) defines the maximum running time for a job.
- `backoffLimit` (optional) defines the number of times a job will be restarted if its container exits in error.
- `notifications.webhook` (optional) the endpoint of a Radix application component or job component where Radix batch events are posted when any of the job component's running jobs or batches change state.
- `batchStatusRules` (optional) rules that define batch statuses based on the statuses of their jobs. See [batchStatusRules](/radix-config/index.md#batchstatusrules) for more information.

### schedulerPort

In the [`radixconfig.yaml`](/radix-config/index.md#schedulerport) example above, two jobs are defined: `compute` and `etl`.

`compute` has `schedulerPort` set to 8000, and Radix will create a job-scheduler service named compute that listens for HTTP requests on port 8000. The URL for the compute job-scheduler is `http://compute:8000`

The job-scheduler for the `etl` job listens for HTTP requests on port 9000, and the URL is `http://etl:9000`

### payload

Arguments required by a job are sent in the request body to the job-scheduler as a JSON document with an element named `payload`.
The content of the payload is then mounted in the job container as a file named `payload` in the directory specified in `payload.path` in [`radixconfig.yaml`](/radix-config/index.md#payload).
The data type of the `payload` value is string, and it can therefore contain any type of data (text, json, binary) as long as you encode it as a string, e.g. base64, when sending it to the job-scheduler, and decoding it when reading it from the mounted file inside the job container. The max size of the payload is 1MB.

The compute job in the example above has `payload.path` set to `/compute/args`. Any payload sent to the compute job-scheduler will be available inside the job container in the file `/compute/args/payload`.

### resources

The resource requirements for a job can be sent in the request body to the job manager as a JSON document with an element named `resources`.
The content of `resources` will be used to set the resource definition for the job, see [`radixconfig.yaml`](/radix-config/index.md#resources-common).
The `resources` element is of type `ResourceRequirements` and requires this specific format.

The etl job in the example above has `resources` configured.

[More details](/guides/resource-request/index.md) about `resources` and about [default resources](/guides/resource-request/index.md#default-resources).

### node

The node requirement for a job can be sent in the request body to the job manager as a JSON document with an element named `node`.
The content of `node` will be used to set the node definition for the job, see [`radixconfig.yaml`](/radix-config/index.md#node).
The `node` element is of type `RadixNode` and requires this specific format.

The etl job in the example above has `node` configured.

### timeLimitSeconds

The maximum running time for a job can be sent in the request body to the job manager as a JSON document with an element named `timeLimitSeconds`.

The etl job in the example above has `timeLimitSeconds` configured in its [`radixconfig.yaml`](/radix-config/index.md#timelimitseconds). If a new job is sent to the job manager without an element `timeLimitSeconds`, it will default to the value specified in radixconfig.yaml. If no value is specified in radixconfig.yaml, it will default to 43200 (12 hours).

### backoffLimit

The maximum number of restarts if the job fails can be sent in the request body to the job manager as a JSON document with an element named `backoffLimit`.

The etl job in the example above has `backoffLimit` configured in its [`radixconfig.yaml`](/radix-config/index.md#backofflimit). If a new job is sent to the job manager without an element `backoffLimit`, it will default to the value specified in radixconfig.yaml.
```
13 changes: 12 additions & 1 deletion public-site/docs/guides/jobs/job-manager-and-job-api.md
@@ -34,6 +34,17 @@ The Job Manager exposes the following methods for managing jobs:
"imageTagName": "1.0.0",
"timeLimitSeconds": 120,
"backoffLimit": 10,
"failurePolicy": {
"rules": [
{
"action": "FailJob",
"onExitCodes": {
"operator": "In",
"values": [42]
}
}
]
},
"resources": {
"limits": {
"memory": "32Mi",
@@ -51,7 +62,7 @@ The Job Manager exposes the following methods for managing jobs:
}
```

`payload`, `jobId`, `imageTagName`, `timeLimitSeconds`, `backoffLimit`, `resources` and `node` are all optional fields and any of them can be omitted in the request.
`payload`, `jobId`, `imageTagName`, `timeLimitSeconds`, `backoffLimit`, `failurePolicy`, `resources` and `node` are all optional fields and any of them can be omitted in the request.

The `imageTagName` field allows altering the image tag of a specific job. In order to use it, `{imageTagName}` needs to be set as described in [`radixconfig.yaml`](/radix-config/index.md#imagetagname).
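
For illustration, a request body in the same shape as the example above could be posted to the job-scheduler from another component in the application. This is a minimal Python sketch, assuming the job-creation endpoint is `POST /api/v1/jobs` on the compute job-scheduler at `http://compute:8000`; the payload value is hypothetical.

```python
import json
import urllib.request

# Hypothetical job creation request body, in the same shape as the example above.
# The failurePolicy marks the job as Failed as soon as the container exits with code 42.
job_request = {
    "payload": "{\"x\": 3, \"y\": 2}",
    "timeLimitSeconds": 120,
    "backoffLimit": 10,
    "failurePolicy": {
        "rules": [
            {
                "action": "FailJob",
                "onExitCodes": {"operator": "In", "values": [42]},
            }
        ]
    },
}

# The compute job-scheduler listens on the schedulerPort defined in radixconfig.yaml.
req = urllib.request.Request(
    "http://compute:8000/api/v1/jobs",
    data=json.dumps(job_request).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as response:
    print(json.loads(response.read()))  # e.g. the name and status of the created job
```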

76 changes: 71 additions & 5 deletions public-site/docs/radix-config/index.md
@@ -1257,6 +1257,8 @@ spec:

The port number that the [job-scheduler](/guides/jobs/job-manager-and-job-api.md) will listen to for HTTP requests to manage jobs. `schedulerPort` is a **required** field.

In the example above, the URL for the compute job-scheduler is `http://compute:8000`
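
For illustration, another component in the application can reach this job-scheduler over plain HTTP. A minimal Python sketch, assuming the Job Manager exposes a job-listing endpoint at `GET /api/v1/jobs` (see the [job manager guide](/guides/jobs/job-manager-and-job-api.md)); the response fields shown are illustrative.

```python
import json
import urllib.request

# List the jobs managed by the compute job-scheduler.
# The host name "compute" and port 8000 come from the job name and schedulerPort above.
with urllib.request.urlopen("http://compute:8000/api/v1/jobs") as response:
    jobs = json.loads(response.read())

for job in jobs:
    print(job.get("name"), job.get("status"))
```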

### `notifications`

```yaml
@@ -1349,7 +1351,9 @@ spec:
```

Job specific arguments must be sent in the request body to the [job-scheduler](/guides/jobs/job-manager-and-job-api.md) as a JSON document with an element named `payload` and a value of type string.
The content of the payload is then mounted into the job container as a file named `payload` in the directory specified in the `payload.path`.

The data type of the `payload` value is string, and it can therefore contain any type of data (text, JSON, binary) as long as you encode it as a string, e.g. base64, when sending it to the job-scheduler and decode it when reading it from the mounted file inside the job container. The content of the payload is then mounted into the job container as a file named `payload` in the directory specified in `payload.path`. The max size of the payload is 1MB.

In the example above, a payload sent to the job-scheduler will be mounted as file `/compute/args/payload`
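
Since the payload field is string-typed, binary or structured data must be encoded by the caller and decoded inside the job. Below is a minimal Python sketch of both sides, assuming the job-creation endpoint is `POST /api/v1/jobs` on `http://compute:8000` and the `payload.path` of `/compute/args` from the example above; the raw bytes are hypothetical.

```python
import base64
import json
import urllib.request

# Caller side: base64-encode arbitrary bytes so they fit in the string-typed payload field.
raw_args = b"\x00\x01binary-or-json-arguments"
body = {"payload": base64.b64encode(raw_args).decode("ascii")}

req = urllib.request.Request(
    "http://compute:8000/api/v1/jobs",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)

# Job container side: read the mounted payload file and decode it back to bytes.
with open("/compute/args/payload", "rb") as f:
    decoded_args = base64.b64decode(f.read())
```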

### `resources`
@@ -1421,9 +1425,7 @@ spec:
timeLimitSeconds: 120
```

The maximum number of seconds a job can run. If the job's running time exceeds the limit, it will be automatically stopped with status `Failed`. The default value is `43200` seconds, 12 hours.

`timeLimitSeconds` applies to the total duration of the job, and takes precedence over `backoffLimit`. Once `timeLimitSeconds` has been reached, the job will be stopped with status `Failed` even if `backoffLimit` has not been reached.
The maximum number of seconds a job can run, with a default value of `43200` seconds (12 hours). If the job's running time exceeds the limit, a SIGTERM signal is sent to allow the job to shut down gracefully within a 30-second time limit, after which it is forcefully terminated.
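
To make use of the grace period, the job code can trap SIGTERM and finish up before the 30 seconds run out. A minimal, hypothetical Python sketch of such a handler:

```python
import signal
import sys
import time

stopping = False

def handle_sigterm(signum, frame):
    # Radix sends SIGTERM when timeLimitSeconds is exceeded;
    # flag the main loop so it can flush state and exit cleanly.
    global stopping
    stopping = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not stopping:
    # Placeholder for one unit of the job's real work.
    time.sleep(1)

# Persist partial results, close connections, etc., within the 30-second grace period.
sys.exit(0)
```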

### `backoffLimit`

@@ -1436,6 +1438,43 @@ spec:

Defines the number of times a job will be restarted if its container exits in error. Once the `backoffLimit` has been reached the job will be marked as `Failed`. The default value is `0`.

### `failurePolicy`

```yaml
spec:
jobs:
- name: compute
backoffLimit: 5
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [1]
- action: Ignore
onExitCodes:
operator: In
values: [143]
environmentConfig:
- environment: prod
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [42]
```

`failurePolicy` defines how job container failures should be handled based on the exit code. When a job container exits with a non-zero exit code, the exit code is evaluated against the `rules` in the order they are defined. Once a rule matches the exit code, the remaining rules are ignored and the defined `action` is performed. When no rule matches the exit code, the default handling is applied.

Possible values for `action` are:
- `FailJob`: indicates that the job should be marked as `Failed`, even if [`backoffLimit`](#backofflimit) has not been reached.
- `Ignore`: indicates that the counter towards [`backoffLimit`](#backofflimit) should not be incremented.
- `Count`: indicates that the job should be handled the default way. The counter towards [`backoffLimit`](#backofflimit) is incremented.


`failurePolicy` can be configured on the job level, or in `environmentConfig` for a specific environment. Configuration in `environmentConfig` will override all rules defined on the job level.
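
A job can take advantage of these rules by mapping error conditions to distinct exit codes. The Python sketch below is hypothetical and matches the job-level rules in the example above, where exit code 1 triggers `FailJob`; the exception type and work function are placeholders.

```python
import sys

class InvalidInputError(Exception):
    """Raised when the job's input can never succeed, so retrying is pointless."""

def run_work() -> None:
    # Placeholder for the job's real work; replace with actual logic.
    raise InvalidInputError("missing required argument")

def main() -> int:
    try:
        run_work()
    except InvalidInputError:
        # Exit code 1 matches the FailJob rule above: the job is marked Failed
        # immediately, without consuming restarts from backoffLimit.
        return 1
    except Exception:
        # Any other failure exits with code 2, which no rule matches,
        # so the default handling applies and the backoffLimit counter is incremented.
        return 2
    return 0

if __name__ == "__main__":
    sys.exit(main())
```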

### `volumeMounts`

```yaml
Expand Down Expand Up @@ -1484,7 +1523,7 @@ spec:

See [notifications](#notifications) for a component for more information.

### `batchStatusRules`
#### `batchStatusRules`

```yaml
spec:
@@ -1624,6 +1663,33 @@ spec:

See [backoffLimit](#backofflimit) for more information.

#### `failurePolicy`

```yaml
spec:
jobs:
- name: compute
environmentConfig:
- environment: prod
backoffLimit: 5
failurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: In
values: [42]
- action: Count
onExitCodes:
operator: In
values: [1, 2, 3]
- action: Ignore
onExitCodes:
operator: In
values: [143]
```

See [failurePolicy](#failurepolicy) for more information.

#### `readOnlyFileSystem`

```yaml