Job Controller

Abstract

This proposal describes a new controller, the Job controller, which will be responsible for managing pod(s) that must run once to completion, even if the machine the pod is running on fails. This contrasts with what ReplicationController currently offers.

Several issues and PRs already cover this subject:

Job Controller #1624
New Job resource #7380

Use Cases

Be able to start one or several pods tracked as a single entity.
Be able to run batch-oriented workloads on Kubernetes.
Be able to get the job status.
Be able to specify the number of instances performing a job at any one time.
Be able to specify the number of successfully finished instances required to finish a job.

Motivation

Jobs are needed to execute multi-pod computations to completion; a good example is any type of batch-oriented task.

Implementation

The Job controller is similar to the replication controller in that both manage pods. It will therefore follow the same controller framework that replication controllers already define. The biggest difference between a Job and a ReplicationController object is purpose: a ReplicationController ensures that a specified number of Pods are running at any one time, whereas a Job keeps the desired number of Pods running until a task completes. This difference is represented by the RestartPolicy, which is required to always take the value RestartPolicyNever or RestartPolicyOnFailure.
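
The restart policy constraint above can be sketched as a small validation step. This is only an illustration, not the controller's actual validation code: the `RestartPolicy` type is redeclared locally so the example is self-contained, and `validateJobRestartPolicy` is a hypothetical helper name.

```go
package main

import "fmt"

// RestartPolicy is a local stand-in for the pod-level restart policy type.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// validateJobRestartPolicy (hypothetical helper) rejects any policy that
// would keep restarting a pod after it has finished successfully.
func validateJobRestartPolicy(p RestartPolicy) error {
	switch p {
	case RestartPolicyNever, RestartPolicyOnFailure:
		return nil
	default:
		return fmt.Errorf("unsupported restart policy %q for a job", p)
	}
}

func main() {
	policies := []RestartPolicy{RestartPolicyNever, RestartPolicyOnFailure, RestartPolicyAlways}
	for _, p := range policies {
		fmt.Printf("%s: %v\n", p, validateJobRestartPolicy(p))
	}
}
```

Rejecting `Always` reflects the difference in purpose: a pod that is restarted after a successful exit can never count as a completion.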

The new Job object will have the following content:

// Job represents the configuration of a single job.
type Job struct {
    TypeMeta
    ObjectMeta

    // Spec is a structure defining the expected behavior of a job.
    Spec JobSpec

    // Status is a structure describing current status of a job.
    Status JobStatus
}

// JobList is a collection of jobs.
type JobList struct {
    TypeMeta
    ListMeta

    Items []Job
}

The JobSpec structure contains all the information describing what the actual job execution will look like.

// JobSpec describes what the job execution will look like.
type JobSpec struct {

    // Parallelism specifies the maximum desired number of pods the job should
    // run at any given time. The actual number of pods running in steady state will
    // be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
    // i.e. when the work left to do is less than max parallelism.
    Parallelism *int

    // Completions specifies the desired number of successfully finished pods the
    // job should be run with. Defaults to 1.
    Completions *int

    // Selector is a label query over pods running a job.
    Selector map[string]string

    // Template is the object that describes the pod that will be created when
    // executing a job.
    Template *PodTemplateSpec
}
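
The relationship between Parallelism, Completions and the number of successful pods described in the comment above can be sketched as follows. This is a minimal illustration under the assumption that both fields have already been defaulted to concrete integers; `desiredActive` is a hypothetical helper name, not part of the proposed API.

```go
package main

import "fmt"

// desiredActive (hypothetical helper) returns how many pods should be running
// in steady state: the smaller of the parallelism limit and the work left,
// where the work left is completions minus already successful pods.
func desiredActive(parallelism, completions, successful int) int {
	remaining := completions - successful
	if remaining < 0 {
		remaining = 0 // more successes than required: nothing left to run
	}
	if remaining < parallelism {
		return remaining
	}
	return parallelism
}

func main() {
	fmt.Println(desiredActive(3, 5, 0)) // prints 3: plenty of work left, capped by parallelism
	fmt.Println(desiredActive(3, 5, 4)) // prints 1: only one completion still needed
	fmt.Println(desiredActive(3, 5, 5)) // prints 0: the job is done
}
```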

The JobStatus structure holds information about the pods executing the specified job.

// JobStatus represents the current state of a Job.
type JobStatus struct {
    Conditions []JobCondition

    // CreationTime represents time when the job was created
    CreationTime unversioned.Time

    // StartTime represents time when the job was started
    StartTime unversioned.Time

    // CompletionTime represents time when the job was completed
    CompletionTime unversioned.Time

    // Active is the number of actively running pods.
    Active int

    // Successful is the number of pods that successfully completed their job.
    Successful int

    // Unsuccessful is the number of pod failures; this applies only to jobs
    // created with RestartPolicyNever, otherwise this value will always be 0.
    Unsuccessful int
}

type JobConditionType string

// These are valid conditions of a job.
const (
    // JobComplete means the job has completed its execution.
    JobComplete JobConditionType = "Complete"
)

// JobCondition describes current state of a job.
type JobCondition struct {
    Type               JobConditionType
    Status             ConditionStatus
    LastHeartbeatTime  unversioned.Time
    LastTransitionTime unversioned.Time
    Reason             string
    Message            string
}
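
A client could use the Conditions list to decide whether a job has finished. The sketch below is an assumption about usage, not something the proposal prescribes: the types are trimmed local copies of the ones above, `ConditionTrue` is assumed to carry the usual "True" value, and `isJobComplete` is a hypothetical helper name.

```go
package main

import "fmt"

type JobConditionType string
type ConditionStatus string

const (
	JobComplete    JobConditionType = "Complete"
	ConditionTrue  ConditionStatus  = "True"
	ConditionFalse ConditionStatus  = "False"
)

// JobCondition is a trimmed copy of the proposed type; the time, reason and
// message fields are omitted because this example does not need them.
type JobCondition struct {
	Type   JobConditionType
	Status ConditionStatus
}

// isJobComplete (hypothetical helper) reports whether any condition marks
// the job as complete with status True.
func isJobComplete(conditions []JobCondition) bool {
	for _, c := range conditions {
		if c.Type == JobComplete && c.Status == ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	running := []JobCondition{{Type: JobComplete, Status: ConditionFalse}}
	done := []JobCondition{{Type: JobComplete, Status: ConditionTrue}}
	fmt.Println(isJobComplete(running), isJobComplete(done))
}
```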

Events

The Job controller will emit the following events:

JobStart
JobFinish

Future evolution

Below are the possible future extensions to the Job controller:

Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods (now implemented).
Be able to create a chain of jobs dependent on one another (will be implemented in a separate type called Workflow).
Be able to specify the work each of the workers should execute (see type 1 from this comment).
Be able to inspect Pods running a Job, especially after a Job has finished, e.g. by providing pointers to Pods in the JobStatus (see comment).
Help users avoid non-unique label selectors (see this proposal).
