This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Commit

Support PodGracefulDeletionTimeoutSec to tune Framework Consistency vs Availability (#43)
yqwang-ms committed Sep 19, 2019
1 parent 4237316 commit 77ec4ab
Showing 8 changed files with 314 additions and 145 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -41,13 +41,12 @@ A Framework represents an application with a set of Tasks:
4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
5. With fine grained [RetryPolicy](doc/user-manual.md#RetryPolicy) for each Task and the whole Framework
6. With fine grained [FrameworkAttemptCompletionPolicy](doc/user-manual.md#FrameworkAttemptCompletionPolicy) for each TaskRole
7. Guarantees at most one instance of a specific Task is running at any point in time
8. Guarantees at most one instance of a specific Framework is running at any point in time
7. With PodGracefulDeletionTimeoutSec for each Task to [tune Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability)

### Controller Feature
1. Highly generalized as it is built for all kinds of applications
2. Light-weight as it is only responsible for Pod orchestration
3. Well-defined Framework consistency, state machine and failure model
3. Well-defined Framework [Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability), [State Machine](doc/user-manual.md#FrameworkTaskStateMachine) and [Failure Model](doc/user-manual.md#CompletionStatus)
4. Tolerate Pod/ConfigMap unexpected deletion, Node/Network/FrameworkController/Kubernetes failure
5. Support specifying how to [classify and summarize Pod failures](doc/user-manual.md#PodFailureClassification)
6. Support exposing [Framework and Pod history snapshots](doc/user-manual.md#FrameworkPodHistory) to external systems
86 changes: 85 additions & 1 deletion doc/user-manual.md
@@ -9,6 +9,8 @@
- [RetryPolicy](#RetryPolicy)
- [FrameworkAttemptCompletionPolicy](#FrameworkAttemptCompletionPolicy)
- [Framework and Pod History](#FrameworkPodHistory)
- [Framework and Task State Machine](#FrameworkTaskStateMachine)
- [Framework Consistency vs Availability](#FrameworkConsistencyAvailability)
- [Controller Extension](#ControllerExtension)
- [FrameworkBarrier](#FrameworkBarrier)
- [HivedScheduler](#HivedScheduler)
@@ -116,7 +118,8 @@ Type: application/json or application/yaml
Delete the specified Framework.

Notes:
* If you need to ensure at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always and only use [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body; see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). However, `kubectl delete` does not support specifying Foreground Deletion, at least in [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use another [Supported Client](#SupportedClient).
* If you need to achieve all the [Framework ConsistencyGuarantees](#ConsistencyGuarantees) or achieve higher [Framework Availability](#FrameworkAvailability) by leveraging [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go), you should always and only use [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body.
* However, `kubectl delete` does not support specifying Foreground Deletion, at least in [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use another [Supported Client](#SupportedClient).
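
A minimal sketch of a request body that asks for Foreground Deletion, assuming the standard Kubernetes `DeleteOptions` JSON shape (`kind`, `apiVersion`, `propagationPolicy`). The `deleteOptions` struct and the `foregroundDeleteBody` function are hypothetical names used only for illustration. A real client would more likely use the typed `DeleteOptions` from its Kubernetes client library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteOptions mirrors only the DeleteOptions fields this sketch needs.
// It is a local, illustrative type, not part of FrameworkController.
type deleteOptions struct {
	Kind              string `json:"kind"`
	APIVersion        string `json:"apiVersion"`
	PropagationPolicy string `json:"propagationPolicy"`
}

// foregroundDeleteBody returns the JSON body to send with the DELETE request,
// so that dependents are deleted before the Framework object itself.
func foregroundDeleteBody() ([]byte, error) {
	return json.Marshal(deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: "Foreground",
	})
}

func main() {
	body, err := foregroundDeleteBody()
	if err != nil {
		panic(err)
	}
	// Prints: {"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}
	fmt.Println(string(body))
}
```

The resulting bytes would be sent as the body of the delete request, with a JSON content type, by whichever Supported Client you use.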

**Response**

@@ -370,6 +373,87 @@ Notes:
## <a name="FrameworkPodHistory">Framework and Pod History</a>
By leveraging [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if the Framework or Pod was retried or deleted, for purposes such as persistence, metrics conversion, visualization, alerting, acting, analysis, etc.

## <a name="FrameworkTaskStateMachine">Framework and Task State Machine</a>
### <a name="FrameworkStateMachine">Framework State Machine</a>
[FrameworkState](../pkg/apis/frameworkcontroller/v1/types.go)

### <a name="TaskStateMachine">Task State Machine</a>
[TaskState](../pkg/apis/frameworkcontroller/v1/types.go)

## <a name="FrameworkConsistencyAvailability">Framework Consistency vs Availability</a>
### <a name="FrameworkConsistency">Framework Consistency</a>
#### <a name="ConsistencyGuarantees">ConsistencyGuarantees</a>
For a specific Task identified by {FrameworkName}-{TaskRoleName}-{TaskIndex}:

- **ConsistencyGuarantee1**:

At most one instance of the Task is running at any point in time.

- **ConsistencyGuarantee2**:

No instance of the Task is running if it is TaskAttemptCompleted, TaskCompleted or the whole Framework is deleted.

For a specific Framework identified by {FrameworkName}:

- **ConsistencyGuarantee3**:

At most one instance of the Framework is running at any point in time.

- **ConsistencyGuarantee4**:

No instance of the Framework is running if it is FrameworkAttemptCompleted, FrameworkCompleted or the whole Framework is deleted.

#### <a name="ConsistencyGuaranteesHowTo">How to achieve ConsistencyGuarantees</a>

By default, all the [ConsistencyGuarantees](#ConsistencyGuarantees) are achieved, as long as you do not explicitly violate the guidelines below:

1. Achieve **ConsistencyGuarantee1**:

Do not [force delete the managed Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods):

1. Do not set [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go) to a non-nil value.

For example, the default PodGracefulDeletionTimeoutSec is acceptable.

2. Do not delete the managed Pod with [0 GracePeriodSeconds](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods).

For example, the default Pod deletion is acceptable.

3. Do not delete the Node which runs the managed Pod.

For example, [draining the Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node) before deleting it is acceptable.

*The Task instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*

*To avoid the Pod being stuck in deletion forever, for example if its Node is down forever, use the same approach as [Delete StatefulSet Pod only after the Pod termination has been confirmed](https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods), either manually or through your [Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager).*

2. Achieve **ConsistencyGuarantee2**, **ConsistencyGuarantee3** and **ConsistencyGuarantee4**:
1. Achieve **ConsistencyGuarantee1**.

2. Must delete the managed ConfigMap with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion).

For example, the default ConfigMap deletion is acceptable.

3. Must delete the Framework with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion).

For example, the default Framework deletion may not be acceptable, since the default PropagationPolicy for Framework object may be Background.

4. Do not change the [OwnerReferences](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents) of the managed ConfigMap and Pods.

*The Framework instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).*
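
The deletion guidelines above can be sketched as a small check, assuming a deletion is described only by the object kind, an optional grace period and a propagation policy. The `deleteRequest` type and the `guidelineViolations` function are hypothetical and are not part of FrameworkController. The sketch covers only the guidelines about deletion parameters, not Node deletion or OwnerReferences changes.

```go
package main

import "fmt"

// deleteRequest is a hypothetical description of one deletion call.
type deleteRequest struct {
	Kind               string // "Pod", "ConfigMap" or "Framework"
	GracePeriodSeconds *int64 // nil means the default grace period
	PropagationPolicy  string // "" means the default policy for that object
}

// guidelineViolations lists which deletion guidelines the request breaks.
func guidelineViolations(r deleteRequest) []string {
	var out []string
	switch r.Kind {
	case "Pod":
		// A 0 grace period is a force deletion of the managed Pod.
		if r.GracePeriodSeconds != nil && *r.GracePeriodSeconds == 0 {
			out = append(out, "Pod deleted with 0 GracePeriodSeconds")
		}
	case "ConfigMap":
		// The default ConfigMap deletion is stated to be acceptable,
		// so an unset policy is allowed here.
		if r.PropagationPolicy != "" && r.PropagationPolicy != "Foreground" {
			out = append(out, "ConfigMap not deleted with Foreground PropagationPolicy")
		}
	case "Framework":
		// The default Framework deletion may be Background, so the
		// policy has to be set to Foreground explicitly.
		if r.PropagationPolicy != "Foreground" {
			out = append(out, "Framework not deleted with Foreground PropagationPolicy")
		}
	}
	return out
}

func main() {
	zero := int64(0)
	requests := []deleteRequest{
		{Kind: "Pod", GracePeriodSeconds: &zero},
		{Kind: "ConfigMap"},
		{Kind: "Framework"},
		{Kind: "Framework", PropagationPolicy: "Foreground"},
	}
	for _, r := range requests {
		fmt.Printf("%s (policy %q): %v\n", r.Kind, r.PropagationPolicy, guidelineViolations(r))
	}
}
```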

### <a name="FrameworkAvailability">Framework Availability</a>
According to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the [Framework Consistency](#FrameworkConsistency) and the [Framework Availability](#FrameworkAvailability).

You can tune the trade-off, for example to achieve higher [Framework Availability](#FrameworkAvailability) by sacrificing [Framework Consistency](#FrameworkConsistency):
1. Set a small [Pod TolerationSeconds for TaintBasedEvictions](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions)
2. Set a small [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go)
3. Violate other guidelines mentioned in [How to achieve ConsistencyGuarantees](#ConsistencyGuaranteesHowTo), such as manually force deleting a problematic Pod.

See more in:
1. [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go)
2. [Pod Safety and Consistency Guarantees](https://github.com/kubernetes/community/blob/ee8998b156031f6b363daade51ca2d12521f4ac0/contributors/design-proposals/storage/pod-safety.md)

## <a name="ControllerExtension">Controller Extension</a>
### <a name="FrameworkBarrier">FrameworkBarrier</a>
1. [Usage](../pkg/barrier/barrier.go)
57 changes: 29 additions & 28 deletions pkg/apis/frameworkcontroller/v1/types.go
@@ -47,8 +47,7 @@ type FrameworkList struct {
// 4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
// 5. With fine grained RetryPolicy for each Task and the whole Framework
// 6. With fine grained FrameworkAttemptCompletionPolicy for each TaskRole
// 7. Guarantees at most one instance of a specific Task is running at any point in time
// 8. Guarantees at most one instance of a specific Framework is running at any point in time
// 7. With PodGracefulDeletionTimeoutSec for each Task to tune Consistency vs Availability
//
// Notes:
// 1. Status field should only be modified by FrameworkController, and
@@ -57,26 +56,6 @@ type FrameworkList struct {
// Leverage CRD status subresource to isolate Status field modification with other fields.
// This can help to avoid unintended modification, such as users may unintendedly modify
// the status when updating the spec.
// 2. To ensure at most one instance of a specific Task is running at any point in time:
// 1. Do not delete the managed Pod with 0 gracePeriodSeconds.
// For example, the default Pod deletion is acceptable.
// 2. Do not delete the Node which runs the managed Pod.
// For example, draining the Node before deleting it is acceptable.
// The instance can be universally located by its TaskAttemptInstanceUID or PodUID.
// See RetryPolicySpec and TaskAttemptStatus.
// 3. To ensure at most one instance of a specific Framework is running at any point in time:
// 1. Ensure at most one instance of a specific Task is running at any point in time.
// 2. Do not delete the managed ConfigMap with Background propagationPolicy.
// For example, the default ConfigMap deletion is acceptable.
// 3. Must delete the Framework with Foreground propagationPolicy.
// For example, the default Framework deletion may not be acceptable, since the default
// propagationPolicy for Framework object may be Background.
// The instance can be universally located by its FrameworkAttemptInstanceUID or ConfigMapUID.
// See RetryPolicySpec and FrameworkAttemptStatus.
// 4. To ensure there is no orphan object previously managed by FrameworkController:
// 1. Do not delete the Framework or the managed ConfigMap with Orphan propagationPolicy.
// For example, the default Framework and ConfigMap deletion is acceptable.
// 2. Do not change the OwnerReferences of the managed ConfigMap and Pods.
//////////////////////////////////////////////////////////////////////////////////////////////////
type Framework struct {
meta.TypeMeta `json:",inline"`
@@ -107,8 +86,31 @@ type TaskRoleSpec struct {
}

type TaskSpec struct {
RetryPolicy RetryPolicySpec `json:"retryPolicy"`
Pod core.PodTemplateSpec `json:"pod"`
RetryPolicy RetryPolicySpec `json:"retryPolicy"`

// If the Task's current associated Pod object is being deleted, i.e. graceful
// deletion, but the graceful deletion cannot finish within this timeout, then
// the Pod will be deleted forcefully by FrameworkController.
// Defaults to nil.
//
// If this timeout is not nil, the Pod may be deleted forcefully by FrameworkController.
// The force deletion does not wait for confirmation that the Pod has been fully
// terminated, and the Task will then be immediately transitioned to TaskAttemptCompleted.
// As a consequence, the Task will be immediately completed or retried with another
// new Pod, even though the old Pod may still be running.
// So, in this setting, the Task behaves like a ReplicaSet. Choose it if the Task
// favors availability over consistency, such as a stateless Task.
// However, to still make a best effort at graceful deletion and to tolerate
// transient deletion failures, this timeout should be longer than the Pod
// TerminationGracePeriodSeconds + minimal TolerationSeconds for TaintBasedEvictions.
//
// If this timeout is nil, the Pod will always be deleted gracefully, i.e. never
// be deleted forcefully by FrameworkController. This helps to guarantee that at most
// one instance of a specific Task is running at any point in time.
// So, in this setting, the Task behaves like a StatefulSet. Choose it if the Task
// favors consistency over availability, such as a stateful Task.
PodGracefulDeletionTimeoutSec *int64 `json:"podGracefulDeletionTimeoutSec"`
Pod core.PodTemplateSpec `json:"pod"`
}

type ExecutionType string
@@ -163,10 +165,9 @@ const (
// So, an attempt identified by its attempt id may be associated with multiple
// attempt instances over time, i.e. multiple instances may be run for the
// attempt over time, however, at most one instance is exposed into ApiServer
// over time and at most one instance is running at any point in time.
// So, the actual retried attempt instances maybe exceed the RetryPolicySpec
// in rare cases, however, the RetryPolicyStatus will never exceed the
// RetryPolicySpec.
// over time.
// So, the actual retried attempt instances may exceed the RetryPolicySpec in
// rare cases, however, the RetryPolicyStatus will never exceed the RetryPolicySpec.
// 2. Resort to other spec to control other kind of RetryPolicy:
// 1. Container RetryPolicy is the RestartPolicy in Pod Spec.
// See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
5 changes: 5 additions & 0 deletions pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions pkg/barrier/barrier.go
@@ -271,6 +271,7 @@ func (b *FrameworkBarrier) Run() {
if isPermanentErr {
exit(ci.CompletionCodeContainerPermanentFailed)
} else {
// May also time out, but still treat it as an Unknown Error
exit(ci.CompletionCode(1))
}
}
8 changes: 8 additions & 0 deletions pkg/common/utils.go
@@ -87,6 +87,14 @@ func PtrUIDStr(s string) *types.UID {
return PtrUID(types.UID(s))
}

func PtrDeletionPropagation(o meta.DeletionPropagation) *meta.DeletionPropagation {
return &o
}

func PtrTime(o meta.Time) *meta.Time {
return &o
}

func PtrNow() *meta.Time {
now := meta.Now()
return &now