- Framework Interop
- Container EnvironmentVariable
- Pod Failure Classification
- Predefined CompletionCode
- CompletionStatus
- RetryPolicy
- FrameworkAttemptCompletionPolicy
- Framework and Pod History
- Controller Extension
- Best Practice
As Framework is actually a Kubernetes CRD, all CRD Clients can be used to interoperate with it, such as:
- kubectl
    kubectl create -f {Framework File Path}
    # Note this is not Foreground Deletion, see [DELETE Framework] section
    kubectl delete framework {FrameworkName}
    kubectl get framework {FrameworkName}
    kubectl describe framework {FrameworkName}
    kubectl get frameworks
    kubectl describe frameworks
    ...
- Kubernetes Client Library
- Any HTTP Client
API Kind | Operations |
---|---|
Framework | CREATE PATCH DELETE GET LIST WATCH WATCH_LIST |
ConfigMap | All operations except for CREATE PUT PATCH |
Pod | All operations except for CREATE PUT PATCH |
Request
POST /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
Body: Framework
Type: application/json or application/yaml
Description
Create the specified Framework.
Response
Code | Body | Description |
---|---|---|
OK(200) | Framework | Return current Framework. |
Created(201) | Framework | Return current Framework. |
Accepted(202) | Framework | Return current Framework. |
Conflict(409) | Status | The specified Framework already exists. |
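
As a rough illustration of the request above, the following sketch builds (but does not send) the POST request with Python's standard library. The API server address, the namespace, and the deliberately incomplete Framework object are placeholders chosen for this example, and authentication is omitted entirely.

```python
import json
import urllib.request

API_GROUP_PATH = "/apis/frameworkcontroller.microsoft.com/v1"


def build_create_request(api_server: str, namespace: str, framework: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a Framework."""
    url = f"{api_server.rstrip('/')}{API_GROUP_PATH}/namespaces/{namespace}/frameworks"
    return urllib.request.Request(
        url,
        data=json.dumps(framework).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical, incomplete Framework object: only the identifying fields are
# shown here, a real one also needs a full spec.
framework = {
    "apiVersion": "frameworkcontroller.microsoft.com/v1",
    "kind": "Framework",
    "metadata": {"name": "demo"},
}

# Placeholder API server address, not a real endpoint.
request = build_create_request("https://k8s.example.invalid:6443", "default", framework)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a real client would also need credentials and TLS settings for the cluster.
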
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
{
"spec": {
"executionType": "Stop"
}
}
Type: application/merge-patch+json
Description
Stop the specified Framework:
All running containers of the Framework will be stopped while the object of the Framework is still kept.
Response
Code | Body | Description |
---|---|---|
OK(200) | Framework | Return current Framework. |
NotFound(404) | Status | The specified Framework is not found. |
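
The `application/merge-patch+json` type means the body is a JSON Merge Patch: only the fields named in the patch change, and everything else on the Framework is left as is. The sketch below applies the patch locally to show that effect. The sample Framework, including the `Start` value and the `retryPolicy` field names, is an assumption made for illustration.

```python
import json


def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the patched copy."""
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # A null in the patch removes the field.
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Illustrative Framework fragment (assumed field values, not a complete spec).
framework = {
    "metadata": {"name": "demo"},
    "spec": {
        "executionType": "Start",
        "retryPolicy": {"fancyRetryPolicy": False, "maxRetryCount": 0},
    },
}

stop_patch = {"spec": {"executionType": "Stop"}}
patched = merge_patch(framework, stop_patch)
print(json.dumps(patched["spec"], indent=2))
```

Only `spec.executionType` differs between `framework` and `patched`, which is why the stop request does not need to resend the whole Framework.
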
Request
DELETE /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
application/json
{
"propagationPolicy": "Foreground"
}
application/yaml
propagationPolicy: Foreground
Type: application/json or application/yaml
Description
Delete the specified Framework.
Notes:
- If you need to ensure that at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, always use, and only use, the Foreground Deletion in the provided body, see Framework Notes. However, `kubectl delete` does not support specifying Foreground Deletion, at least as of Kubernetes v1.14.2, so you may have to use another Supported Client.
Response
Code | Body | Description |
---|---|---|
OK(200) | Framework | The specified Framework is deleting. Return current Framework. |
OK(200) | Status | The specified Framework is deleted. |
NotFound(404) | Status | The specified Framework is not found. |
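
Because two different 200 responses are possible, a client has to look at the returned object to tell "deleting" from "deleted". A minimal sketch of that check follows, assuming the response body has already been parsed into a dict and relying on the `kind` field that Kubernetes objects carry. The function name and return labels are made up for this example.

```python
import json

# Body to send with the DELETE request to ask for Foreground Deletion.
FOREGROUND_DELETE_BODY = json.dumps({"propagationPolicy": "Foreground"})


def interpret_delete_response(status_code: int, body: dict) -> str:
    """Map a DELETE response to a coarse state, following the table above."""
    kind = body.get("kind")
    if status_code == 404:
        return "not_found"
    if status_code == 200 and kind == "Framework":
        # The Framework object is returned while deletion is still in progress.
        return "deleting"
    if status_code == 200 and kind == "Status":
        return "deleted"
    return "unexpected"


print(interpret_delete_response(200, {"kind": "Framework"}))
print(interpret_delete_response(200, {"kind": "Status"}))
```

A caller that needs the at-most-one-instance guarantee could poll or watch until it sees the deleted (or not found) state before creating a Framework with the same name.
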
Request
GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Description
Get the specified Framework.
Response
Code | Body | Description |
---|---|---|
OK(200) | Framework | Return current Framework. |
NotFound(404) | Status | The specified Framework is not found. |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/frameworks
QueryParameters: Same as StatefulSet QueryParameters
Description
Get all Frameworks (in the specified FrameworkNamespace).
Response
Code | Body | Description |
---|---|---|
OK(200) | FrameworkList | Return all Frameworks (in the specified FrameworkNamespace). |
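
The two request forms differ only in whether a namespace segment is present. The sketch below builds either URL and appends query parameters. `labelSelector` and `limit` are standard Kubernetes list parameters used here as examples, while the helper name and the server address are placeholders.

```python
from typing import Optional
from urllib.parse import urlencode


def frameworks_list_url(api_server: str, namespace: Optional[str] = None, **query) -> str:
    """Return the LIST URL for one namespace, or for all namespaces if None."""
    base = f"{api_server.rstrip('/')}/apis/frameworkcontroller.microsoft.com/v1"
    path = f"/namespaces/{namespace}/frameworks" if namespace else "/frameworks"
    query_string = urlencode(query)
    return f"{base}{path}" + (f"?{query_string}" if query_string else "")


# Placeholder server address and label, chosen only for this example.
namespaced = frameworks_list_url(
    "https://k8s.example.invalid:6443", "default", labelSelector="team=ml", limit=50
)
cluster_wide = frameworks_list_url("https://k8s.example.invalid:6443")
print(namespaced)
print(cluster_wide)
```
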
Request
GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
QueryParameters: Same as StatefulSet QueryParameters
Description
Watch the change events of the specified Framework.
Response
Code | Body | Description |
---|---|---|
OK(200) | WatchEvent | Streaming the change events of the specified Framework. |
NotFound(404) | Status | The specified Framework is not found. |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/watch/frameworks
QueryParameters: Same as StatefulSet QueryParameters
Description
Watch the change events of all Frameworks (in the specified FrameworkNamespace).
Response
Code | Body | Description |
---|---|---|
OK(200) | WatchEvent | Streaming the change events of all Frameworks (in the specified FrameworkNamespace). |
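
A watch response is a stream in which each line is one JSON-encoded WatchEvent with a `type` and the affected `object`. The following sketch parses such a stream from any iterable of lines. The sample events are hand-written for illustration and contain only a name, not a full Framework.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_watch_events(lines: Iterable[bytes]) -> Iterator[Tuple[str, dict]]:
    """Yield (event type, object) pairs from newline-delimited watch events."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            # Skip blank lines between events.
            continue
        event = json.loads(raw)
        yield event["type"], event["object"]


# Hand-written sample stream (assumed content, for illustration only).
sample_stream = [
    b'{"type": "ADDED", "object": {"kind": "Framework", "metadata": {"name": "demo"}}}\n',
    b"\n",
    b'{"type": "MODIFIED", "object": {"kind": "Framework", "metadata": {"name": "demo"}}}\n',
    b'{"type": "DELETED", "object": {"kind": "Framework", "metadata": {"name": "demo"}}}\n',
]

events = list(iter_watch_events(sample_stream))
for event_type, obj in events:
    print(event_type, obj["metadata"]["name"])
```

With a real connection, the open HTTP response object can be passed in place of `sample_stream`, since iterating it also yields lines of bytes.
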
You can specify how to classify and summarize Pod failures by PodFailureSpec.
You can leverage the Predefined CompletionCode to instruct your RetryPolicy and to identify a certain predefined CompletionCode, regardless of which PodFailureSpec is configured in each cluster.
CompletionStatus: It is generated from Predefined CompletionCode or PodPattern matching. For a Pod, if no PodPattern is matched and a failed Container exists, the CompletionCode is the same as the ExitCode of the last failed Container.
TaskAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a TaskAttempt.
FrameworkAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a FrameworkAttempt.
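
The fallback rule for the CompletionCode (no PodPattern matched, but a failed Container exists) can be sketched as below. This is only an interpretation: it assumes "failed" means a non-zero exit code and "last" means the most recently finished Container, and the `ContainerResult` type is hypothetical rather than part of the Framework API.

```python
from typing import NamedTuple, Optional, Sequence


class ContainerResult(NamedTuple):
    """Hypothetical summary of one terminated Container."""
    name: str
    exit_code: int
    finished_at: float  # seconds since the epoch


def fallback_completion_code(containers: Sequence[ContainerResult]) -> Optional[int]:
    """Return the ExitCode of the last failed Container, or None if none failed."""
    failed = [c for c in containers if c.exit_code != 0]
    if not failed:
        return None
    # Assumption: "last" is the failed Container that finished most recently.
    return max(failed, key=lambda c: c.finished_at).exit_code


containers = [
    ContainerResult("init", 0, 100.0),
    ContainerResult("worker", 137, 250.0),
    ContainerResult("sidecar", 1, 200.0),
]
print(fallback_completion_code(containers))
```
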
Notes:
- Italic Conditions can be inherited from the DEFAULT RetryPolicy, so in principle they need not be specified explicitly. However, Framework Spec Defaulting is not supported yet, so for now you still need to specify them explicitly.
- For the definition of each CompletionType, such as Transient Failed, see CompletionStatus.
FrameworkType | Framework RetryPolicy | TaskRole | Task RetryPolicy | Description |
---|---|---|---|---|
DEFAULT | FancyRetryPolicy = false, MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = false, MaxRetryCount = 0 | The default RetryPolicy: Never Retry for any Failed or Succeeded. |
| | | TaskRole-B | FancyRetryPolicy = false, MaxRetryCount = 0 | |
Service | FancyRetryPolicy = false, MaxRetryCount = -2 | TaskRole-A | FancyRetryPolicy = false, MaxRetryCount = -2 | Always Retry for any Failed or Succeeded. |
Blind Batch | FancyRetryPolicy = false, MaxRetryCount = -1 | TaskRole-A | FancyRetryPolicy = false, MaxRetryCount = -1 | Always Retry for any Failed. Never Retry for Succeeded. |
Batch with Task Fault Tolerance | FancyRetryPolicy = true, MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = true, MaxRetryCount = 3 | Always Retry for Transient Failed. Never Retry for Permanent Failed or Succeeded. Retry up to 3 times for Unknown Failed. |
Batch without Task Fault Tolerance | FancyRetryPolicy = true, MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = false, MaxRetryCount = 0 | For Framework RetryPolicy, same as "Batch with Task Fault Tolerance". For Task RetryPolicy, the Task cannot tolerate any failed TaskAttempt (for example, it cannot recover from a previous failed TaskAttempt), so Never Retry the Task for any Failed or Succeeded. |
Debug Mode | FancyRetryPolicy = true, MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = true, MaxRetryCount = 0 | Always Retry for Transient Failed. Never Retry for Permanent Failed or Unknown Failed or Succeeded. This can help to capture the unexpected exit of the user application itself. |
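
The retry behaviour described in the table can be condensed into one decision function. The sketch below is inferred from the table rows rather than taken from the controller's code. In particular, the `retried_count` parameter and the way it is compared with a positive MaxRetryCount are assumptions.

```python
from enum import Enum


class CompletionType(Enum):
    SUCCEEDED = "Succeeded"
    TRANSIENT_FAILED = "Transient Failed"
    PERMANENT_FAILED = "Permanent Failed"
    UNKNOWN_FAILED = "Unknown Failed"


def should_retry(fancy_retry_policy: bool, max_retry_count: int,
                 completion: CompletionType, retried_count: int = 0) -> bool:
    """Decide whether to retry, following the examples in the table above."""
    if fancy_retry_policy:
        if completion is CompletionType.TRANSIENT_FAILED:
            return True
        if completion is CompletionType.PERMANENT_FAILED:
            return False
    if max_retry_count == -2:
        return True  # retry for any Failed or Succeeded
    failed = completion is not CompletionType.SUCCEEDED
    if max_retry_count == -1:
        return failed  # retry for any Failed, never for Succeeded
    # Assumption: retried_count is the number of retries already spent.
    return failed and retried_count < max_retry_count


# "Batch with Task Fault Tolerance": Unknown Failed is retried up to 3 times.
print([should_retry(True, 3, CompletionType.UNKNOWN_FAILED, n) for n in range(5)])
```
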
Notes:
- Italic Conditions can be inherited from the DEFAULT FrameworkAttemptCompletionPolicy, so in principle they need not be specified explicitly. However, Framework Spec Defaulting is not supported yet, so for now you still need to specify them explicitly.
FrameworkType | TaskRole | FrameworkAttemptCompletionPolicy | Description |
---|---|---|---|
DEFAULT | TaskRole-A | MinFailedTaskCount = 1, MinSucceededTaskCount = -1 | The default FrameworkAttemptCompletionPolicy: Fail the FrameworkAttempt immediately if any Task failed. Succeed the FrameworkAttempt only once all Tasks have succeeded. |
| | TaskRole-B | MinFailedTaskCount = 1, MinSucceededTaskCount = -1 | |
Service | TaskRole-A | MinFailedTaskCount = 1, MinSucceededTaskCount = -1 | Actually, any FrameworkAttemptCompletionPolicy is fine, since Service's Task will never complete, i.e. its Task's MaxRetryCount is -2, see RetryPolicy Example. |
MapReduce | Map | MinFailedTaskCount = {Map.TaskNumber} * {mapreduce.map.failures.maxpercent} + 1, MinSucceededTaskCount = -1 | A few failed Tasks are acceptable, but always want to wait for all Tasks to succeed: Fail the FrameworkAttempt immediately if the failed Tasks exceed the limit. Succeed the FrameworkAttempt only once all Tasks have completed and the failed Tasks are within the limit. |
| | Reduce | MinFailedTaskCount = {Reduce.TaskNumber} * {mapreduce.reduce.failures.maxpercent} + 1, MinSucceededTaskCount = -1 | |
TensorFlow | ParameterServer | MinFailedTaskCount = 1, MinSucceededTaskCount = -1 | Succeeding a certain TaskRole is enough, and there is no need to wait for all Tasks to succeed: Fail the FrameworkAttempt immediately if any Task failed. Succeed the FrameworkAttempt immediately if all of Worker's Tasks succeeded. |
| | Worker | MinFailedTaskCount = 1, MinSucceededTaskCount = {Worker.TaskNumber} | |
Arbitrator Dominated | Arbitrator | MinFailedTaskCount = 1, MinSucceededTaskCount = 1 | The FrameworkAttemptCompletionPolicy is fully delegated to the single instance arbitrator of the user application: Fail the FrameworkAttempt immediately if the arbitrator failed. Succeed the FrameworkAttempt immediately if the arbitrator succeeded. |
| | TaskRole-A | MinFailedTaskCount = -1, MinSucceededTaskCount = -1 | |
| | TaskRole-B | MinFailedTaskCount = -1, MinSucceededTaskCount = -1 | |
First Completed Task Dominated | TaskRole-A | MinFailedTaskCount = 1, MinSucceededTaskCount = 1 | The FrameworkAttemptCompletionPolicy is fully delegated to the first completed Task of the user application: Fail the FrameworkAttempt immediately if any Task failed. Succeed the FrameworkAttempt immediately if any Task succeeded. |
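
To make the interplay of MinFailedTaskCount and MinSucceededTaskCount concrete, here is a small sketch of the completion decision as the table describes it. It is a reading of the examples, not the controller's implementation. The `TaskRoleState` type is hypothetical, and checking failure conditions before success conditions is an assumption.

```python
from typing import NamedTuple, Optional, Sequence


class TaskRoleState(NamedTuple):
    """Hypothetical snapshot of one TaskRole within a FrameworkAttempt."""
    name: str
    task_number: int
    failed: int
    succeeded: int
    min_failed_task_count: int      # -1 disables the condition
    min_succeeded_task_count: int   # -1 disables the condition


def framework_attempt_outcome(roles: Sequence[TaskRoleState]) -> Optional[str]:
    """Return "Failed", "Succeeded", or None while the attempt keeps running."""
    for role in roles:
        if 1 <= role.min_failed_task_count <= role.failed:
            return "Failed"
    for role in roles:
        if 1 <= role.min_succeeded_task_count <= role.succeeded:
            return "Succeeded"
    # No condition triggered: succeed only once every Task has completed.
    if all(r.failed + r.succeeded == r.task_number for r in roles):
        return "Succeeded"
    return None


# TensorFlow row: all 4 Workers succeeded while ParameterServers still run.
tensorflow = [
    TaskRoleState("ParameterServer", 2, 0, 0, 1, -1),
    TaskRoleState("Worker", 4, 0, 4, 1, 4),
]
print(framework_attempt_outcome(tensorflow))
```
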
By leveraging LogObjectSnapshot, external systems, such as Fluentd and ElasticSearch, can collect and process Framework and Pod history snapshots even after the Framework or Pod has been retried or deleted, for purposes such as persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
- Usage
- Example: FrameworkBarrier Example, TensorFlow Example, etc.
- Usage
- Example: TensorFlow Example, etc.