user-manual.md

User Manual

Index

Framework Interop
Container EnvironmentVariable
Pod Failure Classification
Predefined CompletionCode
CompletionStatus
RetryPolicy
FrameworkAttemptCompletionPolicy
Framework and Pod History
Controller Extension
- FrameworkBarrier
- HivedScheduler
Best Practice

Framework Interop

Supported Client

Because Framework is a Kubernetes CRD, any CRD client can be used to interoperate with it, such as:

kubectl

kubectl create -f {Framework File Path}
# Note: this is not a Foreground Deletion; see the DELETE Framework section.
kubectl delete framework {FrameworkName}
kubectl get framework {FrameworkName}
kubectl describe framework {FrameworkName}
kubectl get frameworks
kubectl describe frameworks
...

Kubernetes Client Library
Any HTTP Client

Supported Interoperation

API Kind	Operations
Framework	CREATE PATCH DELETE GET LIST WATCH WATCH_LIST
ConfigMap	All operations except for CREATE PUT PATCH
Pod	All operations except for CREATE PUT PATCH

CREATE Framework

Request

POST /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks

Body: Framework

Type: application/json or application/yaml

Description

Create the specified Framework.

Response

Code	Body	Description
OK(200)	Framework	Return current Framework.
Created(201)	Framework	Return current Framework.
Accepted(202)	Framework	Return current Framework.
Conflict(409)	Status	The specified Framework already exists.
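As a minimal sketch of the CREATE request above, the following Python snippet builds the POST request with the standard library only. The helper name `build_create_request`, the API server address, and the sample manifest are illustrative assumptions, not part of FrameworkController. Authentication and TLS setup are omitted, and the request is built but not sent.

```python
import json
import urllib.request

API_GROUP_PATH = "/apis/frameworkcontroller.microsoft.com/v1"


def build_create_request(api_server: str, namespace: str,
                         framework: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a Framework."""
    url = f"{api_server}{API_GROUP_PATH}/namespaces/{namespace}/frameworks"
    return urllib.request.Request(
        url,
        data=json.dumps(framework).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical and deliberately incomplete manifest: a real Framework also needs a spec.
framework = {
    "apiVersion": "frameworkcontroller.microsoft.com/v1",
    "kind": "Framework",
    "metadata": {"name": "example"},
}
# The server address is an assumption for illustration.
request = build_create_request("https://localhost:6443", "default", framework)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would additionally require credentials accepted by your cluster.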

PATCH Framework

Stop Framework

Request

PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

{
  "spec": {
    "executionType": "Stop"
  }
}

Type: application/merge-patch+json

Description

Stop the specified Framework:

All running containers of the Framework will be stopped, while the Framework object itself is kept.

Response

Code	Body	Description
OK(200)	Framework	Return current Framework.
NotFound(404)	Status	The specified Framework is not found.
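The Stop request above can be sketched in Python as a JSON merge patch. The helper name `build_stop_request` and the server address are hypothetical, authentication is omitted, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_stop_request(api_server: str, namespace: str,
                       name: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that stops a Framework."""
    url = (f"{api_server}/apis/frameworkcontroller.microsoft.com/v1"
           f"/namespaces/{namespace}/frameworks/{name}")
    # Body taken from the Stop Framework section of this manual.
    patch = {"spec": {"executionType": "Stop"}}
    return urllib.request.Request(
        url,
        data=json.dumps(patch).encode("utf-8"),
        # A merge patch only touches the fields present in the body.
        headers={"Content-Type": "application/merge-patch+json"},
        method="PATCH",
    )


# Server address, namespace, and Framework name are assumptions for illustration.
stop_request = build_stop_request("https://localhost:6443", "default", "example")
print(stop_request.get_method(), stop_request.full_url)
```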

DELETE Framework

Request

DELETE /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

application/json

{
  "propagationPolicy": "Foreground"
}

application/yaml

propagationPolicy: Foreground

Type: application/json or application/yaml

Description

Delete the specified Framework.

Notes:

If you need to ensure that at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always use, and only use, the Foreground Deletion in the provided body; see Framework Notes. However, kubectl delete does not support specifying Foreground Deletion, at least as of Kubernetes v1.14.2, so you may have to use another Supported Client.

Response

Code	Body	Description
OK(200)	Framework	The specified Framework is deleting. Return current Framework.
OK(200)	Status	The specified Framework is deleted.
NotFound(404)	Status	The specified Framework is not found.
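Because the response table lists two different OK(200) outcomes, a client usually has to inspect the body to tell them apart. The sketch below is one possible way to do that in Python. It assumes the returned JSON object carries a `kind` field (`Framework` or `Status`), as Kubernetes API objects generally do. The function name and the returned labels are hypothetical.

```python
import json

# Request body from the DELETE Framework section (application/json form).
FOREGROUND_DELETE_BODY = json.dumps({"propagationPolicy": "Foreground"})


def interpret_delete_response(status_code: int, body: dict) -> str:
    """Map a DELETE response to 'deleting', 'deleted', or 'not_found'."""
    if status_code == 404:
        return "not_found"
    if status_code == 200 and body.get("kind") == "Framework":
        # The Framework object still exists and is being deleted.
        return "deleting"
    if status_code == 200 and body.get("kind") == "Status":
        # The Framework object is already gone.
        return "deleted"
    raise ValueError(f"unexpected DELETE response: {status_code}")


print(interpret_delete_response(200, {"kind": "Framework"}))
```

With Foreground Deletion, a caller that needs the at-most-one-instance guarantee would typically keep polling until the Framework is reported as deleted or not found before creating a replacement.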

GET Framework

Request

GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Description

Get the specified Framework.

Response

Code	Body	Description
OK(200)	Framework	Return current Framework.
NotFound(404)	Status	The specified Framework is not found.

LIST Frameworks

Request

GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/frameworks

QueryParameters: Same as StatefulSet QueryParameters

Description

Get all Frameworks (in the specified FrameworkNamespace).

Response

Code	Body	Description
OK(200)	FrameworkList	Return all Frameworks (in the specified FrameworkNamespace).
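The two LIST endpoints differ only in whether a namespace segment is present. A small Python sketch of building either URL follows. The helper name `build_list_url` is hypothetical, and `labelSelector` and `limit` are used as examples of the standard Kubernetes list query parameters that StatefulSet also accepts.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_url(api_server: str, namespace: Optional[str] = None,
                   **query: object) -> str:
    """Return the namespaced LIST URL, or the cluster-wide one if namespace is None."""
    base = f"{api_server}/apis/frameworkcontroller.microsoft.com/v1"
    path = f"/namespaces/{namespace}/frameworks" if namespace else "/frameworks"
    # urlencode percent-encodes values, e.g. '=' inside a label selector.
    suffix = f"?{urlencode(query)}" if query else ""
    return base + path + suffix


# The server address and selector are assumptions for illustration.
print(build_list_url("https://localhost:6443"))
print(build_list_url("https://localhost:6443", "default",
                     labelSelector="app=demo", limit=10))
```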

WATCH Framework

Request

GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

QueryParameters: Same as StatefulSet QueryParameters

Description

Watch the change events of the specified Framework.

Response

Code	Body	Description
OK(200)	WatchEvent	Streaming the change events of the specified Framework.
NotFound(404)	Status	The specified Framework is not found.

WATCH_LIST Frameworks

Request

GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/watch/frameworks

QueryParameters: Same as StatefulSet QueryParameters

Description

Watch the change events of all Frameworks (in the specified FrameworkNamespace).

Response

Code	Body	Description
OK(200)	WatchEvent	Streaming the change events of all Frameworks (in the specified FrameworkNamespace).
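A Kubernetes watch response is streamed as one JSON-encoded WatchEvent per line, each with a `type` and an `object` field. The Python sketch below shows one way a client might consume such a stream. The function name `iter_watch_events` and the sample lines are illustrative assumptions. In practice the lines would come from an open HTTP response rather than a list.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_watch_events(lines: Iterable[bytes]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (event type, Framework name) for each non-empty line of a watch stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank keep-alive lines
        event = json.loads(raw)
        # Some events (for example errors) may carry an object without metadata.
        name = event["object"].get("metadata", {}).get("name")
        yield event["type"], name


# Hypothetical sample stream for illustration only.
sample_stream = [
    b'{"type": "ADDED", "object": {"kind": "Framework", "metadata": {"name": "fw-a"}}}\n',
    b'\n',
    b'{"type": "MODIFIED", "object": {"kind": "Framework", "metadata": {"name": "fw-a"}}}\n',
    b'{"type": "DELETED", "object": {"kind": "Framework", "metadata": {"name": "fw-a"}}}\n',
]
events = list(iter_watch_events(sample_stream))
print(events)
```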

Container EnvironmentVariable

Pod Failure Classification

You can specify how to classify and summarize Pod failures by PodFailureSpec.

Predefined CompletionCode

You can leverage the Predefined CompletionCode to instruct your RetryPolicy and to identify a certain predefined CompletionCode, regardless of the different PodFailureSpec that may be configured in different clusters.

CompletionStatus

CompletionStatus: It is generated from Predefined CompletionCode or PodPattern matching. For a Pod, if no PodPattern is matched and a failed Container exists, the CompletionCode is the same as the ExitCode of the last failed Container.

TaskAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a TaskAttempt.

FrameworkAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a FrameworkAttempt.
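The fallback rule for a Pod with no matched PodPattern can be illustrated with a short Python sketch. It rests on two assumptions that this manual does not spell out: a Container counts as failed when its exit code is non-zero, and "last" means the latest finish time. The `ContainerResult` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContainerResult:
    name: str
    exit_code: int
    finished_at: float  # seconds since epoch; assumption used to order Containers


def fallback_completion_code(containers: List[ContainerResult]) -> Optional[int]:
    """Return the exit code of the last failed Container, or None if none failed."""
    failed = [c for c in containers if c.exit_code != 0]
    if not failed:
        return None
    return max(failed, key=lambda c: c.finished_at).exit_code


# Hypothetical Pod with three terminated Containers.
pod_containers = [
    ContainerResult("init", exit_code=1, finished_at=10.0),
    ContainerResult("sidecar", exit_code=0, finished_at=30.0),
    ContainerResult("main", exit_code=137, finished_at=20.0),
]
print(fallback_completion_code(pod_containers))
```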

RetryPolicy

Spec

RetryPolicySpec

Usage

RetryPolicySpec

Example

Notes:

Italic Conditions can be inherited from the DEFAULT RetryPolicy, so in principle they do not need to be specified explicitly.

However, you still need to specify them explicitly for now, because Framework Spec Defaulting is not yet supported.
For the definition of each CompletionType, such as Transient Failed, see CompletionStatus.

FrameworkType	Framework RetryPolicy	TaskRole	Task RetryPolicy	Description
DEFAULT	FancyRetryPolicy = false MaxRetryCount = 0	TaskRole-A	FancyRetryPolicy = false MaxRetryCount = 0	The default RetryPolicy: Never Retry for any Failed or Succeeded.
DEFAULT	FancyRetryPolicy = false MaxRetryCount = 0	TaskRole-B	FancyRetryPolicy = false MaxRetryCount = 0
Service	FancyRetryPolicy = false MaxRetryCount = -2	TaskRole-A	FancyRetryPolicy = false MaxRetryCount = -2	Always Retry for any Failed or Succeeded.
Blind Batch	FancyRetryPolicy = false MaxRetryCount = -1	TaskRole-A	FancyRetryPolicy = false MaxRetryCount = -1	Always Retry for any Failed. Never Retry for Succeeded.
Batch with Task Fault Tolerance	FancyRetryPolicy = true MaxRetryCount = 3	TaskRole-A	FancyRetryPolicy = true MaxRetryCount = 3	Always Retry for Transient Failed. Never Retry for Permanent Failed or Succeeded. Retry up to 3 times for Unknown Failed.
Batch without Task Fault Tolerance	FancyRetryPolicy = true MaxRetryCount = 3	TaskRole-A	FancyRetryPolicy = false MaxRetryCount = 0	Framework RetryPolicy: same as "Batch with Task Fault Tolerance". Task RetryPolicy: the Task cannot tolerate any failed TaskAttempt (for example, it cannot recover from a previous failed TaskAttempt), so Never Retry the Task for any Failed or Succeeded.
Debug Mode	FancyRetryPolicy = true MaxRetryCount = 0	TaskRole-A	FancyRetryPolicy = true MaxRetryCount = 0	Always Retry for Transient Failed. Never Retry for Permanent Failed or Unknown Failed or Succeeded. This can help to capture the unexpected exit of user application itself.
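The rows of the example table can be read as a small decision procedure. The Python sketch below encodes that reading. It is inferred from the table only and is not the controller's implementation. The names `CompletionType`, `should_retry`, and `retried_count` are hypothetical, and which earlier retries count toward `retried_count` is left to the caller.

```python
from enum import Enum


class CompletionType(Enum):
    SUCCEEDED = "Succeeded"
    TRANSIENT_FAILED = "Transient Failed"
    PERMANENT_FAILED = "Permanent Failed"
    UNKNOWN_FAILED = "Unknown Failed"


def should_retry(fancy_retry_policy: bool, max_retry_count: int,
                 completion: CompletionType, retried_count: int) -> bool:
    """Decide whether to retry, following the example table in this section."""
    if fancy_retry_policy:
        if completion is CompletionType.TRANSIENT_FAILED:
            return True   # Always Retry for Transient Failed
        if completion is CompletionType.PERMANENT_FAILED:
            return False  # Never Retry for Permanent Failed
    if max_retry_count == -2:
        return True       # Always Retry for any Failed or Succeeded
    failed = completion is not CompletionType.SUCCEEDED
    if max_retry_count == -1:
        return failed     # Always Retry for any Failed
    return failed and retried_count < max_retry_count


# "Batch with Task Fault Tolerance": FancyRetryPolicy = true, MaxRetryCount = 3.
print(should_retry(True, 3, CompletionType.UNKNOWN_FAILED, retried_count=2))
```

For instance, the "Debug Mode" row (FancyRetryPolicy = true, MaxRetryCount = 0) retries only Transient Failed under this reading, so an unexpected exit of the user application surfaces immediately.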

FrameworkAttemptCompletionPolicy

Spec

CompletionPolicySpec

Usage

CompletionPolicySpec

Example

Notes:

Italic Conditions can be inherited from the DEFAULT FrameworkAttemptCompletionPolicy, so in principle they do not need to be specified explicitly.

However, you still need to specify them explicitly for now, because Framework Spec Defaulting is not yet supported.

FrameworkType	TaskRole	FrameworkAttemptCompletionPolicy	Description
DEFAULT	TaskRole-A	MinFailedTaskCount = 1 MinSucceededTaskCount = -1	The default FrameworkAttemptCompletionPolicy: Fail the FrameworkAttempt immediately if any Task failed. Succeed the FrameworkAttempt only after all Tasks succeeded.
DEFAULT	TaskRole-B	MinFailedTaskCount = 1 MinSucceededTaskCount = -1
Service	TaskRole-A	MinFailedTaskCount = 1 MinSucceededTaskCount = -1	Any FrameworkAttemptCompletionPolicy is fine here, since a Service's Task never completes, i.e. its Task's MaxRetryCount is -2, see RetryPolicy Example.
MapReduce	Map	MinFailedTaskCount = {Map.TaskNumber} * {mapreduce.map.failures.maxpercent} + 1 MinSucceededTaskCount = -1	A few failed Tasks are acceptable, but the FrameworkAttempt should still wait for all Tasks to complete: Fail the FrameworkAttempt immediately if the number of failed Tasks exceeds the limit. Succeed the FrameworkAttempt only after all Tasks completed and the number of failed Tasks is within the limit.
MapReduce	Reduce	MinFailedTaskCount = {Reduce.TaskNumber} * {mapreduce.reduce.failures.maxpercent} + 1 MinSucceededTaskCount = -1
TensorFlow	ParameterServer	MinFailedTaskCount = 1 MinSucceededTaskCount = -1	Success of a certain TaskRole is enough, and there is no need to wait for all Tasks to succeed: Fail the FrameworkAttempt immediately if any Task failed. Succeed the FrameworkAttempt immediately if all of Worker's Tasks succeeded.
TensorFlow	Worker	MinFailedTaskCount = 1 MinSucceededTaskCount = {Worker.TaskNumber}
Arbitrator Dominated	Arbitrator	MinFailedTaskCount = 1 MinSucceededTaskCount = 1	The FrameworkAttemptCompletionPolicy is fully delegated to the single instance arbitrator of the user application: Fail the FrameworkAttempt immediately if the arbitrator failed. Succeed the FrameworkAttempt immediately if the arbitrator succeeded.
	TaskRole-A	MinFailedTaskCount = -1 MinSucceededTaskCount = -1
	TaskRole-B	MinFailedTaskCount = -1 MinSucceededTaskCount = -1
First Completed Task Dominated	TaskRole-A	MinFailedTaskCount = 1 MinSucceededTaskCount = 1	The FrameworkAttemptCompletionPolicy is fully delegated to the first completed Task of the user application: Fail the FrameworkAttempt immediately if any Task failed. Succeed the FrameworkAttempt immediately if any Task succeeded.
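The thresholds in the example table can be sketched as a per-TaskRole check in Python. This is a reading of the table, not the controller's code. It assumes -1 disables a threshold, that a FrameworkAttempt whose Tasks all completed without triggering a threshold succeeds, that the MapReduce limit is given as a fraction, and that the product is rounded down. All function and parameter names are hypothetical.

```python
import math
from typing import Optional


def min_failed_task_count(task_number: int, max_failure_fraction: float) -> int:
    """{TaskNumber} * {failures.maxpercent} + 1, as in the MapReduce rows."""
    return math.floor(task_number * max_failure_fraction) + 1


def framework_attempt_outcome(min_failed: int, min_succeeded: int,
                              failed: int, succeeded: int,
                              total: int) -> Optional[str]:
    """Return 'Failed', 'Succeeded', or None while the attempt is still running."""
    if min_failed != -1 and failed >= min_failed:
        return "Failed"      # failure threshold reached: fail immediately
    if min_succeeded != -1 and succeeded >= min_succeeded:
        return "Succeeded"   # success threshold reached: succeed immediately
    if failed + succeeded == total:
        return "Succeeded"   # everything completed within the limits
    return None


# MapReduce-style example: 10 Map Tasks, half of them allowed to fail.
limit = min_failed_task_count(10, 0.5)
print(limit, framework_attempt_outcome(limit, -1, failed=5, succeeded=5, total=10))
```

A real Framework has several TaskRoles, so the controller has to combine these checks across roles. The sketch only covers a single role.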

Framework and Pod History

By leveraging LogObjectSnapshot, external systems such as Fluentd and ElasticSearch can collect and process Framework and Pod history snapshots, even after the Framework or Pod was retried or deleted, for purposes such as persistence, metrics conversion, visualization, alerting, acting, and analysis.

Controller Extension

FrameworkBarrier

Usage
Example: FrameworkBarrier Example, TensorFlow Example, etc.

HivedScheduler

Usage
Example: TensorFlow Example, etc.

Best Practice
