This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

381 lines (302 loc) · 18.7 KB

user-manual.md


Because a Framework is a Kubernetes CRD, any CRD client can be used to operate on it, such as:

  1. kubectl
    kubectl create -f {Framework File Path}
    # Note this is not Foreground Deletion, see [DELETE Framework] section
    kubectl delete framework {FrameworkName}
    kubectl get framework {FrameworkName}
    kubectl describe framework {FrameworkName}
    kubectl get frameworks
    kubectl describe frameworks
    ...
  2. Kubernetes Client Library
  3. Any HTTP Client

| API Kind | Operations |
|:---|:---|
| Framework | CREATE PATCH DELETE GET LIST WATCH WATCH_LIST |
| ConfigMap | All operations except for CREATE PUT PATCH |
| Pod | All operations except for CREATE PUT PATCH |

Request

POST /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks

Body: Framework

Type: application/json or application/yaml

Description

Create the specified Framework.

Response

| Code | Body | Description |
|:---|:---|:---|
| OK(200) | Framework | Return current Framework. |
| Created(201) | Framework | Return current Framework. |
| Accepted(202) | Framework | Return current Framework. |
| Conflict(409) | Status | The specified Framework already exists. |
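
As a rough illustration of the request above, the following sketch builds (but does not send) the POST request using only the Python standard library. The API server address, the `default` namespace, and the minimal Framework body are hypothetical placeholders, and authentication is omitted; a real Framework needs a complete spec.

```python
import json
import urllib.request

GROUP_VERSION = "frameworkcontroller.microsoft.com/v1"


def frameworks_path(namespace, name=None):
    """Return the collection path, or the item path when a name is given."""
    path = f"/apis/{GROUP_VERSION}/namespaces/{namespace}/frameworks"
    return f"{path}/{name}" if name else path


def build_create_request(api_server, namespace, framework):
    """Build (but do not send) the POST request that creates a Framework."""
    return urllib.request.Request(
        url=api_server + frameworks_path(namespace),
        data=json.dumps(framework).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical, incomplete body used only to show the request shape.
framework = {
    "apiVersion": GROUP_VERSION,
    "kind": "Framework",
    "metadata": {"name": "demo"},
}
request = build_create_request(
    "https://apiserver.example:6443", "default", framework
)
```

Passing the resulting object to `urllib.request.urlopen` would send it; a 409 response then corresponds to the Conflict row in the table.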

Request

PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

{
  "spec": {
    "executionType": "Stop"
  }
}

Type: application/merge-patch+json

Description

Stop the specified Framework:

All running containers of the Framework will be stopped while the object of the Framework is still kept.

Response

| Code | Body | Description |
|:---|:---|:---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
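
To see why the Framework object is kept when it is stopped, it can help to look at how a JSON merge patch is applied. The sketch below is a minimal local implementation of merge-patch semantics (as described in RFC 7396), not the API server's code; the sample Framework fields other than `executionType` are illustrative assumptions.

```python
def merge_patch(target, patch):
    """Apply a JSON merge patch to target and return the merged copy."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target entirely.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in a merge patch removes the key.
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Illustrative Framework fragment (assumed field values).
framework = {
    "metadata": {"name": "demo"},
    "spec": {
        "executionType": "Start",
        "retryPolicy": {"fancyRetryPolicy": False, "maxRetryCount": 0},
    },
}
stop_patch = {"spec": {"executionType": "Stop"}}
stopped = merge_patch(framework, stop_patch)
```

Only `spec.executionType` changes; every other field of the object is carried over untouched.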

Request

DELETE /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Body:

application/json

{
  "propagationPolicy": "Foreground"
}

application/yaml

propagationPolicy: Foreground

Type: application/json or application/yaml

Description

Delete the specified Framework.

Notes:

  • If you need to ensure that at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always, and only, use Foreground Deletion with the body provided above; see Framework Notes. However, kubectl delete does not support specifying Foreground Deletion, at least as of Kubernetes v1.14.2, so you may have to use another supported client.

Response

| Code | Body | Description |
|:---|:---|:---|
| OK(200) | Framework | The specified Framework is deleting.<br>Return current Framework. |
| OK(200) | Status | The specified Framework is deleted. |
| NotFound(404) | Status | The specified Framework is not found. |
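
Since kubectl may not be able to request Foreground Deletion, a plain HTTP client is one option. The sketch below builds the DELETE request with the Python standard library and maps the responses in the table to short state strings. The server address and names are hypothetical, authentication is omitted, and distinguishing the two 200 responses by the `kind` field of the returned body is an assumption based on how Kubernetes objects normally identify themselves.

```python
import json
import urllib.request


def build_foreground_delete(api_server, namespace, name):
    """Build (but do not send) a DELETE request asking for Foreground Deletion."""
    path = (
        "/apis/frameworkcontroller.microsoft.com/v1"
        f"/namespaces/{namespace}/frameworks/{name}"
    )
    body = {"propagationPolicy": "Foreground"}
    return urllib.request.Request(
        url=api_server + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )


def describe_delete_response(status_code, body):
    """Map a (status code, parsed JSON body) pair to a short state string."""
    if status_code == 404:
        return "not found"
    if status_code == 200 and body.get("kind") == "Framework":
        return "deleting"  # object still exists, deletion in progress
    if status_code == 200 and body.get("kind") == "Status":
        return "deleted"
    return "unexpected"


delete_request = build_foreground_delete(
    "https://apiserver.example:6443", "default", "demo"
)
```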

Request

GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

Description

Get the specified Framework.

Response

| Code | Body | Description |
|:---|:---|:---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/frameworks

QueryParameters: Same as StatefulSet QueryParameters

Description

Get all Frameworks (in the specified FrameworkNamespace).

Response

| Code | Body | Description |
|:---|:---|:---|
| OK(200) | FrameworkList | Return all Frameworks (in the specified FrameworkNamespace). |

Request

GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}

QueryParameters: Same as StatefulSet QueryParameters

Description

Watch the change events of the specified Framework.

Response

| Code | Body | Description |
|:---|:---|:---|
| OK(200) | WatchEvent | Streaming the change events of the specified Framework. |
| NotFound(404) | Status | The specified Framework is not found. |

Request

GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/watch/frameworks

QueryParameters: Same as StatefulSet QueryParameters

Description

Watch the change events of all Frameworks (in the specified FrameworkNamespace).

Response

| Code | Body | Description |
|:---|:---|:---|
| OK(200) | WatchEvent | Streaming the change events of all Frameworks (in the specified FrameworkNamespace). |
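
A watch response is a long-lived stream rather than a single document. The sketch below assumes the usual Kubernetes streaming format, one JSON-encoded WatchEvent per line with a `type` and an `object` field, and shows how a client might consume it. The sample lines are made up for illustration.

```python
import json


def iter_watch_events(lines):
    """Yield (event type, Framework name) from newline-delimited WatchEvents."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank keep-alive lines
        event = json.loads(line)
        obj = event.get("object", {})
        yield event["type"], obj.get("metadata", {}).get("name")


# Made-up sample of what a stream could contain.
sample_stream = [
    '{"type": "ADDED", "object": {"kind": "Framework", "metadata": {"name": "demo"}}}',
    "",
    '{"type": "MODIFIED", "object": {"kind": "Framework", "metadata": {"name": "demo"}}}',
    '{"type": "DELETED", "object": {"kind": "Framework", "metadata": {"name": "demo"}}}',
]
events = list(iter_watch_events(sample_stream))
```

In a real client, `lines` would be the response object returned by the HTTP library, iterated line by line as data arrives.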

Container EnvironmentVariable

You can specify how to classify and summarize Pod failures by PodFailureSpec.

You can leverage the Predefined CompletionCode to instruct your RetryPolicy and to identify a certain predefined CompletionCode, regardless of which PodFailureSpec is configured in each cluster.

CompletionStatus: It is generated from Predefined CompletionCode or PodPattern matching. For a Pod, if no PodPattern is matched and a failed Container exists, the CompletionCode is the same as the ExitCode of the last failed Container.

TaskAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a TaskAttempt.

FrameworkAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a FrameworkAttempt.

RetryPolicySpec


Notes:

  1. Italic Conditions can be inherited from the DEFAULT RetryPolicy, so in principle there is no need to specify them explicitly.

    However, you currently still need to specify them explicitly, because Framework Spec Defaulting is not supported yet.

  2. For the definition of each CompletionType, such as Transient Failed, see CompletionStatus.

| FrameworkType | Framework RetryPolicy | TaskRole | Task RetryPolicy | Description |
|:---|:---|:---|:---|:---|
| DEFAULT | FancyRetryPolicy = false<br>MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = 0 | The default RetryPolicy:<br>Never Retry for any Failed or Succeeded. |
| | | TaskRole-B | FancyRetryPolicy = false<br>MaxRetryCount = 0 | |
| Service | FancyRetryPolicy = false<br>MaxRetryCount = -2 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -2 | Always Retry for any Failed or Succeeded. |
| Blind Batch | FancyRetryPolicy = false<br>MaxRetryCount = -1 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -1 | Always Retry for any Failed.<br>Never Retry for Succeeded. |
| Batch with Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 3 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed or Succeeded.<br>Retry up to 3 times for Unknown Failed. |
| Batch without Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = 0 | For Framework RetryPolicy, same as "Batch with Task Fault Tolerance".<br>For Task RetryPolicy, because the Task cannot tolerate any failed TaskAttempt (for example, it cannot recover from a previous failed TaskAttempt), Never Retry the Task for any Failed or Succeeded. |
| Debug Mode | FancyRetryPolicy = true<br>MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 0 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed or Unknown Failed or Succeeded.<br>This can help to capture the unexpected exit of the user application itself. |
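
The retry behavior described in the examples above can be summarized as a small decision function. This is a sketch inferred only from the table rows, not the controller's actual implementation; the function name, the completion type strings, and the `retried_count` parameter (the number of retries already counted against MaxRetryCount) are assumptions made for illustration.

```python
SUCCEEDED = "Succeeded"
TRANSIENT_FAILED = "TransientFailed"
PERMANENT_FAILED = "PermanentFailed"
UNKNOWN_FAILED = "UnknownFailed"


def should_retry(fancy_retry_policy, max_retry_count, completion_type, retried_count=0):
    """Decide whether to retry, following the rows of the example table."""
    if fancy_retry_policy:
        if completion_type == TRANSIENT_FAILED:
            return True   # always retried, regardless of MaxRetryCount
        if completion_type == PERMANENT_FAILED:
            return False  # never retried
    failed = completion_type != SUCCEEDED
    if max_retry_count == -2:
        return True       # retry any completion, Failed or Succeeded
    if max_retry_count == -1:
        return failed     # retry any Failed, never Succeeded
    # MaxRetryCount >= 0: retry a failure only while the budget lasts.
    return failed and retried_count < max_retry_count
```

For example, in Debug Mode (`FancyRetryPolicy = true`, `MaxRetryCount = 0`) the function retries only Transient Failed, so an Unknown Failed exit of the user application is left in place for inspection.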

CompletionPolicySpec


Notes:

  1. Italic Conditions can be inherited from the DEFAULT FrameworkAttemptCompletionPolicy, so in principle there is no need to specify them explicitly.

    However, you currently still need to specify them explicitly, because Framework Spec Defaulting is not supported yet.

| FrameworkType | TaskRole | FrameworkAttemptCompletionPolicy | Description |
|:---|:---|:---|:---|
| DEFAULT | TaskRole-A | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | The default FrameworkAttemptCompletionPolicy:<br>Fail the FrameworkAttempt immediately if any Task failed.<br>Do not succeed the FrameworkAttempt until all Tasks have succeeded. |
| | TaskRole-B | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | |
| Service | TaskRole-A | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Actually, any FrameworkAttemptCompletionPolicy is fine, since a Service's Task will never complete, i.e. its Task's MaxRetryCount is -2; see RetryPolicy Example. |
| MapReduce | Map | MinFailedTaskCount = {Map.TaskNumber} * {mapreduce.map.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | A few failed Tasks are acceptable, but you always want to wait for all Tasks to succeed:<br>Fail the FrameworkAttempt immediately if the number of failed Tasks exceeds the limit.<br>Do not succeed the FrameworkAttempt until all Tasks have completed and the number of failed Tasks is within the limit. |
| | Reduce | MinFailedTaskCount = {Reduce.TaskNumber} * {mapreduce.reduce.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | |
| TensorFlow | ParameterServer | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Succeeding a certain TaskRole is enough, and you do not want to wait for all Tasks to succeed:<br>Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt immediately if all of the Worker's Tasks succeeded. |
| | Worker | MinFailedTaskCount = 1<br>MinSucceededTaskCount = {Worker.TaskNumber} | |
| Arbitrator Dominated | Arbitrator | MinFailedTaskCount = 1<br>MinSucceededTaskCount = 1 | The FrameworkAttemptCompletionPolicy is fully delegated to the single instance arbitrator of the user application:<br>Fail the FrameworkAttempt immediately if the arbitrator failed.<br>Succeed the FrameworkAttempt immediately if the arbitrator succeeded. |
| | TaskRole-A | MinFailedTaskCount = -1<br>MinSucceededTaskCount = -1 | |
| | TaskRole-B | MinFailedTaskCount = -1<br>MinSucceededTaskCount = -1 | |
| First Completed Task Dominated | TaskRole-A | MinFailedTaskCount = 1<br>MinSucceededTaskCount = 1 | The FrameworkAttemptCompletionPolicy is fully delegated to the first completed Task of the user application:<br>Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt immediately if any Task succeeded. |
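
One way to read the examples above is as a per-TaskRole threshold check. The sketch below is inferred from the table only and is not the controller's implementation: it assumes that -1 disables a threshold, that failure thresholds are checked before success thresholds, and that the FrameworkAttempt succeeds once every Task has completed without a failure threshold being reached. The `TaskRoleState` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskRoleState:
    task_number: int               # total Tasks in this TaskRole
    failed: int                    # Tasks completed as Failed
    succeeded: int                 # Tasks completed as Succeeded
    min_failed_task_count: int     # -1 disables the threshold
    min_succeeded_task_count: int  # -1 disables the threshold


def evaluate_attempt(roles):
    """Return "Failed", "Succeeded", or None while the attempt should keep running."""
    for role in roles:
        if 1 <= role.min_failed_task_count <= role.failed:
            return "Failed"
    for role in roles:
        if 1 <= role.min_succeeded_task_count <= role.succeeded:
            return "Succeeded"
    if all(r.failed + r.succeeded == r.task_number for r in roles):
        return "Succeeded"
    return None


# TensorFlow example: all Workers succeeded while ParameterServers still run.
tensorflow = [
    TaskRoleState(task_number=2, failed=0, succeeded=0,
                  min_failed_task_count=1, min_succeeded_task_count=-1),
    TaskRoleState(task_number=4, failed=0, succeeded=4,
                  min_failed_task_count=1, min_succeeded_task_count=4),
]
tensorflow_result = evaluate_attempt(tensorflow)
```

With the DEFAULT policy, the same function reports a failure as soon as one Task fails and reports success only when every Task has succeeded.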

By leveraging LogObjectSnapshot, external systems such as Fluentd and ElasticSearch can collect and process history snapshots of a Framework and its Pods even after they have been retried or deleted, for purposes such as persistence, metrics conversion, visualization, alerting, acting, and analysis.

  1. Usage
  2. Example: FrameworkBarrier Example, TensorFlow Example, etc.
  1. Usage
  2. Example: TensorFlow Example, etc.

Best Practice