This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Support FrameworkBarrier for GangExecution and Add Distributed TensorFlow Training Example #2

Merged 9 commits on Nov 23, 2018
11 changes: 4 additions & 7 deletions README.md
@@ -58,11 +58,8 @@ A Framework represents an application with a set of Tasks:

## Quick Start
1. [Build](build/frameworkcontroller)
-2. [Run Example](example/run/frameworkcontroller)
-3. [Config Usage](pkg/apis/frameworkcontroller/v1/config.go)
-4. [Config Example](example/config)
-5. [Framework Usage](pkg/apis/frameworkcontroller/v1/types.go)
-6. [Framework Example](example/framework)
+2. [Run Example](example/run/frameworkcontroller.md)
+3. [Framework Example](example/framework)

## Doc
1. [User Manual](doc/user-manual.md)
@@ -76,13 +73,13 @@ A specialized wrapper can be built on top of FrameworkController to optimize for
* [NNI Controller Wrapper](https://github.com/Microsoft/nni)(Developing): A wrapper client optimized for AutoML applications

## Official Image
-[FrameworkController DockerHub](https://hub.docker.com/u/frameworkcontroller)
+* [DockerHub](https://hub.docker.com/u/frameworkcontroller)

## Related Project
* [YARN FrameworkLauncher](https://github.com/Microsoft/pai/blob/master/subprojects/frameworklauncher/yarn): Similar offering natively supports [Apache YARN](http://hadoop.apache.org)

## Contributing
-This project welcomes contributions and suggestions. Most contributions require you to agree to a
+This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.

43 changes: 43 additions & 0 deletions bin/frameworkbarrier/start.sh
@@ -0,0 +1,43 @@
#!/bin/bash

# MIT License
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE

set -o errexit
set -o nounset
set -o pipefail

BASH_DIR=$(cd $(dirname ${BASH_SOURCE}) && pwd)

cd ${BASH_DIR}

./frameworkbarrier

MNT_DIR=/mnt/frameworkbarrier

mkdir -p ${MNT_DIR}

cp -r ./framework.json ${MNT_DIR}
cp -r ./injector.sh ${MNT_DIR}

echo Succeeded to copy current Framework helper files into ${MNT_DIR}:
cd ${MNT_DIR} && ls -lR .
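
Because `start.sh` sets `errexit`, the helper files are copied only when `./frameworkbarrier` exits with status 0. The following is a minimal sketch of that gating, not the project's code: `run_then_publish`, its arguments, and the `marker` file are hypothetical stand-ins for the real binary and `/mnt/frameworkbarrier`.

```shell
#!/bin/bash
# Sketch only: mimics the control flow of start.sh with stand-in commands.
# run_then_publish, its arguments, and the "marker" file are hypothetical.

run_then_publish() {
  local barrier_cmd="$1"   # stand-in for ./frameworkbarrier
  local mnt_dir="$2"       # stand-in for /mnt/frameworkbarrier

  # Run the sequence in a fresh bash process so that errexit is honored
  # even when the caller invokes this function inside an `if` or `||` list.
  bash -c '
    set -o errexit
    set -o nounset
    set -o pipefail

    "$1"                            # a non-zero exit stops the script here
    mkdir -p "$2"                   # reached only if the barrier succeeded
    echo published > "$2/marker"
  ' _ "${barrier_cmd}" "${mnt_dir}"
}
```

Calling `run_then_publish true /tmp/demo` creates `/tmp/demo/marker`, while `run_then_publish false /tmp/demo2` returns a non-zero status and leaves `/tmp/demo2` uncreated.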
43 changes: 43 additions & 0 deletions build/frameworkbarrier/Dockerfile
@@ -0,0 +1,43 @@
# MIT License
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE

FROM golang:alpine as builder

ENV PROJECT_DIR=${GOPATH}/src/github.com/microsoft/frameworkcontroller
ENV INSTALL_DIR=/opt/frameworkcontroller/frameworkbarrier

RUN apk update && apk add --no-cache bash && \
  mkdir -p ${PROJECT_DIR} ${INSTALL_DIR}
COPY . ${PROJECT_DIR}
RUN ${PROJECT_DIR}/build/frameworkbarrier/go-build.sh && \
  mv ${PROJECT_DIR}/dist/frameworkbarrier/* ${INSTALL_DIR}


FROM alpine:latest

ENV INSTALL_DIR=/opt/frameworkcontroller/frameworkbarrier

RUN apk update && apk add --no-cache bash
COPY --from=builder ${INSTALL_DIR} ${INSTALL_DIR}
WORKDIR ${INSTALL_DIR}

ENTRYPOINT ["./start.sh"]
37 changes: 37 additions & 0 deletions build/frameworkbarrier/docker-build.sh
@@ -0,0 +1,37 @@
#!/bin/bash

# MIT License
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE

set -o errexit
set -o nounset
set -o pipefail

BASH_DIR=$(cd $(dirname ${BASH_SOURCE}) && pwd)
PROJECT_DIR=${BASH_DIR}/../..
IMAGE_NAME=frameworkbarrier

cd ${PROJECT_DIR}

docker build -t ${IMAGE_NAME} -f ${BASH_DIR}/Dockerfile .

echo Succeeded to build docker image ${IMAGE_NAME}
44 changes: 44 additions & 0 deletions build/frameworkbarrier/go-build.sh
@@ -0,0 +1,44 @@
#!/bin/bash

# MIT License
#
# Copyright (c) Microsoft Corporation. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE

set -o errexit
set -o nounset
set -o pipefail

BASH_DIR=$(cd $(dirname ${BASH_SOURCE}) && pwd)
# Ensure ${PROJECT_DIR} is ${GOPATH}/src/github.com/microsoft/frameworkcontroller
PROJECT_DIR=${BASH_DIR}/../..
DIST_DIR=${PROJECT_DIR}/dist/frameworkbarrier

cd ${PROJECT_DIR}

rm -rf ${DIST_DIR}
mkdir -p ${DIST_DIR}

go build -o ${DIST_DIR}/frameworkbarrier cmd/frameworkbarrier/*
chmod a+x ${DIST_DIR}/frameworkbarrier
cp -r bin/frameworkbarrier/* ${DIST_DIR}

echo Succeeded to build binary distribution into ${DIST_DIR}:
cd ${DIST_DIR} && ls -lR .
19 changes: 14 additions & 5 deletions build/frameworkcontroller/Dockerfile
@@ -20,15 +20,24 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE

-FROM golang:alpine
+FROM golang:alpine as builder

ENV PROJECT_DIR=${GOPATH}/src/github.com/microsoft/frameworkcontroller
+ENV INSTALL_DIR=/opt/frameworkcontroller/frameworkcontroller

-RUN apk update && apk add bash && mkdir -p ${PROJECT_DIR}
+RUN apk update && apk add --no-cache bash && \
+  mkdir -p ${PROJECT_DIR} ${INSTALL_DIR}
COPY . ${PROJECT_DIR}
-WORKDIR ${PROJECT_DIR}
+RUN ${PROJECT_DIR}/build/frameworkcontroller/go-build.sh && \
+  mv ${PROJECT_DIR}/dist/frameworkcontroller/* ${INSTALL_DIR}

-RUN ./build/frameworkcontroller/go-build.sh
-WORKDIR ./dist/frameworkcontroller
+
+FROM alpine:latest
+
+ENV INSTALL_DIR=/opt/frameworkcontroller/frameworkcontroller
+
+RUN apk update && apk add --no-cache bash
+COPY --from=builder ${INSTALL_DIR} ${INSTALL_DIR}
+WORKDIR ${INSTALL_DIR}

ENTRYPOINT ["./start.sh"]
2 changes: 1 addition & 1 deletion build/frameworkcontroller/go-build.sh
@@ -39,7 +39,7 @@ mkdir -p ${DIST_DIR}
go build -o ${DIST_DIR}/frameworkcontroller cmd/frameworkcontroller/*
chmod a+x ${DIST_DIR}/frameworkcontroller
cp -r bin/frameworkcontroller/* ${DIST_DIR}
-cp -r example/config/default/* ${DIST_DIR}
+cp -r example/config/default/frameworkcontroller.yaml ${DIST_DIR}

echo Succeeded to build binary distribution into ${DIST_DIR}:
cd ${DIST_DIR} && ls -lR .
36 changes: 36 additions & 0 deletions cmd/frameworkbarrier/main.go
@@ -0,0 +1,36 @@
// MIT License
//
// Copyright (c) Microsoft Corporation. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE

package main

import (
	"github.com/microsoft/frameworkcontroller/pkg/common"
	"github.com/microsoft/frameworkcontroller/pkg/barrier"
)

func init() {
	common.InitAll()
}

func main() {
	barrier.NewFrameworkBarrier().Run()
}
2 changes: 1 addition & 1 deletion cmd/frameworkcontroller/main.go
@@ -38,7 +38,7 @@ func main() {
	stopCh := make(chan struct{})
	defer close(stopCh)

-	go controller.NewQueueFrameworkController().Run(stopCh)
+	go controller.NewFrameworkController().Run(stopCh)

	sigTerm := make(chan os.Signal, 1)
	signal.Notify(sigTerm, syscall.SIGTERM)
1 change: 0 additions & 1 deletion doc/known-issue-and-upcoming-feature.md
@@ -13,7 +13,6 @@
Tracked in [Dashboard errors if pod's owner reference is not supported](https://github.com/kubernetes/dashboard/issues/3251)

## <a name="UpcomingFeature">Upcoming Feature</a>
-- [ ] Add Distributed TensorFlow Training Example
- [ ] Support Framework Spec Update
- [ ] Support Framework Spec Validation and Defaulting
- [ ] Support Framework Status Subresource
9 changes: 8 additions & 1 deletion doc/user-manual.md
@@ -6,6 +6,7 @@
- [CompletionCode Convention](#CompletionCodeConvention)
- [RetryPolicy](#RetryPolicy)
- [FrameworkAttemptCompletionPolicy](#FrameworkAttemptCompletionPolicy)
- [Controller Extension](#ControllerExtension)
- [Best Practice](#BestPractice)

## <a name="FrameworkInterop">Framework Interop</a>
@@ -146,7 +147,7 @@ Watch the change events of all Frameworks (in the specified FrameworkNamespace).
[Container EnvironmentVariable](../pkg/apis/frameworkcontroller/v1/constants.go)

## <a name="CompletionCodeConvention">CompletionCode Convention</a>
-[CompletionCode Convention](../pkg/apis/frameworkcontroller/v1/types.go)
+[CompletionCode Convention](../pkg/apis/frameworkcontroller/v1/constants.go)

## <a name="RetryPolicy">RetryPolicy</a>
### <a name="RetryPolicy_Spec">Spec</a>
@@ -301,5 +302,11 @@ Notes:
</tbody>
</table>

## <a name="ControllerExtension">Controller Extension</a>
### <a name="FrameworkBarrier">FrameworkBarrier</a>
1. [Usage](../pkg/barrier/barrier.go)
2. [Build](../build/frameworkbarrier)
3. Example: [FrameworkBarrier Example](../example/framework/extension/frameworkbarrier.yaml), [Tensorflow Example](../example/framework/scenario/tensorflow), [etc](../example/framework/scenario).
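
As a rough sketch of how an application container might consume what the barrier publishes: `bin/frameworkbarrier/start.sh` in this PR copies `injector.sh` and `framework.json` into `/mnt/frameworkbarrier`, so a container sharing that directory could source `injector.sh` before starting its workload. The helper name `load_barrier_env`, the shared-directory setup, and the expectation that `injector.sh` exports environment variables are assumptions made for illustration; `pkg/barrier/barrier.go` remains the reference for the actual contract.

```shell
#!/bin/bash
# Sketch only: load_barrier_env is a hypothetical helper, not part of the PR.
# It assumes the directory written by start.sh is visible to this container.

load_barrier_env() {
  local mnt_dir="${1:-/mnt/frameworkbarrier}"
  local injector="${mnt_dir}/injector.sh"

  # Fail early if the barrier has not published its helper script.
  if [ ! -f "${injector}" ]; then
    echo "injector not found: ${injector}" >&2
    return 1
  fi

  # Source into the current shell so that any variables the script exports
  # are visible to whatever command is started afterwards.
  # shellcheck disable=SC1090
  . "${injector}"
}
```

A container command could then call `load_barrier_env` with no argument to use the default directory, and launch the training process only if the call succeeds.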

## <a name="BestPractice">Best Practice</a>
[Best Practice](../pkg/apis/frameworkcontroller/v1/types.go)
44 changes: 22 additions & 22 deletions example/framework/basic/batchfailedpermanent.yaml
@@ -10,25 +10,25 @@ spec:
    fancyRetryPolicy: true
    maxRetryCount: 1
  taskRoles:
  - name: worker
    taskNumber: 1
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: true
        maxRetryCount: 1
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: ubuntu
            image: ubuntu:trusty
            # See CompletionCode Convention in
            # ./pkg/apis/frameworkcontroller/v1/constants.go
            command: [
              "sh", "-c",
              "sleep 10 &&
              echo exit with permanent failure to tell controller not to retry &&
              exit 210"]