CS230 Project - Distributed Job Scheduling System in Machine Learning Clusters

Setup

For scheduler and worker

Install conda.
Run `conda env create --prefix cs230 -f configuration.yaml` in the worker directory.
Activate the environment.
Install the common library by running `pip install -e .` in the common directory.
Install torch and torchvision with pip.

For FTP server and RabbitMQ broker

Deploy both with Docker, using the following images from Docker Hub:

rabbitmq:3.12-management

garethflowers/ftp-server
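
Once both containers are running, it can help to confirm that the scheduler and worker machines can actually reach them before launching anything. The snippet below is a minimal sketch using only the Python standard library; the helper name `is_reachable` is hypothetical and not part of this repository, and the host and port values in the comments are the sample values from `config.json`, which may differ in your deployment.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only checks that something accepts connections on the port;
    it does not verify that the service is RabbitMQ or an FTP server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example usage (values taken from the sample config.json, adjust as needed):
#   is_reachable("18.119.97.104", 5673)   # RabbitMQ broker
#   is_reachable("169.234.56.23", 21)     # FTP server
```

A successful TCP connection only shows that the port is open from the current machine; credentials and queue or topic setup still need to be checked separately.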

For users

Install the common library by running `pip install -e .` in the common directory.

`config.json`

The RabbitMQ broker, FTP server, and the GPU capacity of each worker should be set up in the worker/config.json file.

    "broker" : {
        "broker_host": "18.119.97.104",
        "broker_port": "5673",
        "topics": {
            "broker_scheduling_topic": "node_1_scheduling"
        }
    },
    ...
    "ftp" : {
        "ftp_host": "169.234.56.23",
        "ftp_port": "21"
    },
    "workers": {
        "1": {
            "GPU": 8589934592
        },
        "2": {
            "GPU": 2147483648
        },
        "3": {
            "GPU": 0
        }
    },
    ...
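
To illustrate how these values might be read, here is a minimal sketch that parses a fragment with the same keys as above. It rests on several assumptions not stated in this README: that the fragment sits inside a single top-level JSON object, that the `GPU` values are memory sizes in bytes (8589934592 is 8 × 2^30 and 2147483648 is 2 × 2^30), and that the elided `...` sections can be ignored here. The helper `gpu_capacity_gib` is hypothetical and not part of the repository.

```python
import json

# Sample fragment mirroring the keys shown above; the real file has more
# entries where the README shows "...".
SAMPLE = """
{
    "broker": {
        "broker_host": "18.119.97.104",
        "broker_port": "5673",
        "topics": {"broker_scheduling_topic": "node_1_scheduling"}
    },
    "ftp": {"ftp_host": "169.234.56.23", "ftp_port": "21"},
    "workers": {
        "1": {"GPU": 8589934592},
        "2": {"GPU": 2147483648},
        "3": {"GPU": 0}
    }
}
"""


def gpu_capacity_gib(config: dict, worker_id: str) -> float:
    """Return a worker's GPU capacity in GiB, assuming the value is in bytes."""
    return config["workers"][worker_id]["GPU"] / 2**30


config = json.loads(SAMPLE)

# Ports are stored as strings in the sample, so convert before use.
broker_port = int(config["broker"]["broker_port"])
ftp_port = int(config["ftp"]["ftp_port"])
```

Worker IDs are string keys in the JSON, which matches the ID passed on the command line when a worker is launched. Under the bytes assumption, worker 3 with a capacity of 0 would be a worker without GPU memory.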

Launch

Scheduler

Three scheduling algorithms are available. Pass one of them as the argument:

python scheduler.py [next-available, round-robin, priority-based]

e.g.

python scheduler.py next-available
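
This README only names the three algorithms, so the following is a generic sketch of what policies with these names commonly do, not the code in scheduler.py. All class, field, and function names here (`Worker`, `Task`, `gpu_required`, `priority`, `next_available`, `RoundRobin`, `PendingTasks`) are hypothetical, and the GPU capacities are assumed to be in bytes as in the `config.json` sample.

```python
import heapq
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Worker:
    worker_id: str
    gpu_capacity: int       # assumed to be bytes
    busy: bool = False


@dataclass
class Task:
    name: str
    gpu_required: int = 0   # hypothetical field
    priority: int = 0       # hypothetical field; lower value runs first


def fits(worker: Worker, task: Task) -> bool:
    """A worker can take a task if it is idle and has enough GPU memory."""
    return not worker.busy and worker.gpu_capacity >= task.gpu_required


def next_available(workers: List[Worker], task: Task) -> Optional[Worker]:
    """Next-available: pick the first worker, in list order, that fits."""
    for worker in workers:
        if fits(worker, task):
            return worker
    return None


class RoundRobin:
    """Round-robin: start each search after the previously chosen worker."""

    def __init__(self, workers: List[Worker]) -> None:
        self.workers = list(workers)
        self.next_index = 0

    def pick(self, task: Task) -> Optional[Worker]:
        n = len(self.workers)
        for offset in range(n):
            i = (self.next_index + offset) % n
            if fits(self.workers[i], task):
                self.next_index = (i + 1) % n
                return self.workers[i]
        return None


class PendingTasks:
    """Priority-based: pop the most urgent task; ties keep submission order."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = count()  # tie-breaker so Task objects are never compared

    def push(self, task: Task) -> None:
        heapq.heappush(self._heap, (task.priority, next(self._seq), task))

    def pop(self) -> Task:
        return heapq.heappop(self._heap)[2]
```

In this sketch, next-available and round-robin decide which worker receives a task, while the priority-based policy decides which pending task is dispatched first. The actual project may combine or define these differently.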

Worker

python daemon.py [worker_id]

e.g.

python daemon.py 1

Workflow

See tester.py for an example of how a user sends a new task request to the scheduler and retrieves the results.
