If you're writing your own training code instead of using AutoML, there are several custom training options to consider. This page provides a brief overview and comparison of the different ways you can run custom training on Vertex AI.
Custom training resources on Vertex AI
There are three types of Vertex AI resources you can create to train custom models on Vertex AI:
- Custom jobs
- Hyperparameter tuning jobs
- Training pipelines
When you create a custom job, you specify settings that Vertex AI needs to run your training code, including:
- One worker pool for single-node training (`WorkerPoolSpec`), or multiple worker pools for distributed training
- Optional settings for configuring job scheduling (`Scheduling`), setting certain environment variables for your training code, using a custom service account, and using VPC Network Peering
Within the worker pool(s), you can specify the following settings:
- Machine types and accelerators
- Configuration of what type of training code the worker pool runs: either a Python training application (`PythonPackageSpec`) or a custom container (`ContainerSpec`)
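As a sketch, a single worker pool might be specified as follows. The field names follow the shape of the Vertex AI `CustomJob` API; the machine type, accelerator, container image URI, and arguments are placeholders, not real values:

```python
# Illustrative worker pool spec for single-node custom training.
# The machine type, accelerator, and image URI are placeholders.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",        # placeholder machine type
            "accelerator_type": "NVIDIA_TESLA_T4",  # optional accelerator
            "accelerator_count": 1,
        },
        "replica_count": 1,  # single-node training: one replica in one pool
        "container_spec": {  # ContainerSpec: run a custom container
            "image_uri": "gcr.io/my-project/my-trainer:latest",  # placeholder
            "args": ["--epochs", "10"],
        },
    }
]
```

A `python_package_spec` could appear in place of `container_spec` if the worker pool runs a Python training application instead of a custom container.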
Hyperparameter tuning jobs have additional settings to configure, such as the metric to optimize and the hyperparameters to search over. Learn more about hyperparameter tuning.
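For illustration, the study configuration of a hyperparameter tuning job might look like the following sketch, written in the shape of the Vertex AI study spec. The metric ID and parameter bounds are placeholders; the metric must match one your training code reports:

```python
# Illustrative study configuration for a hyperparameter tuning job.
# The metric ID and parameter range are placeholders.
study_spec = {
    "metrics": [
        # The metric your training code reports, and whether to
        # maximize or minimize it.
        {"metric_id": "accuracy", "goal": "MAXIMIZE"},
    ],
    "parameters": [
        {
            "parameter_id": "learning_rate",
            "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
            "scale_type": "UNIT_LOG_SCALE",  # search the range on a log scale
        },
    ],
}
```

Each trial runs your training code with one candidate value per parameter, and the service uses the reported metric to pick the next candidates.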
A training pipeline orchestrates custom training jobs or hyperparameter tuning jobs with additional steps, such as loading a dataset or uploading the model to Vertex AI after the training job is successfully completed.
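A minimal sketch of a training pipeline resource, assuming the custom-task training pipeline shape; the display names are placeholders, and the worker pool specs are the same settings described above for custom jobs:

```python
# Illustrative training pipeline that wraps a custom job and
# optionally uploads the resulting model. Display names are placeholders.
training_pipeline = {
    "display_name": "my-training-pipeline",  # placeholder name
    # Schema identifying this as a custom training task.
    "training_task_definition": (
        "gs://google-cloud-aiplatform/schema/trainingjob/definition/"
        "custom_task_1.0.0.yaml"
    ),
    "training_task_inputs": {
        # Same worker pool settings as a standalone custom job.
        "worker_pool_specs": [...],
    },
    # Optional: upload the trained model to Vertex AI when training succeeds.
    "model_to_upload": {
        "display_name": "my-model",  # placeholder name
    },
}
```

The pipeline runs the training job and, because `model_to_upload` is set, registers the model only after the job completes successfully.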
View custom training resources in the console
To view existing training pipelines in your project, go to the Training Pipelines page in the Vertex AI section of the Google Cloud console.
To view existing custom jobs in your project, go to the Custom jobs page.
To view existing hyperparameter tuning jobs in your project, go to the Hyperparameter tuning page.
Prebuilt and custom containers
Before you submit a custom training job, hyperparameter tuning job, or a training pipeline to Vertex AI, you need to create a Python training application or a custom container to define the training code and dependencies you want to run on Vertex AI. If you create a Python training application using TensorFlow, PyTorch, scikit-learn, or XGBoost, you can use our prebuilt containers to run your code. If you're not sure which of these options to choose, refer to the training code requirements to learn more.
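To illustrate the prebuilt-container route, a Python training application is typically described with a `PythonPackageSpec` like the sketch below. The prebuilt image URI, Cloud Storage paths, and module name are placeholders; check the list of prebuilt containers for the image that matches your framework and version:

```python
# Illustrative PythonPackageSpec for running a Python training
# application on a prebuilt container. All URIs are placeholders.
python_package_spec = {
    # Prebuilt training container for your framework (placeholder URI).
    "executor_image_uri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
    # Your packaged training code, staged in Cloud Storage (placeholder path).
    "package_uris": ["gs://my-bucket/trainer-0.1.tar.gz"],
    # The module the container executes as the entry point.
    "python_module": "trainer.task",
    "args": ["--epochs", "10"],
}
```

With a custom container, you would instead supply a `ContainerSpec` whose `image_uri` points at an image that already bundles your code and dependencies.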
Distributed training
You can configure a custom training job, hyperparameter tuning job, or a training pipeline for distributed training by specifying multiple worker pools:
- Use your first worker pool to configure your primary replica, and set the replica count to 1.
- Add more worker pools to configure worker replicas, parameter server replicas, or evaluator replicas, if your machine learning framework supports these additional cluster tasks for distributed training.
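The two steps above can be sketched as a list of worker pool specs; machine types and the image URI are placeholders:

```python
# Illustrative worker pool layout for distributed training.
# Pool 0 holds the primary replica; pool 1 holds the worker replicas.
worker_pool_specs = [
    {   # first worker pool: the primary replica, count must be 1
        "machine_spec": {"machine_type": "n1-standard-8"},  # placeholder
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    },
    {   # second worker pool: additional worker replicas
        "machine_spec": {"machine_type": "n1-standard-8"},  # placeholder
        "replica_count": 3,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    },
]
```

Further pools for parameter servers or evaluators follow the same pattern, if your framework uses those cluster tasks.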
Learn more about using distributed training.
What's next
- Learn how to create a persistent resource to run custom training jobs.
- See Create custom training jobs to learn how to run your custom training applications on Vertex AI.
- See Create training pipelines to learn how to orchestrate custom training jobs with additional steps.
- See Use hyperparameter tuning to learn about hyperparameter tuning searches.