Evaluation notebooks

We provide several examples of how you can use the rapid evaluation SDK to perform evaluations on your generative AI models.

Evaluate your models in real time

The Vertex AI rapid evaluation service lets you evaluate your generative AI models in real time. To learn how to use rapid evaluation, see Run a rapid evaluation.

For an end-to-end example, see the Colab notebook for the Vertex AI SDK for Python with rapid evaluation.
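
The general shape of a rapid evaluation run is sketched below, assuming the preview EvalTask API: you wrap a pandas DataFrame and a list of metrics in an evaluation task, then call evaluate() with a Gemini model. The project ID, dataset contents, column names, and metric choices are placeholders for illustration.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# A small evaluation dataset: one prompt plus a ground-truth reference.
eval_dataset = pd.DataFrame({
    "content": ["Summarize: The quick brown fox jumps over the lazy dog."],
    "reference": ["A fox jumps over a dog."],
})

# Bundle the dataset and metrics into a reusable evaluation task.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum", "fluency", "coherence"],
    experiment="rapid-eval-example",
)

# Generate responses with Gemini Pro and score them in one call.
result = eval_task.evaluate(model=GenerativeModel("gemini-pro"))
print(result.summary_metrics)  # aggregate scores across the dataset
print(result.metrics_table)    # per-example scores
```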

Evaluate and optimize prompt template design

Use the rapid evaluation SDK to evaluate the effect of prompt engineering. Examine the statistics corresponding to each prompt template to understand how differences in prompting impact evaluation results.

For an end-to-end example, see the notebook Evaluate and Optimize Prompt Template Design for Better Results.
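
The comparison typically follows the pattern sketched below: each candidate prompt template gets its own evaluate() run on the same EvalTask, so the summary metrics can be compared side by side. The dataset, template wording, and run names are illustrative.

```python
import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

eval_dataset = pd.DataFrame({
    "question": ["What is the capital of France?"],
    "reference": ["Paris"],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum"],
    experiment="prompt-template-comparison",
)

prompt_templates = [
    "Answer the question: {question}",
    "You are a concise assistant. Answer in one word: {question}",
]

for i, template in enumerate(prompt_templates):
    # One experiment run per template keeps the results comparable.
    result = eval_task.evaluate(
        model=GenerativeModel("gemini-pro"),
        prompt_template=template,
        experiment_run_name=f"template-{i}",
    )
    print(template, "->", result.summary_metrics)
```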

Evaluate and select LLM models using benchmark metrics

Use the rapid evaluation SDK to score both the Gemini Pro and Text Bison models on a benchmark dataset for a given task.

For an end-to-end example, see the notebook Score and Select LLM Models.
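
One way to structure the comparison, sketched below, is to generate responses from each model yourself and let the SDK compute benchmark metrics over a response column; this bring-your-own-response pattern sidesteps differences between the Gemini and PaLM model classes. Model versions, prompts, and metric choices are illustrative.

```python
import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextGenerationModel
from vertexai.preview.evaluation import EvalTask

prompts = ["Translate to French: Good morning."]
references = ["Bonjour."]

gemini = GenerativeModel("gemini-pro")
text_bison = TextGenerationModel.from_pretrained("text-bison@002")

candidates = {
    "gemini-pro": lambda p: gemini.generate_content(p).text,
    "text-bison": lambda p: text_bison.predict(p).text,
}

for name, generate in candidates.items():
    # Precompute responses for each model, then score them with benchmark metrics.
    dataset = pd.DataFrame({
        "content": prompts,
        "reference": references,
        "response": [generate(p) for p in prompts],
    })
    result = EvalTask(
        dataset=dataset,
        metrics=["bleu", "rouge_l_sum"],
        experiment="model-comparison",
    ).evaluate(experiment_run_name=f"{name}-run")
    print(name, result.summary_metrics)
```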

Evaluate and select model-generation settings

Use the rapid evaluation SDK to adjust the temperature of Gemini Pro on a summarization task and to evaluate quality, fluency, safety, and verbosity.

For an end-to-end example, see the notebook Evaluate and Select Model Generation Settings.
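
A sketch of the idea: sweep over temperature values, build a GenerationConfig for each run, and compare the summary metrics across runs. The article text, reference summary, and metric names are illustrative; the notebook shows which dataset columns each model-based metric expects.

```python
import pandas as pd
from vertexai.generative_models import GenerationConfig, GenerativeModel
from vertexai.preview.evaluation import EvalTask

article = (
    "The city council voted on Tuesday to expand the bike-lane network "
    "after a year-long pilot reduced downtown congestion."
)
eval_dataset = pd.DataFrame({
    "instruction": ["Summarize the text in one sentence."],
    "context": [article],
    "reference": ["The council approved more bike lanes after a successful pilot."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality", "fluency", "safety", "summarization_verbosity"],
    experiment="generation-settings-sweep",
)

for temperature in (0.0, 0.4, 0.8):
    model = GenerativeModel(
        "gemini-pro",
        generation_config=GenerationConfig(temperature=temperature),
    )
    # One run per temperature setting; compare summary metrics across runs.
    result = eval_task.evaluate(
        model=model,
        prompt_template="{instruction}\n{context}",
        experiment_run_name=f"temp-{str(temperature).replace('.', '-')}",
    )
    print(temperature, result.summary_metrics)
```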

Define your metrics

Use the rapid evaluation SDK to evaluate multiple prompt templates with your own custom-defined metrics.

For an end-to-end example, see the notebook Define Your Own Metrics.
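
A minimal sketch, assuming the SDK's CustomMetric class: you provide a Python function that receives one evaluated row (including the generated response) and returns a score keyed by the metric name, and you can mix it with built-in metrics. The brevity heuristic here is just an illustrative stand-in.

```python
import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import CustomMetric, EvalTask

def response_brevity(instance: dict) -> dict:
    # Score 1.0 for responses of 20 words or fewer, otherwise 0.0.
    word_count = len(instance["response"].split())
    return {"response_brevity": 1.0 if word_count <= 20 else 0.0}

brevity_metric = CustomMetric(name="response_brevity", metric_function=response_brevity)

eval_dataset = pd.DataFrame({
    "content": ["Explain what a hash map is."],
    "reference": ["A hash map stores key-value pairs using a hash function."],
})

result = EvalTask(
    dataset=eval_dataset,
    metrics=[brevity_metric, "rouge_l_sum"],
    experiment="custom-metric-example",
).evaluate(model=GenerativeModel("gemini-pro"))
print(result.summary_metrics)
```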

Evaluate tool use and function calling

Use the rapid evaluation SDK to define an API function and a tool for the Gemini model. You can also use the SDK to evaluate tool use and function-calling quality for Gemini.

For an end-to-end example, see the notebook Evaluate Generative Model Tool Use and Function Calling.
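
The setup typically looks like the sketch below: a FunctionDeclaration and Tool are attached to the Gemini model, and the evaluation dataset pairs each predicted function call with a reference call so the tool-use metrics can score them. The function schema, the JSON shape of the serialized calls, and the metric names are assumptions to verify against the notebook.

```python
import json

import pandas as pd
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Tool
from vertexai.preview.evaluation import EvalTask

# Declare an API function and expose it to Gemini as a tool.
get_weather = FunctionDeclaration(
    name="get_current_weather",
    description="Get the current weather for a location.",
    parameters={
        "type": "object",
        "properties": {"location": {"type": "string"}},
    },
)
model = GenerativeModel(
    "gemini-pro",
    tools=[Tool(function_declarations=[get_weather])],
)

# Each row pairs a predicted tool call (normally produced by the model above)
# with the reference call, both serialized as JSON strings.
reference_call = json.dumps({
    "content": "",
    "tool_calls": [{"name": "get_current_weather",
                    "arguments": {"location": "Boston"}}],
})
eval_dataset = pd.DataFrame({
    "response": [reference_call],
    "reference": [reference_call],
})

result = EvalTask(
    dataset=eval_dataset,
    metrics=["tool_call_valid", "tool_name_match", "tool_parameter_kv_match"],
    experiment="tool-use-eval",
).evaluate()
print(result.summary_metrics)
```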

Evaluate generated answers from RAG for question answering

Use the rapid evaluation SDK to evaluate answers generated by Retrieval-Augmented Generation (RAG) for a question-answering task.

For an end-to-end example, see the notebook Evaluate Generated Answers from RAG for Question Answering.
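
A sketch of the bring-your-own-response pattern for RAG: the retrieved context, the RAG pipeline's generated answer, and a reference answer go into the dataset, and model-based question-answering metrics score the answers. The column names and metric selection are illustrative; the notebook shows the exact columns each metric expects.

```python
import pandas as pd
from vertexai.preview.evaluation import EvalTask

eval_dataset = pd.DataFrame({
    "instruction": ["Answer the question using only the provided context."],
    "context": ["Vertex AI is Google Cloud's managed machine learning platform."],
    "content": ["What is Vertex AI?"],
    # Answers produced by your RAG pipeline, plus ground-truth references.
    "response": ["Vertex AI is Google Cloud's managed ML platform."],
    "reference": ["A managed machine learning platform on Google Cloud."],
})

result = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "question_answering_quality",
        "question_answering_relevance",
        "groundedness",
    ],
    experiment="rag-answer-eval",
).evaluate()
print(result.metrics_table)
```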

Evaluate an LLM in Vertex AI Model Registry against a third-party model

Use AutoSxS to evaluate responses between two models and determine a winner. You can either provide the responses or generate them using Vertex AI Batch Predictions.

For an end-to-end example, see the notebook Evaluate an LLM in Vertex AI Model Registry against a third-party model.
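
AutoSxS runs as a Vertex AI Pipelines job rather than through the rapid evaluation SDK. The sketch below assumes both models' responses were generated ahead of time and stored in a JSONL evaluation dataset; the template URI, task name, and parameter names follow the AutoSxS documentation at the time of writing and should be checked against the notebook.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

parameters = {
    "evaluation_dataset": "gs://your-bucket/autosxs/eval_dataset.jsonl",
    "id_columns": ["question_id"],
    "task": "question_answering",
    "autorater_prompt_parameters": {
        "inference_context": {"column": "context"},
        "inference_instruction": {"column": "question"},
    },
    # Pre-generated responses; alternatively, point AutoSxS at the models so it
    # generates responses with Vertex AI batch prediction.
    "response_column_a": "model_a_response",
    "response_column_b": "model_b_response",
}

job = aiplatform.PipelineJob(
    display_name="autosxs-third-party-comparison",
    pipeline_root="gs://your-bucket/autosxs/pipeline_root",
    template_path=(
        "https://us-kfp.pkg.dev/ml-pipeline/"
        "google-cloud-registry/autosxs-template/default"
    ),
    parameter_values=parameters,
)
job.run()
```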

Check autorater alignment against a human-preference dataset

Use AutoSxS to check how well autorater ratings align with a set of human ratings you provide for a particular task. Determine if AutoSxS is sufficient for your use case, or if it needs further customization.

For an end-to-end example, see the notebook Check autorater alignment against a human-preference dataset.
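
Checking alignment reuses the same AutoSxS pipeline; the main addition is a column of human-preference labels that the pipeline compares against the autorater's judgments. The sketch below shows only the parameter values that change relative to the previous example, and the human_preference_column parameter name is an assumption to verify against the notebook.

```python
# Same PipelineJob setup as the previous AutoSxS sketch, with the evaluation
# dataset extended by a column of human preference labels ("A" or "B").
parameters = {
    "evaluation_dataset": "gs://your-bucket/autosxs/eval_with_human_prefs.jsonl",
    "id_columns": ["question_id"],
    "task": "question_answering",
    "autorater_prompt_parameters": {
        "inference_context": {"column": "context"},
        "inference_instruction": {"column": "question"},
    },
    "response_column_a": "model_a_response",
    "response_column_b": "model_b_response",
    # Human labels let the pipeline report how often the autorater agrees
    # with your raters.
    "human_preference_column": "human_preference",
}
```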

What's next