
Metrics

Metrics define how model performance is measured and evaluated.

Custom Logic

Define custom evaluation functions that match your specific business requirements.

Flexible Inputs

Accept any input format and compare against expected outputs flexibly.

Aggregation Support

Aggregate individual scores across datasets for comprehensive evaluation.

Optimization Ready

Use metrics directly with Tune for automatic prompt optimization.

Creating Metrics

Choose from four available options when creating metrics:

Auto

How to use: Select fields from your dataset schema to create exact match metrics; a rough Python equivalent is sketched after the requirements below.

Required:
  • A dataset with a defined schema containing the fields you want to compare
  • Select the specific fields by clicking the “Select” button next to each field name
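For intuition, an exact match metric built this way scores 1.0 when the selected field in the model output equals the corresponding field in the expected output, and 0.0 otherwise. The sketch below assumes dictionary inputs and a hypothetical 'answer' field:

  # Illustrative only: what an exact match metric computes for one selected field
  def exact_match(output, expected):
      # Score 0.0 when either side is missing
      if output is None or expected is None:
          return 0.0
      return 1.0 if output.get("answer") == expected.get("answer") else 0.0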

Code

How to use: Write Python code using the provided template; a minimal sketch follows the requirements below.

Required:
  • Define a function called metric_func(output, expected) that returns a float value (typically 0.0 or 1.0)
  • Replace the 'field_name' placeholders with your actual field names
  • The function must handle None/missing values and return appropriate scores
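A minimal sketch of such a function, assuming dictionary inputs and a hypothetical 'answer' field standing in for 'field_name':

  def metric_func(output, expected):
      # Handle missing inputs, as required above
      if output is None or expected is None:
          return 0.0
      # 'answer' is a placeholder; substitute your actual field name
      model_value = output.get("answer")
      expected_value = expected.get("answer")
      if model_value is None or expected_value is None:
          return 0.0
      # Case-insensitive exact match; adjust the comparison to your business logic
      return 1.0 if str(model_value).strip().lower() == str(expected_value).strip().lower() else 0.0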

Existing

How to use: Select from previously created metrics in your project.

Required:
  • At least one metric must already exist in your project
  • Select the desired metric from the list by checking the checkbox

LLM

How to use: Enter evaluation criteria and instructions for an LLM judge; an example prompt follows the requirements below.

Required:
  • Write the evaluation criteria and instructions in the text area
  • Your prompt must instruct the LLM to return either ‘true’ or ‘false’ in its response
  • The LLM judge receives both the model output and the expected output for comparison
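For example, the evaluation criteria might read as follows (illustrative wording, not a required template):

  Compare the model output to the expected output.
  Return 'true' if the model output conveys the same answer as the expected output,
  even if the wording differs. Otherwise return 'false'.
  Respond with only 'true' or 'false'.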

Optimize Prompts

Let Tune automatically improve prompts based on your metrics.