🧪 New evals for multi-turn and SQL-based use cases!

New

We’ve added a new set of evaluators (LLM-as-a-judge and statistical) to help you ship high-quality AI applications, with a strong focus on evals for agentic and NL-to-SQL workflows. Key highlights:

Multi-turn evals: Evaluate whether an agent successfully completes user tasks, makes correct tool choices, executes and completes the required steps, and follows the correct trajectory to achieve user goals.
SQL evals: Validate query syntax and adherence to the database schema, and evaluate the correctness of SQL queries generated from natural language input.
Tool call evals: Check whether the model selected the correct tool with the right parameters, and measure how accurately it called the expected tools.
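
To make the tool call checks concrete, here is a minimal sketch of what a statistical tool call evaluator might compute. It is not the implementation behind the evaluators in the Evaluator Store: the `ToolCall` class, the function names, and the exact-match scoring rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation (illustrative only)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def tool_call_matches(expected: ToolCall, actual: ToolCall) -> bool:
    """True when the tool name and every parameter match exactly."""
    return expected.name == actual.name and expected.arguments == actual.arguments


def tool_call_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected calls that appear among the actual calls.

    Each actual call can satisfy at most one expected call, and call
    order is ignored (an assumption of this sketch).
    """
    if not expected:
        # With nothing expected, any actual call counts as a miss.
        return 1.0 if not actual else 0.0
    remaining = list(actual)
    matched = 0
    for exp in expected:
        for i, act in enumerate(remaining):
            if tool_call_matches(exp, act):
                matched += 1
                del remaining[i]  # consume the matched call
                break
    return matched / len(expected)


expected_calls = [
    ToolCall("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ToolCall("book_flight", {"flight_id": "UA100"}),
]
actual_calls = [
    ToolCall("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ToolCall("book_flight", {"flight_id": "UA200"}),  # wrong parameter value
]
score = tool_call_accuracy(expected_calls, actual_calls)
print(score)
```

In this example the agent picked both expected tools but passed a wrong parameter to one of them, so only one of the two expected calls counts as correct. An LLM-as-a-judge variant would replace the exact-match comparison with a model-graded one.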

You can add these evaluators to your workspaces from the Evaluator Store and start using them right away!