Stock Benchmarks

PySpur includes several pre-built benchmarks derived from academic and industry standards. These benchmarks help you evaluate your AI workflows against established datasets with known correct answers.

Available Benchmarks

GPQA (Google-Proof Question Answering)

Description: A benchmark designed to test a model’s ability to answer questions that are difficult to resolve through simple Google searches.

Details:

This benchmark evaluates a model’s deep reasoning capabilities across challenging questions. The model must select from four possible answers (A, B, C, D).
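
As a purely illustrative sketch of what pulling that letter out of a free-form response can look like (the helper and regular expression below are our own, not PySpur’s actual parser):

```python
import re

def extract_choice(response: str) -> str | None:
    """Return the answer letter (A-D) found in a model response.

    Illustrative heuristic only: prefer an explicit "Answer: X" declaration,
    otherwise fall back to the last standalone A-D letter in the text.
    """
    match = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([ABCD])\)?", response)
    if match:
        return match.group(1)
    letters = re.findall(r"\b([ABCD])\b", response)
    return letters[-1] if letters else None

print(extract_choice("Reasoning... so the answer is (C)."))  # -> "C"
```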

GSM8K (Grade School Math 8K)

Description: GSM8K is a dataset of roughly 8,500 grade school math word problems designed to test multi-step mathematical reasoning.

Details:

The model is prompted to solve each problem step by step, with the evaluation focused on extracting the final numeric answer.
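
As a sketch of what extracting the final numeric answer can look like (the helper below is illustrative and not PySpur’s implementation; GSM8K reference answers end with a "#### <number>" marker, which the sketch checks for first):

```python
import re

def extract_final_number(response: str) -> float | None:
    """Pull the final number out of a step-by-step solution (illustrative only)."""
    # Prefer the GSM8K-style "#### <answer>" marker when present.
    marked = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", response)
    if marked:
        return float(marked.group(1).replace(",", ""))
    # Otherwise fall back to the last number that appears in the text.
    numbers = re.findall(r"-?[\d,]*\.?\d+", response)
    return float(numbers[-1].replace(",", "")) if numbers else None

print(extract_final_number("She earns 5 * 12 = 60, so #### 60"))  # -> 60.0
```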

Math

Description: A dataset of 12,500 competition-style mathematics problems covering topics such as algebra, geometry, number theory, and precalculus.

Details:

This benchmark requires models to work through math problems step by step and provide a clear final answer.
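
A minimal sketch of comparing a model’s final answer against a reference once both are normalized (the function names and normalization rules are our own, not PySpur’s); solutions in this style often wrap the final answer in \boxed{...}, which is why the sketch looks for that first:

```python
import re

def normalize_answer(text: str) -> str:
    """Normalize a final answer for string comparison (illustrative only)."""
    # Prefer the contents of a \boxed{...} expression if one is present.
    # Note: this simple pattern does not handle nested braces.
    boxed = re.search(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        text = boxed.group(1)
    # Strip whitespace, dollar signs, trailing punctuation, and internal spaces.
    return text.strip().strip("$").rstrip(".").replace(" ", "")

def is_correct(model_answer: str, reference: str) -> bool:
    return normalize_answer(model_answer) == normalize_answer(reference)

print(is_correct(r"Therefore the area is \boxed{42}.", "42"))  # -> True
```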

MMLU (Massive Multitask Language Understanding)

Description: A comprehensive benchmark testing knowledge across 57 different subjects spanning STEM, humanities, social sciences, and more.

Details:

  • Type: Reasoning
  • Format: Multiple choice
  • Subject categories:
    • STEM (mathematics, physics, chemistry, biology, computer science, etc.)
    • Humanities (history, philosophy, law, etc.)
    • Social Sciences (psychology, sociology, economics, etc.)
    • Other (medicine, nutrition, business, etc.)
  • Paper: Measuring Massive Multitask Language Understanding

MMLU evaluates a model’s breadth of knowledge across academic and professional subjects, requiring both factual recall and reasoning.
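
As an illustration of how a multiple-choice item like those in MMLU can be rendered as a prompt (the formatting below is a common convention, not PySpur’s exact template):

```python
def format_mmlu_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its four options as a multiple-choice prompt.

    Illustrative only; PySpur's actual prompt template may differ.
    """
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)

prompt = format_mmlu_prompt(
    "Which gas makes up most of Earth's atmosphere?",
    ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
)
print(prompt)
```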

How to Use Stock Benchmarks

To use these benchmarks for evaluating your workflows:

  1. Navigate to the Evaluations section in PySpur
  2. Select one of the stock benchmarks
  3. Choose the workflow you want to evaluate
  4. Configure sample size (up to the maximum available)
  5. Launch the evaluation

After completion, you’ll receive detailed metrics on your workflow’s performance, including overall accuracy, per-category breakdowns, and example-level results.
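
For a sense of how those metrics relate, here is a small sketch (our own, not PySpur’s output format) that computes overall and per-category accuracy from hypothetical example-level results:

```python
from collections import defaultdict

# Hypothetical example-level results: each entry records the benchmark
# category and whether the workflow's answer matched the reference.
results = [
    {"category": "STEM", "correct": True},
    {"category": "STEM", "correct": False},
    {"category": "Humanities", "correct": True},
]

overall = sum(r["correct"] for r in results) / len(results)

per_category = defaultdict(list)
for r in results:
    per_category[r["category"]].append(r["correct"])

print(f"Overall accuracy: {overall:.2%}")
for category, outcomes in per_category.items():
    print(f"{category}: {sum(outcomes) / len(outcomes):.2%}")
```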

Customizing Evaluations

While stock benchmarks provide a good starting point, the real power of PySpur’s evaluation system comes from evaluating against data that matches your specific use cases. You can create custom evaluations from your own datasets by following the same YAML structure as our stock benchmarks.
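
As a purely hypothetical sketch of generating such a file, the snippet below writes a small dataset to YAML; the field names are illustrative, so copy the exact structure from one of the stock benchmark files rather than from this example:

```python
import yaml  # requires PyYAML

# Hypothetical fields; mirror a real stock benchmark file for the exact schema.
custom_benchmark = {
    "name": "support-ticket-triage",
    "description": "Evaluate ticket-routing answers against labeled examples.",
    "examples": [
        {"input": "My invoice is wrong", "expected_output": "billing"},
        {"input": "The app crashes on login", "expected_output": "technical"},
    ],
}

with open("support_ticket_triage.yaml", "w") as f:
    yaml.safe_dump(custom_benchmark, f, sort_keys=False)
```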