Stock Benchmarks
PySpur includes several pre-built benchmarks derived from academic and industry standards. These benchmarks help you evaluate your AI workflows against established datasets with known correct answers.

Available Benchmarks
GPQA (Graduate-Level Google-Proof Q&A)
Description: A benchmark designed to test a model’s ability to answer graduate-level questions that cannot be resolved through simple Google searches. A minimal scoring sketch follows the details below.

Details:
- Type: Reasoning
- Format: Multiple choice
- Paper: GPQA: A Graduate-Level Google-Proof Q&A Benchmark
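Multiple-choice benchmarks such as GPQA are typically scored by exact match on the selected option letter. A minimal sketch, assuming an illustrative record shape; the field names are hypothetical, not PySpur's actual schema:

```python
# Hypothetical record shape for a multiple-choice item; field names are
# illustrative only, not PySpur's actual schema.
sample = {
    "question": "Which of the following best explains ...?",
    "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "C",  # ground-truth option letter
}

def score_multiple_choice(predicted: str, ground_truth: str) -> bool:
    # Exact match on the option letter, ignoring case and whitespace.
    return predicted.strip().upper() == ground_truth.strip().upper()
```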
GSM8K (Grade School Math 8K)
Description: GSM8K is a dataset of 8,000+ grade school math word problems designed to test multi-step mathematical reasoning. A sketch of the usual answer-extraction step follows the details below.

Details:
- Type: Reasoning
- Format: Free-form answers (numeric)
- Focus: Step-by-step problem solving for grade school math problems
- Paper: Training Verifiers to Solve Math Word Problems
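Because answers are free-form numbers, scoring hinges on extracting the final value from the model's output. GSM8K's reference solutions mark the answer with a `#### <number>` line; model outputs rarely follow that convention, so graders usually fall back to the last number in the text. A minimal sketch (helper names are illustrative):

```python
import re

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

def extract_final_number(text: str) -> str | None:
    """Pull the final numeric answer out of a free-form response."""
    # Reference solutions end with '#### <number>'; otherwise fall back
    # to the last number appearing anywhere in the text.
    marked = re.search(r"####\s*(" + NUMBER + ")", text)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = re.findall(NUMBER, text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output: str, reference_answer: str) -> bool:
    pred = extract_final_number(model_output)
    gold = extract_final_number(reference_answer)
    return pred is not None and pred == gold
```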
MATH

Description: A dataset of 12,500 competition mathematics problems covering various mathematical concepts. A sketch of extracting `\boxed{...}` answers follows the details below.

Details:
- Type: Reasoning
- Format: Free-form answers
- Paper: Measuring Mathematical Problem Solving With the MATH Dataset
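MATH reference solutions wrap the final answer in `\boxed{...}`, so grading starts by pulling that span out; because boxed answers often contain nested braces (e.g. `\boxed{\frac{1}{2}}`), a brace-aware scan is safer than a regex. A minimal sketch; real MATH graders normalize LaTeX far more aggressively than this:

```python
def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a LaTeX solution."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)  # index just past the opening brace
    depth, j = 1, i
    while j < len(solution) and depth > 0:
        if solution[j] == "{":
            depth += 1
        elif solution[j] == "}":
            depth -= 1
        j += 1
    return solution[i:j - 1] if depth == 0 else None

def answers_match(pred: str, gold: str) -> bool:
    # Naive normalization; production graders canonicalize LaTeX
    # (fractions, spacing, units) much more thoroughly.
    return pred.replace(" ", "") == gold.replace(" ", "")
```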
MMLU (Massive Multitask Language Understanding)
Description: A comprehensive benchmark testing knowledge across 57 subjects spanning STEM, humanities, social sciences, and more. A prompt-format sketch follows the details below.

Details:
- Type: Reasoning
- Format: Multiple choice
- Subject categories:
- STEM (mathematics, physics, chemistry, biology, computer science, etc.)
- Humanities (history, philosophy, law, etc.)
- Social Sciences (psychology, sociology, economics, etc.)
- Other (medicine, nutrition, business, etc.)
- Paper: Measuring Massive Multitask Language Understanding
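The original MMLU harness presents each question with lettered choices and an `Answer:` cue, then scores the predicted letter; PySpur's exact prompt template may differ. A zero-shot sketch in that style:

```python
CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_mmlu_prompt(subject: str, question: str, choices: list[str]) -> str:
    # Zero-shot prompt in the style of the original MMLU evaluation code;
    # PySpur's actual template may differ.
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    body = question + "\n"
    for letter, choice in zip(CHOICE_LETTERS, choices):
        body += f"{letter}. {choice}\n"
    return header + body + "Answer:"

print(format_mmlu_prompt(
    "college physics",
    "What is the SI unit of force?",
    ["Joule", "Newton", "Watt", "Pascal"],
))
```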
How to Use Stock Benchmarks
To use these benchmarks for evaluating your workflows:

1. Navigate to the Evaluations section in PySpur
2. Select one of the stock benchmarks
3. Choose the workflow you want to evaluate
4. Configure the sample size (up to the maximum available)
5. Launch the evaluation (a conceptual sketch of what a run computes follows below)
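Conceptually, an evaluation run samples items from the benchmark, runs your workflow on each, scores the output against the reference answer, and reports aggregate accuracy. A minimal sketch, assuming hypothetical `workflow` and `scorer` callables; the real run is launched from the PySpur UI, not from code like this:

```python
import random

def evaluate(workflow, dataset: list, sample_size: int, scorer) -> float:
    """Sample items, run the workflow, score outputs, report accuracy.

    `workflow` and `scorer` are hypothetical stand-ins for a deployed
    PySpur workflow and the benchmark's scoring rule.
    """
    samples = random.sample(dataset, min(sample_size, len(dataset)))
    correct = sum(scorer(workflow(item), item) for item in samples)
    return correct / len(samples)
```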