PySpur includes several pre-built benchmarks derived from academic and industry standards. These benchmarks help you evaluate your AI workflows against established datasets with known correct answers.
One such benchmark evaluates a model's deep reasoning capabilities on challenging questions, where the model must select one of four possible answers (A, B, C, D).
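To illustrate the multiple-choice format, a single evaluation example might look like the sketch below. The field names (`question`, `choices`, `answer`) are illustrative assumptions, not necessarily PySpur's exact dataset schema.

```yaml
# Hypothetical multiple-choice example; field names are illustrative,
# not necessarily PySpur's actual dataset schema.
- question: "Which particle mediates the electromagnetic force?"
  choices:
    A: "Gluon"
    B: "Photon"
    C: "W boson"
    D: "Graviton"
  answer: "B"   # ground-truth label used to score the workflow's output
```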
To use these benchmarks for evaluating your workflows:
1. Navigate to the Evaluations section in PySpur
2. Select one of the stock benchmarks
3. Choose the workflow you want to evaluate
4. Configure the sample size (up to the maximum available)
5. Launch the evaluation
After completion, you’ll receive detailed metrics on your workflow’s performance, including overall accuracy, per-category breakdowns, and example-level results.
While stock benchmarks provide a good starting point, the real power of PySpur's evaluation system comes from using data that matches your specific use cases. You can create custom evaluations based on your own datasets by following the same YAML structure as the stock benchmarks.
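As a rough sketch, a custom evaluation file might look something like the following. The keys shown here (`name`, `description`, `type`, `file_path`, and so on) are assumptions for illustration; mirror one of the stock benchmark YAML files in your PySpur installation for the exact schema.

```yaml
# Hypothetical custom evaluation config; keys are illustrative assumptions.
# Copy a stock benchmark YAML and adapt it to match the real schema.
name: support_ticket_triage
description: Classify incoming support tickets into the correct queue.
type: multiple_choice              # assumed task type, matching the stock benchmarks
data:
  file_path: data/support_tickets.csv   # your own labeled dataset
  question_column: ticket_text
  ground_truth_column: correct_queue
metrics:
  - accuracy
```

Once the file is in place, it appears alongside the stock benchmarks in the Evaluations section and can be run with the same steps described above.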