Concepts
Learn how PySpur helps you measure the performance of your AI workflows
Understanding Evaluations in PySpur
Evaluation is the process of measuring how well your AI workflows perform against objective benchmarks. Instead of guessing whether your workflow is doing a good job, evaluations provide quantitative metrics so you can:
- Measure the accuracy of your workflow’s outputs
- Compare different versions of your workflows
- Identify areas for improvement
- Build trust in your AI systems
Why Evaluate?
Without evaluation, it’s difficult to know if your AI systems are performing as expected. Evaluations help you:
- Verify accuracy: Ensure your workflows produce correct answers
- Track improvement: Measure progress as you refine your workflows
- Compare approaches: Determine which techniques work best
- Build confidence: Provide evidence of your system’s capabilities
How Evaluations Work in PySpur
The evaluation process in PySpur has three main components:
1. Evaluation Benchmarks
PySpur includes pre-built benchmarks from academic and industry standards. Each benchmark:
- Contains a dataset of problems with known correct answers
- Specifies how to format inputs for your workflow
- Defines how to extract and evaluate outputs from your workflow
For demonstration purposes, we provide some stock benchmarks for:
- Mathematical reasoning (GSM8K)
- Graduate-level question answering
The real power of evaluations, however, comes from benchmarks built on data that matches your own use cases.
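To make the benchmark structure concrete, here is a minimal sketch of what a benchmark definition might contain: a dataset of problems with known answers, an input template, and functions for extracting and scoring answers. The `Benchmark` dataclass, its field names, and the toy problems are illustrative assumptions, not PySpur's actual schema.
```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Benchmark:
    """Illustrative benchmark definition (names and fields are hypothetical)."""
    name: str                                  # e.g. "gsm8k"
    problems: list[dict]                       # each item pairs an input with a ground-truth answer
    input_template: str                        # how to format a problem for the workflow
    extract_answer: Callable[[str], str]       # pulls the final answer out of the workflow output
    is_correct: Callable[[str, str], bool]     # compares a prediction against the ground truth

# A toy math benchmark with two problems
toy_math = Benchmark(
    name="toy-math",
    problems=[
        {"question": "What is 2 + 2?", "answer": "4"},
        {"question": "What is 10 / 5?", "answer": "2"},
    ],
    input_template="Solve the problem and answer with a single number: {question}",
    extract_answer=lambda output: output.strip().split()[-1].rstrip("."),
    is_correct=lambda prediction, truth: prediction == truth,
)
```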
2. Your Workflow
You connect your existing PySpur workflow to the evaluation system. The workflow:
- Receives inputs from the evaluation dataset
- Processes them through your custom logic and AI components
- Returns outputs that will be compared against the ground truth
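From the evaluation system's point of view, a workflow is simply a callable that takes a formatted benchmark input and returns an output. The sketch below shows that contract; `run_workflow` and `call_llm` are hypothetical placeholders, not PySpur functions.
```python
def call_llm(prompt: str) -> str:
    """Placeholder for the LLM node(s) inside your workflow."""
    raise NotImplementedError("Replace with your workflow's AI components")

def run_workflow(formatted_input: str) -> str:
    """Receives a benchmark input, applies custom logic and AI components, returns an output."""
    prompt = f"Answer carefully and end with the final answer.\n\n{formatted_input}"
    return call_llm(prompt)
```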
3. Results and Metrics
After running an evaluation, PySpur provides detailed metrics:
- Accuracy: The percentage of correct answers
- Per-category breakdowns: How performance varies across problem types
- Example-level results: Which specific examples succeeded or failed
- Visualizations: Charts and graphs to help interpret results
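The accuracy and per-category numbers are straightforward to compute from example-level results. Here is a small sketch of that aggregation; the result dictionaries use an assumed shape for illustration, not PySpur's internal schema.
```python
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Aggregate example-level results into overall and per-category accuracy.

    Each result is assumed to look like {"category": "arithmetic", "correct": True}.
    """
    overall = sum(r["correct"] for r in results) / len(results)
    by_category: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["correct"])
    return {
        "accuracy": overall,
        "per_category": {cat: sum(vals) / len(vals) for cat, vals in by_category.items()},
    }
```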
The Evaluation Workflow in PySpur
Here’s how to run an evaluation in PySpur:
1. Choose an Evaluation Benchmark
   - Browse the available evaluation benchmarks
   - Review the description, problem type, and sample size
2. Select a Workflow to Evaluate
   - Choose which of your workflows to test
   - Select the specific output variable to evaluate
3. Configure the Evaluation
   - Choose how many samples to evaluate (up to the max available)
   - Launch the evaluation job (a hypothetical API sketch follows these steps)
4. Review Results
   - Monitor the evaluation progress in real time
   - Once completed, view detailed accuracy metrics
   - Analyze per-example results to identify patterns in errors
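If you prefer to script these steps rather than use the UI, the launch request might look roughly like the sketch below. The base URL, endpoint path, payload keys, and response shape are assumptions for illustration; check PySpur's API reference for the actual interface.
```python
import requests  # third-party: pip install requests

BASE_URL = "http://localhost:6080/api"  # assumed local PySpur instance

payload = {
    "eval_name": "gsm8k",         # which benchmark to run
    "workflow_id": "wf_123",      # which workflow to evaluate
    "output_variable": "answer",  # which output variable to score
    "num_samples": 50,            # how many samples to evaluate
}

# Hypothetical endpoint -- not necessarily PySpur's documented route.
response = requests.post(f"{BASE_URL}/evals/launch", json=payload, timeout=30)
response.raise_for_status()
print("Launched evaluation job:", response.json())
```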
Example Evaluation Results
Evaluation results typically show:
- Overall accuracy across all samples
- A breakdown of performance by category
- Individual examples with their outputs and correctness
- Patterns in what your model gets right or wrong
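As a purely illustrative example, a report of this kind might be represented as a structure like the following; every field name and number is a placeholder, not output from a real run.
```python
# Placeholder report -- field names and values are invented for illustration only.
example_report = {
    "accuracy": 0.82,                 # overall accuracy across all samples
    "per_category": {                 # breakdown of performance by category
        "arithmetic": 0.90,
        "word_problems": 0.74,
    },
    "examples": [                     # individual examples with outputs and correctness
        {"input": "What is 2 + 2?", "prediction": "4", "ground_truth": "4", "correct": True},
        {"input": "What is 7 * 8?", "prediction": "54", "ground_truth": "56", "correct": False},
    ],
}
```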
Best Practices for Evaluation
For reliable evaluation results in PySpur:
- Use appropriate benchmarks: Choose evaluations that match your workflow’s purpose
- Select enough samples: Larger sample counts give tighter, more reliable accuracy estimates (see the sketch after this list)
- Choose the right output variable: Make sure you’re evaluating the right part of your workflow
- Iterate based on results: Use the findings to improve your workflow
- Compare systematically: When testing different approaches, keep other variables constant
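To make the sample-size advice concrete, a standard normal-approximation margin of error shows how quickly accuracy estimates tighten as you add samples (this is generic statistics, not a PySpur feature):
```python
import math

def accuracy_margin_of_error(accuracy: float, num_samples: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy estimate (normal approximation)."""
    return z * math.sqrt(accuracy * (1 - accuracy) / num_samples)

# At 82% measured accuracy: 50 samples -> roughly +/- 0.11, 500 samples -> roughly +/- 0.03.
for n in (50, 200, 500):
    print(n, round(accuracy_margin_of_error(0.82, n), 3))
```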
Technical Details
Behind the scenes, PySpur’s evaluation system:
- Loads the evaluation dataset (typically from YAML configuration files)
- Runs your workflow on each example in the dataset
- Extracts answers from your workflow’s output
- Compares the predicted answers to ground truth using task-specific criteria
- Calculates metrics like accuracy, both overall and by category
- Stores results for future reference
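Put together, that pipeline can be sketched in a few lines, reusing the hypothetical `Benchmark` and `run_workflow` sketches from earlier in this page; this is an illustration of the flow, not PySpur's actual implementation:
```python
import json
from pathlib import Path

def evaluate(benchmark, run_workflow) -> dict:
    """Run a workflow over a benchmark and compute/store accuracy (illustrative only)."""
    results = []
    for problem in benchmark.problems:                                  # 1. iterate over the dataset
        formatted = benchmark.input_template.format(**problem)
        output = run_workflow(formatted)                                # 2. run the workflow
        prediction = benchmark.extract_answer(output)                   # 3. extract the answer
        correct = benchmark.is_correct(prediction, problem["answer"])   # 4. compare to ground truth
        results.append({"input": formatted, "prediction": prediction, "correct": correct})
    accuracy = sum(r["correct"] for r in results) / len(results)        # 5. calculate metrics
    report = {"accuracy": accuracy, "examples": results}
    Path("eval_results.json").write_text(json.dumps(report, indent=2))  # 6. store results
    return report
```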