Evaluations API

This document outlines the API endpoints for managing evaluations in PySpur.

List Available Evaluations

Description: Lists all available evaluations by scanning the tasks directory for YAML files. Returns metadata for each evaluation, including its name, description, type, and number of samples.

URL: /evals/

Method: GET

Response Schema:

List[Dict[str, Any]]

Each dictionary in the list contains:

{
    "name": str,  # Name of the evaluation
    "description": str,  # Description of the evaluation
    "type": str,  # Type of evaluation
    "num_samples": str,  # Number of samples in the evaluation
    "paper_link": str,  # Link to the paper describing the evaluation
    "file_name": str  # Name of the YAML file
}
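
Example: the listing can be fetched with a plain HTTP GET. This is a minimal sketch using the requests library; the base URL http://localhost:8000 is an assumption and should be replaced with your PySpur deployment's address.

import requests

BASE_URL = "http://localhost:8000"  # assumed local PySpur instance

# Fetch metadata for every evaluation found in the tasks directory
response = requests.get(f"{BASE_URL}/evals/")
response.raise_for_status()

for evaluation in response.json():
    print(f"{evaluation['name']} ({evaluation['num_samples']} samples): {evaluation['description']}")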

Launch Evaluation

Description: Launches an evaluation job by triggering the evaluator with the specified evaluation configuration. The evaluation runs asynchronously in the background.

URL: /evals/launch/

Method: POST

Request Payload:

class EvalRunRequest:
    eval_name: str  # Name of the evaluation to run
    workflow_id: str  # ID of the workflow to evaluate
    output_variable: str  # Output variable to evaluate
    num_samples: int = 100  # Number of random samples to evaluate

Response Schema:

class EvalRunResponse:
    run_id: str  # ID of the evaluation run
    eval_name: str  # Name of the evaluation
    workflow_id: str  # ID of the workflow being evaluated
    status: EvalRunStatusEnum  # Status of the evaluation run
    start_time: datetime  # When the evaluation started
    end_time: Optional[datetime]  # When the evaluation ended (if completed)
    results: Optional[Dict[str, Any]]  # Results of the evaluation (if completed)
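
Example: launching an evaluation is a POST with an EvalRunRequest body. The sketch below is illustrative; the base URL, eval_name, workflow_id, and output_variable values are placeholders, and eval_name should match an evaluation returned by the listing endpoint.

import requests

BASE_URL = "http://localhost:8000"  # assumed local PySpur instance

payload = {
    "eval_name": "gsm8k",         # placeholder: an evaluation name from /evals/
    "workflow_id": "wf_123",      # placeholder: ID of an existing workflow
    "output_variable": "answer",  # placeholder: workflow output to score
    "num_samples": 50,            # optional; defaults to 100
}

response = requests.post(f"{BASE_URL}/evals/launch/", json=payload)
response.raise_for_status()

run = response.json()
print(f"Launched run {run['run_id']} with status {run['status']}")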

Get Evaluation Run Status

Description: Gets the status of a specific evaluation run, including results if the evaluation has completed.

URL: /evals/runs/{eval_run_id}

Method: GET

Path Parameters:

eval_run_id: str  # ID of the evaluation run

Response Schema:

class EvalRunResponse:
    run_id: str  # ID of the evaluation run
    eval_name: str  # Name of the evaluation
    workflow_id: str  # ID of the workflow being evaluated
    status: EvalRunStatusEnum  # Status of the evaluation run
    start_time: datetime  # When the evaluation started
    end_time: Optional[datetime]  # When the evaluation ended (if completed)
    results: Optional[Dict[str, Any]]  # Results of the evaluation (if completed)
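
Example: because the evaluation runs in the background, clients typically poll this endpoint until the run reaches a terminal state. The sketch below assumes the same local base URL and treats "COMPLETED" and "FAILED" as terminal EvalRunStatusEnum values; check the enum in your PySpur version for the exact names.

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed local PySpur instance
eval_run_id = "run_abc123"          # placeholder: run_id returned by /evals/launch/

while True:
    response = requests.get(f"{BASE_URL}/evals/runs/{eval_run_id}")
    response.raise_for_status()
    run = response.json()

    if run["status"] in ("COMPLETED", "FAILED"):  # assumed terminal statuses
        break
    time.sleep(10)  # wait before polling again

print(run["results"])  # populated only once the run has completed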

List Evaluation Runs

Description: Lists all evaluation runs, ordered by start time descending.

URL: /evals/runs/

Method: GET

Response Schema:

List[EvalRunResponse]

Where EvalRunResponse contains:

class EvalRunResponse:
    run_id: str  # ID of the evaluation run
    eval_name: str  # Name of the evaluation
    workflow_id: str  # ID of the workflow being evaluated
    status: EvalRunStatusEnum  # Status of the evaluation run
    start_time: datetime  # When the evaluation started
    end_time: Optional[datetime]  # When the evaluation ended (if completed)
    results: Optional[Dict[str, Any]]  # Results of the evaluation (if completed)
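
Example: a quick way to review recent activity is to list all runs and print a one-line summary of each. This sketch again assumes a local deployment at http://localhost:8000.

import requests

BASE_URL = "http://localhost:8000"  # assumed local PySpur instance

response = requests.get(f"{BASE_URL}/evals/runs/")
response.raise_for_status()

# Runs are returned newest-first (ordered by start_time descending)
for run in response.json():
    print(f"{run['run_id']}  {run['eval_name']}  {run['status']}  started {run['start_time']}")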