CouncilAsAJudge¶

The CouncilAsAJudge is a sophisticated evaluation system that employs multiple AI agents to assess model responses across various dimensions. It provides comprehensive, multi-dimensional analysis of AI model outputs through parallel evaluation and aggregation.

Overview¶

The CouncilAsAJudge implements a council of specialized AI agents that evaluate different aspects of a model's response. Each agent focuses on a specific dimension of evaluation, and their findings are aggregated into a comprehensive report.

graph TD
    A[User Query] --> B[Base Agent]
    B --> C[Model Response]
    C --> D[CouncilAsAJudge]

    subgraph "Evaluation Dimensions"
        D --> E1[Accuracy Agent]
        D --> E2[Helpfulness Agent]
        D --> E3[Harmlessness Agent]
        D --> E4[Coherence Agent]
        D --> E5[Conciseness Agent]
        D --> E6[Instruction Adherence Agent]
    end

    E1 --> F[Evaluation Aggregation]
    E2 --> F
    E3 --> F
    E4 --> F
    E5 --> F
    E6 --> F

    F --> G[Comprehensive Report]

    style D fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px

Key Features¶

Parallel evaluation across multiple dimensions
Caching system for improved performance
Dynamic model selection
Comprehensive evaluation metrics
Thread-safe execution
Detailed technical analysis

Installation¶

pip install swarms

Basic Usage¶

from swarms import Agent, CouncilAsAJudge

# Create a base agent
base_agent = Agent(
    agent_name="Financial-Analysis-Agent",
    system_prompt="You are a financial expert helping users understand and establish ROTH IRAs.",
    model_name="claude-opus-4-20250514",
    max_loops=1,
)

# Run the base agent
user_query = "How can I establish a ROTH IRA to buy stocks and get a tax break?"
model_output = base_agent.run(user_query)

# Create and run the council
panel = CouncilAsAJudge()
results = panel.run(user_query, model_output)
print(results)

Advanced Usage¶

Custom Model Configuration¶

from swarms import CouncilAsAJudge

# Initialize with custom model
council = CouncilAsAJudge(
    model_name="anthropic/claude-3-sonnet-20240229",
    output_type="all",
    cache_size=256,
    max_workers=4,
    random_model_name=False
)

Parallel Processing Configuration¶

from swarms import CouncilAsAJudge

# Configure parallel processing
council = CouncilAsAJudge(
    max_workers=8,  # Custom number of worker threads
    random_model_name=True  # Enable dynamic model selection
)

Evaluation Dimensions¶

The council evaluates responses across six key dimensions:

Dimension	Evaluation Criteria
Accuracy	• Factual correctness • Source credibility • Temporal consistency • Technical accuracy
Helpfulness	• Problem-solving efficacy • Solution feasibility • Context inclusion • Proactive addressing of follow-ups
Harmlessness	• Safety assessment • Ethical considerations • Age-appropriateness • Content sensitivity
Coherence	• Structural integrity • Logical flow • Information hierarchy • Transition effectiveness
Conciseness	• Communication efficiency • Information density • Redundancy elimination • Focus maintenance
Instruction Adherence	• Requirement coverage • Constraint compliance • Format matching • Scope appropriateness

API Reference¶

CouncilAsAJudge¶

class CouncilAsAJudge:
    def __init__(
        self,
        id: str = swarm_id(),
        name: str = "CouncilAsAJudge",
        description: str = "Evaluates the model's response across multiple dimensions",
        model_name: str = "gpt-4o-mini",
        output_type: str = "all",
        cache_size: int = 128,
        max_workers: int = None,
        random_model_name: bool = True,
    )

Parameters¶

id (str): Unique identifier for the council
name (str): Display name of the council
description (str): Description of the council's purpose
model_name (str): Name of the model to use for evaluations
output_type (str): Type of output to return
cache_size (int): Size of the LRU cache for prompts
max_workers (int): Maximum number of worker threads
random_model_name (bool): Whether to use random model selection

Methods¶

run¶

def run(self, task: str, model_response: str) -> None

Evaluates a model response across all dimensions.

Parameters¶

task (str): Original user prompt
model_response (str): Model's response to evaluate

Returns¶

Comprehensive evaluation report

Examples¶

Financial Analysis Example¶

from swarms import Agent, CouncilAsAJudge

# Create financial analysis agent
financial_agent = Agent(
    agent_name="Financial-Analysis-Agent",
    system_prompt="You are a financial expert helping users understand and establish ROTH IRAs.",
    model_name="claude-opus-4-20250514",
    max_loops=1,
)

# Run analysis
query = "How can I establish a ROTH IRA to buy stocks and get a tax break?"
response = financial_agent.run(query)

# Evaluate response
council = CouncilAsAJudge()
evaluation = council.run(query, response)
print(evaluation)

Technical Documentation Example¶

from swarms import Agent, CouncilAsAJudge

# Create documentation agent
doc_agent = Agent(
    agent_name="Documentation-Agent",
    system_prompt="You are a technical documentation expert.",
    model_name="gpt-4",
    max_loops=1,
)

# Generate documentation
query = "Explain how to implement a REST API using FastAPI"
response = doc_agent.run(query)

# Evaluate documentation quality
council = CouncilAsAJudge(
    model_name="anthropic/claude-3-sonnet-20240229",
    output_type="all"
)
evaluation = council.run(query, response)
print(evaluation)

Best Practices¶

Model Selection¶

Model Selection Best Practices

Choose appropriate models for your use case
Consider using random model selection for diverse evaluations
Match model capabilities to evaluation requirements

Performance Optimization¶

Performance Tips

Adjust cache size based on memory constraints
Configure worker threads based on CPU cores
Monitor memory usage with large responses

Error Handling¶

Error Handling Guidelines

Implement proper exception handling
Monitor evaluation failures
Log evaluation results for analysis

Resource Management¶

Resource Management

Clean up resources after evaluation
Monitor thread pool usage
Implement proper shutdown procedures

Troubleshooting¶

Memory Issues¶

Memory Problems

If you encounter memory-related problems:

Reduce cache size
Decrease number of worker threads
Process smaller chunks of text

Performance Problems¶

Performance Issues

To improve performance:

Increase cache size
Adjust worker thread count
Use more efficient models

Evaluation Failures¶

Evaluation Issues

When evaluations fail:

Check model availability
Verify input format
Monitor error logs

Contributing¶

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License¶

License

This project is licensed under the MIT License - see the LICENSE file for details.