CouncilAsAJudge¶
CouncilAsAJudge is an evaluation system that uses multiple AI agents to assess a model's response along several dimensions. The agents run in parallel, and their individual findings are aggregated into a single multi-dimensional report.
Overview¶
The CouncilAsAJudge implements a council of specialized AI agents that evaluate different aspects of a model's response. Each agent focuses on a specific dimension of evaluation, and their findings are aggregated into a comprehensive report.
graph TD
    A[User Query] --> B[Base Agent]
    B --> C[Model Response]
    C --> D[CouncilAsAJudge]
    subgraph "Evaluation Dimensions"
        D --> E1[Accuracy Agent]
        D --> E2[Helpfulness Agent]
        D --> E3[Harmlessness Agent]
        D --> E4[Coherence Agent]
        D --> E5[Conciseness Agent]
        D --> E6[Instruction Adherence Agent]
    end
    E1 --> F[Evaluation Aggregation]
    E2 --> F
    E3 --> F
    E4 --> F
    E5 --> F
    E6 --> F
    F --> G[Comprehensive Report]
    style D fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px
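The fan-out and aggregation flow in the diagram can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `evaluate_dimension` and `run_council` are hypothetical names, and the judge is a stub where a real council would call a language model.

```python
from concurrent.futures import ThreadPoolExecutor

# Dimension names taken from the diagram above.
DIMENSIONS = [
    "accuracy",
    "helpfulness",
    "harmlessness",
    "coherence",
    "conciseness",
    "instruction_adherence",
]


def evaluate_dimension(dimension: str, task: str, response: str) -> str:
    """Hypothetical stand-in for one judge agent (a real judge would call an LLM)."""
    return f"[{dimension}] reviewed {len(response.split())} words for task: {task!r}"


def run_council(task: str, response: str, max_workers: int = 6) -> str:
    """Send the response to one evaluator per dimension, then aggregate."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            dim: pool.submit(evaluate_dimension, dim, task, response)
            for dim in DIMENSIONS
        }
        # Collect in a fixed order so the aggregated report is deterministic.
        findings = [futures[dim].result() for dim in DIMENSIONS]
    return "\n".join(findings)


report = run_council("What is a Roth IRA?", "A Roth IRA is a retirement account.")
print(report)
```

Each dimension is independent of the others, which is why the evaluations can run concurrently and be merged only at the end.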
Key Features¶
- Parallel evaluation across multiple dimensions
- Caching system for improved performance
- Dynamic model selection
- Comprehensive evaluation metrics
- Thread-safe execution
- Detailed technical analysis
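As an illustration of the caching feature, the sketch below memoizes prompt construction with `functools.lru_cache`. The `build_prompt` function and the `calls` counter are hypothetical; only the idea of an LRU cache for prompts comes from this page, and the size of 128 mirrors the documented `cache_size` default.

```python
from functools import lru_cache

# Counts how often the prompt is actually built (for demonstration only).
calls = {"count": 0}


@lru_cache(maxsize=128)
def build_prompt(dimension: str, task: str) -> str:
    """Hypothetical prompt builder; repeated arguments are served from the cache."""
    calls["count"] += 1
    return f"Evaluate the response to {task!r} for {dimension}."


first = build_prompt("accuracy", "Explain Roth IRAs")
second = build_prompt("accuracy", "Explain Roth IRAs")  # cache hit, no rebuild
info = build_prompt.cache_info()
print(calls["count"], info.hits, info.misses)
```

`lru_cache` is safe to call from multiple threads, which fits the parallel evaluation described above, although two threads that miss at the same moment may both compute the value.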
 
Installation¶
pip install -U swarms
Basic Usage¶
from swarms import Agent, CouncilAsAJudge
# Create a base agent
base_agent = Agent(
    agent_name="Financial-Analysis-Agent",
    system_prompt="You are a financial expert helping users understand and establish ROTH IRAs.",
    model_name="claude-opus-4-20250514",
    max_loops=1,
)
# Run the base agent
user_query = "How can I establish a ROTH IRA to buy stocks and get a tax break?"
model_output = base_agent.run(user_query)
# Create and run the council
panel = CouncilAsAJudge()
results = panel.run(user_query, model_output)
print(results)
Advanced Usage¶
Custom Model Configuration¶
from swarms import CouncilAsAJudge
# Initialize with custom model
council = CouncilAsAJudge(
    model_name="anthropic/claude-3-sonnet-20240229",
    output_type="all",
    cache_size=256,
    max_workers=4,
    random_model_name=False
)
Parallel Processing Configuration¶
from swarms import CouncilAsAJudge
# Configure parallel processing
council = CouncilAsAJudge(
    max_workers=8,  # Custom number of worker threads
    random_model_name=True  # Enable dynamic model selection
)
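What dynamic model selection does internally is not described on this page. One plausible reading, shown here purely as an assumption, is that each judge picks a model at random from a pool when `random_model_name` is enabled and otherwise uses `model_name`. The pool contents and `pick_model` are hypothetical.

```python
import random

# Hypothetical placeholder names, not real model identifiers.
CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]


def pick_model(default: str, randomize: bool, rng: random.Random) -> str:
    """Return a random candidate when randomize is set, else the default model."""
    return rng.choice(CANDIDATE_MODELS) if randomize else default


rng = random.Random(0)  # seeded so repeated runs pick the same sequence
fixed = pick_model("gpt-4o-mini", randomize=False, rng=rng)
varied = [pick_model("gpt-4o-mini", randomize=True, rng=rng) for _ in range(6)]
print(fixed, varied)
```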
Evaluation Dimensions¶
The council evaluates responses across six key dimensions:
| Dimension | Evaluation Criteria | 
|---|---|
| Accuracy | • Factual correctness • Source credibility • Temporal consistency • Technical accuracy  | 
| Helpfulness | • Problem-solving efficacy • Solution feasibility • Context inclusion • Proactive addressing of follow-ups  | 
| Harmlessness | • Safety assessment • Ethical considerations • Age-appropriateness • Content sensitivity  | 
| Coherence | • Structural integrity • Logical flow • Information hierarchy • Transition effectiveness  | 
| Conciseness | • Communication efficiency • Information density • Redundancy elimination • Focus maintenance  | 
| Instruction Adherence | • Requirement coverage • Constraint compliance • Format matching • Scope appropriateness  | 
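The table can be expressed as a plain mapping from dimension to criteria, which is one way a per-dimension judge prompt might be assembled. The `CRITERIA` dictionary copies the table; `build_judge_prompt` and its wording are assumptions for illustration, not the prompts the library uses.

```python
# Dimension -> criteria, copied from the table above.
CRITERIA = {
    "Accuracy": ["Factual correctness", "Source credibility",
                 "Temporal consistency", "Technical accuracy"],
    "Helpfulness": ["Problem-solving efficacy", "Solution feasibility",
                    "Context inclusion", "Proactive addressing of follow-ups"],
    "Harmlessness": ["Safety assessment", "Ethical considerations",
                     "Age-appropriateness", "Content sensitivity"],
    "Coherence": ["Structural integrity", "Logical flow",
                  "Information hierarchy", "Transition effectiveness"],
    "Conciseness": ["Communication efficiency", "Information density",
                    "Redundancy elimination", "Focus maintenance"],
    "Instruction Adherence": ["Requirement coverage", "Constraint compliance",
                              "Format matching", "Scope appropriateness"],
}


def build_judge_prompt(dimension: str, task: str, response: str) -> str:
    """Hypothetical prompt template listing the criteria for one dimension."""
    bullets = "\n".join(f"- {item}" for item in CRITERIA[dimension])
    return (
        f"Evaluate the response for {dimension}.\n"
        f"Criteria:\n{bullets}\n\n"
        f"Task: {task}\nResponse: {response}"
    )


prompt = build_judge_prompt("Accuracy", "What is a Roth IRA?", "A retirement account.")
print(prompt)
```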
API Reference¶
CouncilAsAJudge¶
class CouncilAsAJudge:
    def __init__(
        self,
        id: str = swarm_id(),
        name: str = "CouncilAsAJudge",
        description: str = "Evaluates the model's response across multiple dimensions",
        model_name: str = "gpt-4o-mini",
        output_type: str = "all",
        cache_size: int = 128,
        max_workers: int = None,
        random_model_name: bool = True,
    )
Parameters¶
- id (str): Unique identifier for the council
- name (str): Display name of the council
- description (str): Description of the council's purpose
- model_name (str): Name of the model to use for evaluations
- output_type (str): Type of output to return
- cache_size (int): Size of the LRU cache for prompts
- max_workers (int): Maximum number of worker threads
- random_model_name (bool): Whether to use random model selection
Methods¶
run¶
Evaluates a model response across all dimensions.
Parameters¶
- task (str): Original user prompt
- model_response (str): Model's response to evaluate
Returns¶
- Comprehensive evaluation report
 
Examples¶
Financial Analysis Example¶
from swarms import Agent, CouncilAsAJudge
# Create financial analysis agent
financial_agent = Agent(
    agent_name="Financial-Analysis-Agent",
    system_prompt="You are a financial expert helping users understand and establish ROTH IRAs.",
    model_name="claude-opus-4-20250514",
    max_loops=1,
)
# Run analysis
query = "How can I establish a ROTH IRA to buy stocks and get a tax break?"
response = financial_agent.run(query)
# Evaluate response
council = CouncilAsAJudge()
evaluation = council.run(query, response)
print(evaluation)
Technical Documentation Example¶
from swarms import Agent, CouncilAsAJudge
# Create documentation agent
doc_agent = Agent(
    agent_name="Documentation-Agent",
    system_prompt="You are a technical documentation expert.",
    model_name="gpt-4",
    max_loops=1,
)
# Generate documentation
query = "Explain how to implement a REST API using FastAPI"
response = doc_agent.run(query)
# Evaluate documentation quality
council = CouncilAsAJudge(
    model_name="anthropic/claude-3-sonnet-20240229",
    output_type="all"
)
evaluation = council.run(query, response)
print(evaluation)
Best Practices¶
Model Selection¶
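Model Selection Best Practices
- Choose a judge model (model_name) whose capabilities match what the evaluation requires
- Consider enabling random model selection (random_model_name=True) for more diverse evaluations
- Use a stronger model when the evaluated responses are long or highly technical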
 
Performance Optimization¶
Performance Tips
- Adjust cache_size based on memory constraints
- Set max_workers based on the number of CPU cores
- Monitor memory usage when evaluating large responses
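One way to pick a worker count from the CPU core count is sketched below. `suggest_max_workers` is a hypothetical helper; the `min(32, cores + 4)` formula is the default that Python's `ThreadPoolExecutor` has used since 3.8, and capping at the number of dimensions assumes one task per dimension.

```python
import os


def suggest_max_workers(n_dimensions: int = 6) -> int:
    """Suggest a thread count: never more than one worker per dimension."""
    # Same heuristic ThreadPoolExecutor applies when max_workers is None.
    executor_default = min(32, (os.cpu_count() or 1) + 4)
    return max(1, min(n_dimensions, executor_default))


print(suggest_max_workers())
```

The result could then be passed as `max_workers` when constructing the council. Because judge calls mostly wait on network I/O, the core count is a loose upper bound rather than a strict limit.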
 
Error Handling¶
Error Handling Guidelines
- Implement proper exception handling
- Monitor evaluation failures
- Log evaluation results for analysis
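The guidelines above can be sketched as follows: each per-dimension evaluation is wrapped so that one failure is logged and recorded without discarding the other results. `flaky_judge` and `evaluate_all` are hypothetical stand-ins, and the simulated failure exists only to exercise the error path.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger("council_sketch")


def flaky_judge(dimension: str) -> str:
    """Hypothetical judge that fails for one dimension to simulate an outage."""
    if dimension == "coherence":
        raise RuntimeError("model unavailable")
    return f"{dimension}: ok"


def evaluate_all(dimensions):
    """Run every judge, keeping successes and failures separate."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        future_to_dim = {pool.submit(flaky_judge, d): d for d in dimensions}
        for future in as_completed(future_to_dim):
            dim = future_to_dim[future]
            try:
                results[dim] = future.result()
            except Exception as exc:  # record the failure, keep the other results
                logger.warning("Evaluation failed for %s: %s", dim, exc)
                failures[dim] = str(exc)
    return results, failures


results, failures = evaluate_all(["accuracy", "coherence", "conciseness"])
print(results, failures)
```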
 
Resource Management¶
- Clean up resources after evaluation
- Monitor thread pool usage
- Shut down worker threads explicitly when the council is no longer needed
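For thread pools that you manage yourself around evaluations, a context manager is a simple way to guarantee shutdown. The `EvaluationPool` wrapper below is a hypothetical helper, not part of the library; whether the council exposes its own shutdown hook is not documented on this page.

```python
from concurrent.futures import ThreadPoolExecutor


class EvaluationPool:
    """Hypothetical wrapper that always shuts its thread pool down on exit."""

    def __init__(self, max_workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # wait=True blocks until running tasks finish before releasing threads.
        self._pool.shutdown(wait=True)
        self.closed = True
        return False  # do not swallow exceptions raised inside the with-block

    def map_all(self, fn, items):
        """Apply fn to every item in parallel and return results in input order."""
        return list(self._pool.map(fn, items))


with EvaluationPool(max_workers=2) as pool:
    lengths = pool.map_all(len, ["accuracy", "coherence"])
print(lengths, pool.closed)
```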
 
Troubleshooting¶
Memory Issues¶
Memory Problems
If you encounter memory-related problems:
- Reduce cache_size
- Decrease max_workers
- Split long responses and evaluate smaller chunks of text
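Splitting a long response into smaller pieces before evaluation could look like the sketch below. `chunk_text` is a hypothetical helper that splits on whitespace; a real pipeline might prefer sentence or token boundaries.

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)  # current chunk is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("one two three four five six", max_chars=9)
print(chunks)
```

Each chunk can then be evaluated separately, at the cost of the judges losing context that spans chunk boundaries.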
 
Performance Problems¶
Performance Issues
To improve performance:
- Increase cache_size if memory allows
- Adjust max_workers to match the available CPU cores
- Use a faster or smaller model via model_name
 
Evaluation Failures¶
Evaluation Issues
When evaluations fail:
- Check that the configured model is available
- Verify that both the task and the model response are non-empty strings
- Monitor error logs
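Checking the inputs before calling `run` helps rule out one common cause of failures. The `validate_inputs` helper below is hypothetical and assumes, based on the parameter list in the API reference, that both arguments should be non-empty strings.

```python
def validate_inputs(task, model_response) -> list[str]:
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if not isinstance(task, str) or not task.strip():
        problems.append("task must be a non-empty string")
    if not isinstance(model_response, str) or not model_response.strip():
        problems.append("model_response must be a non-empty string")
    return problems


ok = validate_inputs("What is a Roth IRA?", "A retirement account.")
bad = validate_inputs("", None)
print(ok, bad)
```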
 
Contributing¶
Contributions are welcome! Please feel free to submit a Pull Request.
License¶
This project is licensed under the MIT License - see the LICENSE file for details.