
LLM-as-Judge Evaluator

Strategy · Advanced

Evaluates LLM-generated text, code, and answers across multi-dimensional criteria including accuracy, relevance, completeness, and tone. Also useful for comparing outputs across models.

Trigger: /judge
Frequency: 1-2x/week

AI engineer? Run /judge to quantitatively compare output quality before and after prompt changes

PM managing AI feature quality? Systematically validate AI response quality before launch

LLM · Evaluation · Quality Assurance · AI

How It Works

Run /judge [evaluation target]

Phase 1: four evaluation axes run in parallel
  • accuracy: Accuracy evaluation
  • relevance: Relevance evaluation
  • completeness: Completeness evaluation
  • tone-check: Tone/style evaluation

Phase 2: aggregate score + improvement suggestions (a scoring sketch follows below)

Output: evaluation scorecard + improvement points
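
The aggregation step amounts to simple arithmetic over the per-axis 1-5 scores. The sketch below is illustrative only and not part of the skill; the AxisScore type, the aggregate helper, and the choice to surface the two weakest axes as improvement points are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class AxisScore:
    axis: str       # e.g. "accuracy", "relevance", "completeness", "tone"
    score: int      # 1-5, as defined in the rubric
    note: str = ""  # judge's observation for this axis


def aggregate(scores: list[AxisScore]) -> dict:
    """Sum per-axis scores and flag the two weakest axes as improvement points."""
    total = sum(s.score for s in scores)
    max_total = 5 * len(scores)
    weakest = sorted(scores, key=lambda s: s.score)[:2]
    return {
        "total": f"{total}/{max_total}",
        "improvement_points": [f"{s.axis}: {s.note}" for s in weakest if s.note],
    }


# Example: four axes scored by the judge in parallel
print(aggregate([
    AxisScore("accuracy", 4, "one unverified claim"),
    AxisScore("relevance", 5),
    AxisScore("completeness", 3, "misses edge cases"),
    AxisScore("tone", 4, "slightly informal"),
]))
# {'total': '16/20', 'improvement_points': ['completeness: misses edge cases', 'accuracy: one unverified claim']}
```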

Skill Code

# LLM-as-Judge Skill

## Trigger: /judge [text or output to evaluate]

When invoked:

1. Establish evaluation criteria:
   - Accuracy: factual correctness (1-5)
   - Relevance: addresses the question (1-5)
   - Completeness: covers all aspects (1-5)
   - Clarity: easy to understand (1-5)
   - Tone: appropriate for context (1-5)
2. If comparing multiple outputs:
   - Apply same rubric to each
   - Generate side-by-side comparison
3. Output format:

---

## ⚖️ LLM Output Evaluation

### Scorecard

| Criteria | Score | Notes |
|----------|-------|-------|
| Accuracy | [X/5] | [specific observation] |
| Relevance | [X/5] | [specific observation] |
| Completeness | [X/5] | [specific observation] |
| Clarity | [X/5] | [specific observation] |
| Tone | [X/5] | [specific observation] |
| **Total** | **[X/25]** | |

### Strengths
- [what the output does well]

### Weaknesses
- [specific issues with evidence]

### Improvement Suggestions
1. [actionable improvement with example rewrite]

### Comparison (if multiple)

| Criteria | Output A | Output B | Winner |
|----------|---------|---------|--------|

---

Copy and paste into your CLAUDE.md to start using immediately.

How LLM-as-Judge Evaluator Works

LLM-as-Judge evaluates AI-generated outputs against a structured rubric (factual accuracy, relevance, completeness, clarity, and tone), scores each dimension independently, and provides comparative rankings when evaluating multiple model outputs.
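
As a rough illustration of the comparative-ranking step, the sketch below assumes each candidate output has already been scored against the same rubric; the rank_outputs helper and the example score values are hypothetical, not part of the skill.

```python
def rank_outputs(scorecards: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Rank labeled outputs by their total rubric score, highest first.

    scorecards maps an output label (e.g. "Output A") to its per-criterion 1-5 scores.
    """
    totals = {label: sum(scores.values()) for label, scores in scorecards.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_outputs({
    "Output A": {"accuracy": 4, "relevance": 5, "completeness": 3, "clarity": 4, "tone": 4},
    "Output B": {"accuracy": 5, "relevance": 4, "completeness": 4, "clarity": 4, "tone": 5},
})
print(ranking)  # [('Output B', 22), ('Output A', 20)]
```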

When to Use LLM-as-Judge Evaluator

Critical for AI product teams that need to evaluate prompt changes, model upgrades, or fine-tuning results systematically — replaces subjective 'vibes-based' evaluation with reproducible, criteria-based scoring.
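
For the before/after prompt-change workflow, one way to make the comparison reproducible is to keep the judge's per-criterion scores for a fixed test prompt and diff them across versions. This is a minimal sketch under that assumption; find_regressions and the sample scores are hypothetical.

```python
def find_regressions(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Return the rubric criteria whose score dropped after a prompt change."""
    return [
        f"{criterion}: {before[criterion]} -> {after.get(criterion, 0)}"
        for criterion in before
        if after.get(criterion, 0) < before[criterion]
    ]


# Scores captured for the same test prompt before and after the change
before = {"accuracy": 4, "relevance": 5, "completeness": 4, "clarity": 5, "tone": 4}
after = {"accuracy": 5, "relevance": 5, "completeness": 3, "clarity": 5, "tone": 4}
print(find_regressions(before, after))  # ['completeness: 4 -> 3']
```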

Key Strengths

  • Evaluates across multiple quality dimensions independently
  • Provides reproducible scoring with structured rubrics
  • Enables comparative ranking of multiple model outputs
  • Replaces subjective evaluation with systematic methodology
