
LLM-as-Judge Evaluator

Strategy · Advanced

Evaluates LLM-generated text, code, and answers across multi-dimensional criteria including accuracy, relevance, completeness, and tone. Also useful for comparing outputs across models.

Trigger: /judge
Frequency: 1-2x/week

AI engineer? Run /judge to quantitatively compare output quality before and after prompt changes

PM managing AI feature quality? Systematically validate AI response quality before launch

LLM · Evaluation · Quality Assurance · AI

How It Works

Run /judge [evaluation target]

Phase 1: four evaluation axes run in parallel
  • accuracy: Accuracy evaluation
  • relevance: Relevance evaluation
  • completeness: Completeness evaluation
  • tone-check: Tone/style evaluation

Phase 2: aggregate score + improvement suggestions (a scoring sketch follows below)

Output: evaluation scorecard + improvement points
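
The aggregation step amounts to simple arithmetic over the per-axis 1-5 scores. The sketch below is illustrative only and not part of the skill; the AxisScore type, the aggregate helper, and the choice to surface the two weakest axes as improvement points are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class AxisScore:
    axis: str       # e.g. "accuracy", "relevance", "completeness", "tone"
    score: int      # 1-5, as defined in the rubric
    note: str = ""  # judge's observation for this axis


def aggregate(scores: list[AxisScore]) -> dict:
    """Sum per-axis scores and flag the two weakest axes as improvement points."""
    total = sum(s.score for s in scores)
    max_total = 5 * len(scores)
    weakest = sorted(scores, key=lambda s: s.score)[:2]
    return {
        "total": f"{total}/{max_total}",
        "improvement_points": [f"{s.axis}: {s.note}" for s in weakest if s.note],
    }


# Example: four axes scored by the judge in parallel
print(aggregate([
    AxisScore("accuracy", 4, "one unverified claim"),
    AxisScore("relevance", 5),
    AxisScore("completeness", 3, "misses edge cases"),
    AxisScore("tone", 4, "slightly informal"),
]))
# {'total': '16/20', 'improvement_points': ['completeness: misses edge cases', 'accuracy: one unverified claim']}
```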

Skill Code

# LLM-as-Judge Skill

## Trigger: /judge [text or output to evaluate]

When invoked:

1. Establish evaluation criteria:
   - Accuracy: factual correctness (1-5)
   - Relevance: addresses the question (1-5)
   - Completeness: covers all aspects (1-5)
   - Clarity: easy to understand (1-5)
   - Tone: appropriate for context (1-5)
2. If comparing multiple outputs:
   - Apply same rubric to each
   - Generate side-by-side comparison
3. Output format:

---

## ⚖️ LLM Output Evaluation

### Scorecard

| Criteria | Score | Notes |
|----------|-------|-------|
| Accuracy | [X/5] | [specific observation] |
| Relevance | [X/5] | [specific observation] |
| Completeness | [X/5] | [specific observation] |
| Clarity | [X/5] | [specific observation] |
| Tone | [X/5] | [specific observation] |
| **Total** | **[X/25]** | |

### Strengths
- [what the output does well]

### Weaknesses
- [specific issues with evidence]

### Improvement Suggestions
1. [actionable improvement with example rewrite]

### Comparison (if multiple)

| Criteria | Output A | Output B | Winner |
|----------|---------|---------|--------|

---

Copy and paste into your CLAUDE.md to start using immediately.

How LLM-as-Judge Evaluator Works

LLM-as-Judge evaluates AI-generated outputs against a structured rubric (factual accuracy, relevance, completeness, clarity, and tone), scores each dimension independently, and provides comparative rankings when evaluating multiple model outputs.
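
As a rough illustration of the comparative-ranking step, the sketch below assumes each candidate output has already been scored against the same rubric; the rank_outputs helper and the example score values are hypothetical, not part of the skill.

```python
def rank_outputs(scorecards: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Rank labeled outputs by their total rubric score, highest first.

    scorecards maps an output label (e.g. "Output A") to its per-criterion 1-5 scores.
    """
    totals = {label: sum(scores.values()) for label, scores in scorecards.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_outputs({
    "Output A": {"accuracy": 4, "relevance": 5, "completeness": 3, "clarity": 4, "tone": 4},
    "Output B": {"accuracy": 5, "relevance": 4, "completeness": 4, "clarity": 4, "tone": 5},
})
print(ranking)  # [('Output B', 22), ('Output A', 20)]
```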

When to Use LLM-as-Judge Evaluator

Critical for AI product teams that need to evaluate prompt changes, model upgrades, or fine-tuning results systematically — replaces subjective 'vibes-based' evaluation with reproducible, criteria-based scoring.
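
For the before/after prompt-change workflow, one way to make the comparison reproducible is to keep the judge's per-criterion scores for a fixed test prompt and diff them across versions. This is a minimal sketch under that assumption; find_regressions and the sample scores are hypothetical.

```python
def find_regressions(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Return the rubric criteria whose score dropped after a prompt change."""
    return [
        f"{criterion}: {before[criterion]} -> {after.get(criterion, 0)}"
        for criterion in before
        if after.get(criterion, 0) < before[criterion]
    ]


# Scores captured for the same test prompt before and after the change
before = {"accuracy": 4, "relevance": 5, "completeness": 4, "clarity": 5, "tone": 4}
after = {"accuracy": 5, "relevance": 5, "completeness": 3, "clarity": 5, "tone": 4}
print(find_regressions(before, after))  # ['completeness: 4 -> 3']
```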

Key Strengths

  • Evaluates across multiple quality dimensions independently
  • Provides reproducible scoring with structured rubrics
  • Enables comparative ranking of multiple model outputs
  • Replaces subjective evaluation with systematic methodology
