LLM-as-Judge Evaluation

Strategy/Planning · Advanced

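Evaluates LLM-generated text, code, answers, and similar output against multiple criteria such as accuracy, relevance, completeness, and tone. It is also useful for comparing the outputs of several models.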

Trigger: /judge

Usage frequency: 1-2 times per week

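For AI engineers: use /judge to quantitatively compare output quality before and after a prompt change.

For PMs managing AI feature quality: systematically verify AI response quality before launch.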

LLM · Evaluation · Quality Management · AI

Workflow

Run /judge [evaluation target]

↓

Phase 1: 4 evaluation axes in parallel

accuracy

Accuracy evaluation

relevance

Skill code

# LLM-as-Judge Skill
## Trigger: /judge [text or output to evaluate]

When invoked:

1. Establish evaluation criteria:
   - Accuracy: factual correctness (1-5)
   - Relevance: addresses the question (1-5)
   - Completeness: covers all aspects (1-5)
   - Clarity: easy to understand (1-5)
   - Tone: appropriate for context (1-5)

2. If comparing multiple outputs:
   - Apply same rubric to each
   - Generate side-by-side comparison

3. Output format:
---
## ⚖️ LLM Output Evaluation

### Scorecard
| Criteria | Score | Notes |
|----------|-------|-------|
| Accuracy | [X/5] | [specific observation] |
| Relevance | [X/5] | [specific observation] |
| Completeness | [X/5] | [specific observation] |
| Clarity | [X/5] | [specific observation] |
| Tone | [X/5] | [specific observation] |
| **Total** | **[X/25]** | |

### Strengths
- [what the output does well]

### Weaknesses
- [specific issues with evidence]

### Improvement Suggestions
1. [actionable improvement with example rewrite]

### Comparison (if multiple)
| Criteria | Output A | Output B | Winner |
|----------|---------|---------|--------|
---

Copy and paste this into CLAUDE.md to use it right away.
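If you want to track /judge results outside the chat, for example to compare scores before and after a prompt change, the rubric arithmetic can be sketched in a few lines of Python. This is a minimal sketch and not part of the skill itself: the names `Evaluation`, `render_scorecard`, and `render_comparison` are hypothetical, and the scores are assumed to have been copied by hand from a /judge scorecard.

```python
from dataclasses import dataclass

# The five criteria from the skill's rubric, each scored 1-5.
CRITERIA = ("Accuracy", "Relevance", "Completeness", "Clarity", "Tone")
MAX_SCORE = 5


@dataclass
class Evaluation:
    """Scores and notes for one output, keyed by criterion name (hypothetical helper)."""
    scores: dict
    notes: dict

    def __post_init__(self) -> None:
        # Reject missing or out-of-range scores so totals stay comparable.
        for name in CRITERIA:
            score = self.scores.get(name)
            if not isinstance(score, int) or not 1 <= score <= MAX_SCORE:
                raise ValueError(f"{name} needs an integer score from 1 to {MAX_SCORE}")

    @property
    def total(self) -> int:
        return sum(self.scores[name] for name in CRITERIA)


def render_scorecard(ev: Evaluation) -> str:
    """Render the Scorecard table in the layout the skill asks for."""
    rows = ["| Criteria | Score | Notes |", "|----------|-------|-------|"]
    for name in CRITERIA:
        rows.append(f"| {name} | {ev.scores[name]}/{MAX_SCORE} | {ev.notes.get(name, '')} |")
    rows.append(f"| **Total** | **{ev.total}/{MAX_SCORE * len(CRITERIA)}** | |")
    return "\n".join(rows)


def render_comparison(a: Evaluation, b: Evaluation) -> str:
    """Render the side-by-side Comparison table, naming a winner per criterion."""
    rows = ["| Criteria | Output A | Output B | Winner |",
            "|----------|---------|---------|--------|"]
    for name in CRITERIA:
        sa, sb = a.scores[name], b.scores[name]
        winner = "A" if sa > sb else "B" if sb > sa else "Tie"
        rows.append(f"| {name} | {sa}/{MAX_SCORE} | {sb}/{MAX_SCORE} | {winner} |")
    return "\n".join(rows)


if __name__ == "__main__":
    # Illustrative scores only; not real evaluation results.
    before = Evaluation(dict(zip(CRITERIA, [3, 4, 3, 4, 5])), notes={})
    after = Evaluation(dict(zip(CRITERIA, [4, 4, 4, 4, 5])), notes={})
    print(render_scorecard(after))
    print(render_comparison(before, after))
```

Keeping the criteria in a single tuple means the scorecard and the comparison table always cover the same axes, which mirrors the skill's instruction to apply the same rubric to each output.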