22 Β· Eval Regression Guard: CI-Safe Prompt Versioning π‘ β
Real-world problem: Your team ships a new system prompt. Before it goes to production you need to know: "Does this change make the agent better, worse, or neutral?" Manual testing misses edge cases. Running a golden dataset + comparing to a saved baseline catches regressions before users do.
runEvalSuite + EvalStore give you a persistent, CI-friendly evaluation loop.
What you'll learn β
runEvalSuiteβ run a labeled dataset, score every sample, detect baseline regressionInMemoryEvalStore/SqliteEvalStoreβ persist runs + baselines across CI jobssetBaseline: trueβ save the first clean run as the comparison targetregressionThresholdβ acceptable score drop before the suite fails- Custom scorer β word-overlap F1 for free-form answers (better than exact match)
EvalReportβ structured report: per-sample scores, delta, pass/fail verdict
The workflow β
βββββββββββββββββββββββββββββββ
New prompt v2 β CI: eval-regression.ts β
candidate ββββββ> β β
β 1. Load golden dataset β
β 2. runEvalSuite(v2) β
β 3. Compare to saved baselineβ
β β
β Score Ξ > 5%? β
β yes β exit 1, block PR β
β no β exit 0, allow mergeβ
βββββββββββββββββββββββββββββββSetup β
import {
InMemoryEvalStore,
SqliteEvalStore,
runEvalSuite,
type EvalDatasetItem,
type EvalReport,
type EvalScorer,
} from 'confused-ai/observe';1 Β· Build a golden dataset β
// data/eval-dataset.json (or inline for small suites)
const DATASET: EvalDatasetItem[] = [
{
input: 'What is the return policy?',
expectedOutput: 'You can return any item within 30 days for a full refund.',
},
{
input: 'How do I track my order?',
expectedOutput: 'Visit the Orders page or check your confirmation email for a tracking link.',
},
{
input: 'Do you offer free shipping?',
expectedOutput: 'Free shipping on all orders over $50.',
},
{
input: 'How do I cancel my subscription?',
expectedOutput: 'Go to Account > Subscriptions > Cancel. Takes effect at end of billing period.',
},
{
input: 'Is my payment information secure?',
expectedOutput: 'Yes. We use PCI-DSS compliant payment processing. We never store card numbers.',
},
];Keep this dataset in version control alongside your prompts. Add new cases as you discover regressions.
2 Β· Run baseline (first time) β
import { createAgent } from 'confused-ai';
const store = new SqliteEvalStore('./evals.db'); // persists between CI runs
const agentV1 = createAgent({
name: 'SupportBot-v1',
model: 'gpt-4o-mini',
instructions: 'You are a helpful customer support agent. Answer questions accurately.',
tools: false,
});
const baseline = await runEvalSuite({
suiteName: 'support-qa', // stable name β used to look up past baselines
dataset: DATASET,
agent: agentV1,
store,
scorer: wordOverlapF1, // see Β§4 below
passingScore: 0.6, // each sample must score β₯ 60%
regressionThreshold: 0.05, // fail if average drops > 5 percentage points
setBaseline: true, // β save this run as the reference point
});
console.log(`Baseline set: ${(baseline.averageScore * 100).toFixed(1)}%`);
// Baseline set: 94.2%3 Β· Regression check (every subsequent CI run) β
const agentV2 = createAgent({
name: 'SupportBot-v2',
model: 'gpt-4o-mini',
instructions: 'You are a support assistant. Be brief.', // β prompt change
tools: false,
});
const report = await runEvalSuite({
suiteName: 'support-qa', // same suite β finds the baseline above
dataset: DATASET,
agent: agentV2,
store,
scorer: wordOverlapF1,
passingScore: 0.6,
regressionThreshold: 0.05, // β fail if score drops > 5% from baseline
onSample: (i, total, s) => {
process.stdout.write(` [${i}/${total}] ${s.input.slice(0, 50)}\r`);
},
});
// EvalReport shape:
// {
// suiteRunId: 'run_abc123',
// suiteName: 'support-qa',
// averageScore: 0.81, // 81%
// passedCount: 3,
// totalCount: 5,
// passed: false, // β regression: 94.2% β 81% = Ξ -13.2%
// regressionDelta: -0.132,
// baselineScore: 0.942,
// samples: [...],
// }
if (!report.passed) {
console.error(`Regression detected! Score: ${(report.averageScore * 100).toFixed(1)}%`);
console.error(`Baseline: ${(report.baselineScore! * 100).toFixed(1)}%`);
console.error(`Delta: ${(report.regressionDelta! * 100).toFixed(1)}%`);
process.exit(1); // β blocks the PR in CI
}4 Β· Custom scorer: word-overlap F1 β
Exact match (===) is too strict for free-form support answers. Word-overlap F1 rewards partial matches:
import type { EvalScorer } from 'confused-ai/observe';
function tokenize(text: string): Set<string> {
return new Set(
text.toLowerCase()
.replace(/[^a-z0-9\s]/g, ' ')
.split(/\s+/)
.filter(Boolean),
);
}
const wordOverlapF1: EvalScorer = (_input, expected, actual) => {
if (!expected) return 0.5; // no label β neutral score
const expTokens = tokenize(expected);
const actTokens = tokenize(actual);
let overlap = 0;
for (const t of actTokens) if (expTokens.has(t)) overlap++;
const precision = actTokens.size ? overlap / actTokens.size : 0;
const recall = expTokens.size ? overlap / expTokens.size : 0;
return (precision + recall) ? (2 * precision * recall) / (precision + recall) : 0;
};Other scorer patterns:
// ROUGE-L word overlap
import { rougeLWords } from 'confused-ai';
const rougeScorer: EvalScorer = (_, expected, actual) =>
expected ? rougeLWords(actual, expected) : 0.5;
// LLM-as-judge (semantic, handles paraphrases)
import { runLlmAsJudge } from 'confused-ai';
const judgeScorer: EvalScorer = async (input, expected, actual) => {
const { score } = await runLlmAsJudge({
llm: judgeModel,
rubric: 'Is the response accurate and helpful?',
candidate: actual,
reference: expected,
maxScore: 10,
});
return score / 10; // normalize to 0β1
};5 Β· Print a readable report β
function printReport(report: EvalReport) {
const pct = (n: number) => `${(n * 100).toFixed(1)}%`;
const arrow = report.regressionDelta !== null
? (report.regressionDelta >= 0 ? 'β' : 'β') + pct(Math.abs(report.regressionDelta))
: 'no baseline';
console.log(`Suite : ${report.suiteName}`);
console.log(`Score : ${pct(report.averageScore)} (${report.passedCount}/${report.totalCount} passed)`);
console.log(`Baseline : ${report.baselineScore !== null ? pct(report.baselineScore) : 'none'}`);
console.log(`Delta : ${arrow}`);
console.log(`Status : ${report.passed ? 'β
PASSED' : 'β REGRESSION'}`);
for (const s of report.samples) {
const icon = s.passed ? 'β' : 'β';
console.log(` [${icon}] ${pct(s.score).padEnd(6)} ${s.input.slice(0, 60)}`);
}
}6 Β· Persistent history (SQLite) β
import { SqliteEvalStore } from 'confused-ai/observe';
const store = SqliteEvalStore.create('./evals.db');
// or
const store = createSqliteEvalStore('./evals.db');Query past runs:
const runs = await store.queryRuns('support-qa', 20);
for (const run of runs) {
const flag = run.isBaseline ? ' [BASELINE]' : '';
console.log(`${run.id.slice(0, 8)} ${(run.averageScore * 100).toFixed(1)}% ${run.timestamp}${flag}`);
}7 Β· GitHub Actions CI workflow β
# .github/workflows/eval.yml
name: Eval Regression Check
on:
pull_request:
paths:
- 'src/prompts/**'
- 'src/agents/**'
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Restore eval database
uses: actions/cache@v4
with:
path: ./evals.db
key: eval-db-${{ github.base_ref }}
- name: Run eval suite
run: bun examples/eval-regression.ts
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
EXIT_ON_REGRESSION: '1'
- name: Save eval database
uses: actions/cache@v4
with:
path: ./evals.db
key: eval-db-${{ github.base_ref }}The SQLite file is cached per branch β your baseline survives across CI runs until you explicitly rotate it with setBaseline: true.
8 Β· Updating the baseline β
When you intentionally improve the prompt, update the baseline:
// promote-baseline.ts β run manually after a validated improvement
const report = await runEvalSuite({
suiteName: 'support-qa',
dataset: DATASET,
agent: agentV3,
store,
scorer: wordOverlapF1,
setBaseline: true, // β explicit promotion
});
console.log(`New baseline: ${(report.averageScore * 100).toFixed(1)}%`);Runnable example β
bun examples/eval-regression.tsRuns three back-to-back suites (v1 baseline, v2 regression, v2 fixed) using MockLLMProvider β no API key needed. Prints per-sample scores and the final CI verdict.
Related β
- Observability & Hooks β LLM-as-judge, OTLP export, Langfuse batching
- Production Resilience β circuit breaker, budget enforcement
- Full framework showcase β see eval in a complete system
- Guide: Observability β full
runEvalSuite,EvalStoreAPI reference