
Evaluation & Guardrails
the part nobody ships on time.

continuous · adversarial · auditable

An evaluation suite that runs against every prompt change, every model swap, every shadow deploy. PII redaction, jailbreak detection, drift monitoring. We don't ship without it; we don't recommend you do either.

// 01 intent

Most teams ship the prompt and skip the eval.

The eval is the contract between your model and your users. Without it, you can't tell if today's prompt is better than yesterday's, and you definitely can't tell if a model migration is safe. We write the eval first, then the prompt.

// 02 capabilities

What we actually build.

▣

Golden sets

A curated corpus of 60–500 cases, drawn from your real traffic and edge cases. Versioned with the system.

braintrust · mongo

▤

LLM-as-judge

Judges with rubrics, calibrated against human labels. We measure judge-to-human agreement and report it.

claude-judge
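One way to put a number on judge-to-human agreement is Cohen's kappa, which corrects raw agreement for chance. The page does not say which statistic is reported, so treat the choice of kappa, the `cohen_kappa` helper, and the sample labels below as illustrative assumptions.

```python
from collections import Counter

def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Fraction of cases where judge and human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected if both raters labeled independently at their own rates.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels for six golden cases.
judge = ["pass", "pass", "fail", "pass", "fail", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "pass"]
kappa = cohen_kappa(judge, human)
print(f"judge-to-human kappa: {kappa:.2f}")
```

Here raw agreement is 5 of 6, but half of that would be expected by chance, so kappa comes out lower than the raw rate. Reporting both keeps a lenient judge from looking better than it is.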

▦

Adversarial probes

Red-team queries, prompt injection attempts, jailbreak suites. Run on every PR; surface novel attacks weekly.

promptbench · gandalf

▥

Drift detection

Distributional checks on inputs, outputs, latency, cost. Drift alerts route to on-call before users notice.

evidently · grafana
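A distributional check can be as small as a Population Stability Index (PSI) over fixed bins. The sketch below is stdlib-only and is not how Evidently or any specific tool implements drift; the bin edges, the latency samples, and the 0.2 alert threshold (a common rule of thumb) are assumptions for illustration.

```python
import math

def psi(baseline, current, edges):
    """Population Stability Index of `current` vs `baseline`.

    `edges` must be sorted ascending; they split values into len(edges)+1 bins.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        # Floor each share so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

# Hypothetical latency samples in seconds.
edges = [1.0, 2.0, 4.0]
last_week = [0.8, 1.0, 1.2, 1.1, 3.5, 1.3, 0.9, 1.4]
today = [2.5, 3.0, 3.8, 4.5, 2.2, 5.0, 1.5, 3.1]

score = psi(last_week, today, edges)
if score > 0.2:  # assumed alert threshold
    print(f"drift alert: PSI={score:.2f}")
```

The same function works for output length, cost per call, or any other numeric signal; categorical signals such as refusal versus answer need per-category shares instead of bins.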

▧

Policy guardrails

PII redaction, PHI handling, output schema enforcement, refusal calibration. Tested in the eval suite, not bolted on later.

guardrails · nemo
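Testing redaction inside the eval suite can look like an ordinary assertion: redact, then check that nothing recognizable survives. This is a minimal sketch with two hypothetical regex patterns (emails and one phone format); it does not use the Guardrails or NeMo APIs, and real PII coverage needs far more than two patterns.

```python
import re

# Illustrative patterns only; not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def leaks_pii(text):
    """True if any known pattern is still present in the text."""
    return bool(EMAIL.search(text) or PHONE.search(text))

raw = "Reach me at ana@example.com or 555-010-4477."
clean = redact(raw)
print(clean)
```

An eval case for `pii_redaction` would then pass only if `leaks_pii` is false on the model's final output.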

▨

Shadow deploy

New model? New prompt? Mirror traffic before cutover, score in real time, gate the cutover on the score.

temporal · langsmith

// 03 artifact

A peek at real output.

eval-run · main · #4429 · pull-request CI · neuronio.ai

› neuronio eval run --suite main --pr 4429 --model claude-sonnet-4.5

SUITE main · 142 cases · 4 judges · seed=42
▣ correctness          138/142   97.2%  ▲ +0.8 vs main
▣ groundedness         140/142   98.6%  — no change
▣ tone                 142/142  100.0%  — no change
▣ refusal_calibration  131/142   92.3%  ▼ -2.1 vs main ⚠
▣ pii_redaction        142/142  100.0%  — no change

ADVERSARIAL 38 probes
▣ jailbreak   36/38   94.7%
▣ injection   38/38  100.0%
▣ data_exfil  38/38  100.0%

LATENCY p50=1.2s p95=3.8s p99=6.1s
COST $0.0042/call ▼ -18% vs main

VERDICT FAIL // refusal_calibration regressed past threshold
             // see: evals/refusal/cases/{C-091, C-114, C-127}
CI deploy blocked // awaiting fix or override w/ rationale
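The verdict logic in a run like this can be sketched as a per-metric regression gate. The run does not state the threshold, so `max_drop=1.0` (percentage points) is an assumption, as is the `gate` function itself; the baseline scores are back-computed from the deltas shown in the run.

```python
def gate(current, baseline, max_drop=1.0):
    """Return (verdict, regressions) for per-metric pass rates in percent.

    A metric regresses when it falls more than `max_drop` points below baseline.
    """
    regressions = {}
    for metric, score in current.items():
        delta = round(score - baseline[metric], 1)
        if delta < -max_drop:
            regressions[metric] = delta
    return ("FAIL" if regressions else "PASS"), regressions

# Scores from the PR run above.
current = {"correctness": 97.2, "groundedness": 98.6, "tone": 100.0,
           "refusal_calibration": 92.3, "pii_redaction": 100.0}
# Scores on main, derived from the reported deltas.
baseline = {"correctness": 96.4, "groundedness": 98.6, "tone": 100.0,
            "refusal_calibration": 94.4, "pii_redaction": 100.0}

verdict, regressions = gate(current, baseline)
print(verdict, regressions)
```

Gating on the delta rather than the absolute score is what lets a 92.3% metric block a deploy while the cost and correctness improvements are still reported.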

// 04 deliverables

What lands in your repo.

Eval suite

Golden cases, judges, rubrics. Versioned in your repo, runnable locally and in CI.

Adversarial pack

Red-team probes, jailbreak suites, injection corpus. Updated monthly with novel attacks.

CI integration

GitHub Actions / GitLab / Jenkins. PRs gated on eval; comments include score deltas.

Live monitors

Drift, latency, cost, refusal-rate dashboards. PagerDuty / Slack hookups.

Audit pack

Every eval run signed and stored. Reproducible by your auditors with one command.
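The page does not describe the signing scheme, so the following is only one plausible shape: an HMAC-SHA256 over a canonical JSON encoding of the run record. The record fields, the `sign_run` and `verify_run` helpers, and the demo key are hypothetical; a real audit trail would more likely use asymmetric signatures and managed keys.

```python
import hashlib
import hmac
import json

def sign_run(record, key):
    """Hex HMAC-SHA256 over a canonical (sorted-key, compact) JSON encoding."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_run(record, signature, key):
    """Constant-time check that the record still matches its signature."""
    return hmac.compare_digest(sign_run(record, key), signature)

key = b"demo-key-not-for-production"  # assumption: key comes from a secret store
record = {"suite": "main", "pr": 4429, "seed": 42, "verdict": "FAIL"}
signature = sign_run(record, key)

# Any edit to the stored record invalidates the signature.
tampered = dict(record, verdict="PASS")
print(verify_run(record, signature, key), verify_run(tampered, signature, key))
```

Storing the seed and suite version in the signed record is what makes the run reproducible as well as tamper-evident.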

// 05 questions

Things people actually ask.

Q-01 · We already have unit tests. Why do we need this?

Unit tests check shapes; evals check semantics. A response can pass every schema and type assertion and still be wrong, ungrounded, or off-tone. Evals score exactly those properties.

Q-02 · How big should the golden set be?

Around 60–80 cases gets you signal; 200–500 is comfortable. We start small, expand from real-traffic traces, and prune cases that stop discriminating.

Q-03 · Doesn't running evals on every PR get expensive?

Smart sampling, judge caching, and a tier system (cheap/full) bring CI eval cost to roughly $1–4 per PR for typical suites. Cheaper than a regression.
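Judge caching is the easiest of these savings to picture: if the judge, rubric version, case input, and model output are all unchanged, the previous verdict can be reused instead of paying for another call. The sketch below uses an in-memory dict and a stand-in judge function; `JudgeCache`, `fake_judge`, and the key layout are assumptions, not the actual implementation.

```python
import hashlib
import json

class JudgeCache:
    """Memoizes judge scores keyed on everything that could change the verdict."""

    def __init__(self, judge_fn):
        self._judge_fn = judge_fn
        self._store = {}
        self.calls = 0  # number of real (uncached) judge invocations

    def score(self, judge_id, rubric_version, case_input, output):
        raw = json.dumps([judge_id, rubric_version, case_input, output])
        key = hashlib.sha256(raw.encode()).hexdigest()
        if key not in self._store:
            self.calls += 1
            self._store[key] = self._judge_fn(case_input, output)
        return self._store[key]

def fake_judge(case_input, output):
    # Stand-in for a paid LLM judge call.
    return 1.0 if output.strip() else 0.0

cache = JudgeCache(fake_judge)
first = cache.score("claude-judge", "v1", "What is 2+2?", "4")
second = cache.score("claude-judge", "v1", "What is 2+2?", "4")  # cache hit
third = cache.score("claude-judge", "v2", "What is 2+2?", "4")   # new rubric, re-judged
print(cache.calls)
```

Including the rubric version in the key matters: a cached verdict from an older rubric would otherwise mask the effect of the rubric change.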

Q-04 · Can we use the eval suite to compare models?

That's the point. Side-by-side scoring across models is a one-line invocation. We've used it to talk teams out of paying 4× for frontier models that didn't help on their suite.

Q-05 · Do you do red-teaming as a standalone?

Yes. We run a 2-week red-team engagement that ends with a written report, a corpus, and a fix list ranked by severity.

Tell us the work. We'll tell you the agent.
