

How BEval Studio Evaluates Your LLM — Deterministic Checks, LLM Judges, and Custom Rubrics

A technical walkthrough of BEval Studio's three-layer evaluation engine: synchronous deterministic checks on every log, sampled LLM judge scoring, and tenant-configurable rubrics.

Bolder Team

Most LLM evaluation is either too slow or too shallow

Teams monitoring LLM systems in production tend to fall into one of two traps.

The first trap is vibes-based monitoring. Someone spot-checks a handful of outputs every week, flags the obviously bad ones, and assumes the rest are fine. This works until it doesn't — and when it stops working, the failure has usually been compounding silently for days.

The second trap is over-reliance on LLM-as-a-judge. Every single output gets sent to another model for scoring. It's expensive, it's slow, and it adds latency to your feedback loop. Worse, it gives you a false sense of coverage — an LLM judge can miss the same systematic failure mode hundreds of times if it wasn't prompted to look for it.

BEval Studio avoids both traps by running evaluation in three layers, each with a different cost and coverage profile.

Layer 1: Deterministic checks — every log, zero LLM calls

The first layer runs synchronously on every log entry the moment it's ingested. These are fast, deterministic checks that catch structural failures without any model call:

  • Required field presence — did the model return all the fields your schema expects?
  • Type validation — is the date_of_birth field actually a date, or did the model return a string like "around 1985"?
  • Date format enforcement — YYYY-MM-DD, not "March 12th" or "12/03/85"
  • Empty array detection — the model said it found medications, but the array is empty
  • Plausibility checks — an extracted age of 847 or a date in the year 2089
  • Empty string detection — fields that exist but contain nothing useful

These checks produce a deterministic score (det_score, 0.0 to 1.0) and a pass/fail breakdown per check. The results land in your dashboard immediately — no queue, no delay, no cost.
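As a rough sketch of what this layer does, here is the shape of a deterministic check pass in Python. The field names, the specific checks, and the equal-weight scoring are illustrative assumptions for this sketch, not BEval Studio's actual check set:

```python
from datetime import date

# Hypothetical schema fields for a medical-extraction use case.
REQUIRED_FIELDS = ["patient_name", "date_of_birth", "medications"]

def run_det_checks(output: dict) -> dict:
    """Synchronous structural checks -- no model call, runs on every log."""
    age = output.get("age", 0)
    checks = {
        "required_fields": all(f in output for f in REQUIRED_FIELDS),
        "dob_is_date": isinstance(output.get("date_of_birth"), date),
        "medications_nonempty": bool(output.get("medications")),
        "age_plausible": isinstance(age, (int, float)) and 0 <= age <= 130,
        "no_empty_strings": all(v != "" for v in output.values()),
    }
    # One simple scoring choice: the fraction of checks that passed.
    det_score = sum(checks.values()) / len(checks)
    return {
        "det_score": det_score,
        "det_passed": all(checks.values()),
        "checks": checks,
    }
```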

This layer alone catches a surprising percentage of production failures. Most teams we work with find that 30-50% of their quality issues are structural, not semantic. A missing field or a malformed date is not a subtle problem — it's a hard failure that breaks downstream systems. Catching these instantly, on every single log, is the foundation everything else builds on.

Layer 2: LLM judges — sampled, not exhaustive

The second layer dispatches selected logs to LLM judges for deeper evaluation. The key word is selected. Not every log needs an LLM judge. Running judges on every output is wasteful when most outputs are fine.

BEval Studio's sampling logic decides which logs get judged:

  • Always if the deterministic checks failed (det_passed == false)
  • Always if the deterministic score is below 0.9
  • Random 2% baseline for drift detection on logs that look structurally fine

The baseline sample rate is configurable per tenant — if you want 5% or 10%, you set it. The point is that you're not paying for LLM judge calls on outputs that already passed every structural check. You're focusing judge capacity on the outputs that need scrutiny and maintaining a random sample to catch drift.
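The decision itself is small enough to sketch in a few lines. A minimal version, using the thresholds from the rules above (the function name and signature are assumptions for illustration):

```python
import random

def should_judge(det_passed: bool, det_score: float,
                 baseline_rate: float = 0.02) -> bool:
    """Decide whether a log is dispatched to LLM judges."""
    if not det_passed:       # always judge logs that failed a structural check
        return True
    if det_score < 0.9:      # always judge borderline deterministic scores
        return True
    # Random baseline sample on structurally clean logs, to catch drift.
    return random.random() < baseline_rate
```

Raising baseline_rate to 0.05 or 0.10 is the per-tenant knob described above.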

By default, judges evaluate three dimensions:

  1. Hallucination — are the extracted values actually traceable to the source input, or did the model invent information?
  2. Completeness — did the model capture everything present in the source, or did it silently drop fields?
  3. Semantic accuracy — are the values correct? Did it get the right date, the right dosage, the right diagnosis?

Each dimension produces a score and a reasoning trace explaining the judgment. These land in the log detail view as expandable rows with score bars and full explanations.
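Concretely, a judged log might carry a result shaped something like this. The field names and scores are invented for illustration; the three dimensions and the reasoning traces are what the dashboard renders:

```python
# Illustrative judge output for a single log -- not BEval's exact schema.
judge_result = {
    "hallucination": {
        "score": 0.95,
        "reasoning": "All extracted values trace back to the source document.",
    },
    "completeness": {
        "score": 0.70,
        "reasoning": "The allergies section of the source was not captured.",
    },
    "semantic_accuracy": {
        "score": 0.90,
        "reasoning": "Dosages match the source; one admission date is off by a day.",
    },
}
```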

The default judge model is Claude Haiku — fast and cheap enough for high-volume sampling. But every tenant can override this with their own judge configuration: pick a different model, change which dimensions are evaluated, or provide a custom system prompt that aligns the judge with your domain.
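A tenant override might look roughly like the following; the keys, and the idea of expressing it as a plain config object, are assumptions for this sketch:

```python
# Hypothetical tenant-level judge configuration.
judge_config = {
    # Placeholder model id -- swap the default Haiku judge for another model.
    "model": "claude-sonnet-4-5",
    # Evaluate only the dimensions that matter for this use case.
    "dimensions": ["hallucination", "completeness"],
    # Align the judge with the tenant's domain.
    "system_prompt": (
        "You are evaluating extractions from cardiology discharge summaries. "
        "Treat any medication absent from the source as a hallucination."
    ),
}
```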

Layer 3: Custom rubrics — your criteria, scored automatically

The third layer is rubrics. A rubric is a named, reusable set of evaluation criteria that you define:

Medical Extraction v1
├── Patient demographics correct  (weight: 0.3, passing: 0.8)
├── Medication list complete      (weight: 0.25, passing: 0.7)
├── Dates in correct format       (weight: 0.2, passing: 0.9)
├── No hallucinated conditions    (weight: 0.15, passing: 0.85)
└── Dosage values plausible       (weight: 0.1, passing: 0.8)

Each criterion has a weight (how much it matters relative to the others) and a passing threshold. When a rubric is attached to a log or an eval run, BEval scores the output against every criterion and produces a per-criterion breakdown with scores and reasoning.
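Scoring under that structure is a weighted sum with per-criterion pass thresholds. A minimal sketch, assuming criterion scores come back from the judge in the 0.0 to 1.0 range:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float   # relative importance; weights sum to 1.0
    passing: float  # per-criterion pass threshold

# The "Medical Extraction v1" rubric from above.
RUBRIC = [
    Criterion("Patient demographics correct", 0.30, 0.80),
    Criterion("Medication list complete",     0.25, 0.70),
    Criterion("Dates in correct format",      0.20, 0.90),
    Criterion("No hallucinated conditions",   0.15, 0.85),
    Criterion("Dosage values plausible",      0.10, 0.80),
]

def score_rubric(criterion_scores: dict[str, float]) -> dict:
    """Weighted overall score plus a per-criterion pass/fail breakdown."""
    breakdown = [
        {
            "criterion": c.name,
            "score": criterion_scores[c.name],
            "passed": criterion_scores[c.name] >= c.passing,
        }
        for c in RUBRIC
    ]
    overall = sum(c.weight * criterion_scores[c.name] for c in RUBRIC)
    return {"overall": overall, "breakdown": breakdown}
```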

Rubrics are tenant-scoped. You can have as many as you need — one per use case, one per document type, one per model version you're testing. They're managed through the Rubrics page in the dashboard, not in code.

This is where BEval becomes genuinely useful for teams that have domain expertise. You know what "good" looks like for your system. Rubrics let you encode that knowledge into a repeatable, automated evaluation that runs at scale.

How the layers work together

The three layers aren't alternatives — they're sequential.

When a log is ingested:

  1. Deterministic checks run immediately, synchronously. You get det_score and det_passed in the response.
  2. The sampling logic evaluates whether this log should go to judges. If yes, the judge task is queued and runs asynchronously in the background.
  3. If a rubric is attached (either at ingest time or via an eval run), rubric scoring runs alongside the judge evaluation.
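Stitched together, the ingest path looks roughly like this, reusing the sketches from earlier. The queue_task helper and the response shape are assumptions; the synchronous/asynchronous split is the point:

```python
def ingest(log: dict, rubric_id: str | None = None) -> dict:
    # 1. Synchronous: deterministic checks run inline; the caller gets
    #    det_score and det_passed back in the ingest response.
    det = run_det_checks(log["output"])

    # 2. Asynchronous: sampling decides whether judges see this log.
    if should_judge(det["det_passed"], det["det_score"]):
        queue_task("judge_eval", log_id=log["id"])  # hypothetical task queue

    # 3. Asynchronous: rubric scoring runs alongside the judge pass.
    if rubric_id is not None:
        queue_task("rubric_eval", log_id=log["id"], rubric_id=rubric_id)

    return {"det_score": det["det_score"], "det_passed": det["det_passed"]}
```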

In the dashboard, you see all three layers for every log:

  • Det checks as a pass/fail list with flag details
  • Judge scores as horizontal bars per dimension with expandable reasoning
  • Rubric results as per-criterion breakdowns with pass/fail indicators

The home dashboard aggregates these across your entire traffic: average det score, average judge score, rubric pass rates, failure type breakdowns, and all of it sliceable by timeframe.

What this means in practice

A typical production day might look like this:

  • 10,000 logs ingested
  • 10,000 deterministic check results (instant)
  • 400 logs sent to judges (200 from failed det checks + 200 from random sample)
  • 400 judge evaluations completed within minutes
  • Dashboard shows det_score trending down 3% on a specific kind of log — investigation reveals a prompt regression introduced that morning

You didn't have to manually review anything. You didn't have to run judges on all 10,000 logs. The system surfaced the degradation automatically, along with the specific dimension (hallucination, completeness, or accuracy) driving it.

That's the evaluation engine. It's not one technique — it's three layers working together, each optimized for a different failure class, all feeding into the same dashboard.


If you're shipping LLM outputs to production and want evaluation that actually scales, book a call. We'll walk through what BEval looks like against your system.
