The evaluation dataset problem
Every team that evaluates LLMs hits the same bottleneck: where do the test cases come from?
If you build test cases by hand, they're biased toward the scenarios you already thought of. They don't reflect the distribution of real production queries — the edge cases, the malformed inputs, the queries that combine three document types you didn't expect to see together.
If you use synthetic data, you're evaluating your model on inputs generated by another model. The failure modes of synthetic test sets rarely overlap with the failure modes of real traffic.
The best evaluation datasets come from production. The queries your system actually handles. The outputs it actually produces. The failures your users actually encounter. But getting production data into a structured eval pipeline is usually a manual, painful process — export logs, clean them, format them, upload them somewhere, hope the schema matches.
BEval Studio makes this a two-click workflow.
Curating datasets from production logs
Every log that flows through BEval Studio is a potential data point. The platform gives you three ways to turn production logs into eval datasets:
From the log list. Browse your production logs, filter by kind, status, score range, date, or review status. When you find logs worth testing against, check the boxes. A floating action bar appears: 3 selected - Add to Dataset - Clear. Click "Add to Dataset", pick or create a dataset, and you're done.
From a single log detail. When you're reviewing a specific log and realize it's a good test case — maybe it's a tricky edge case the model handled well, or a failure you want to regression-test against — hit "Add to Dataset" in the top-right corner.
From the dataset page itself. Open a dataset, click "Add from Logs", and a full log browser opens in a drawer. Same filters as the main log list, multi-select, confirm. The logs are added directly to the current dataset without any intermediate steps.
Every data point created this way carries clear provenance. In the dataset view, you'll see a FROM LOG badge linking back to the original log entry. You always know where a test case came from and can trace it back to the exact production event.
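All three entry points boil down to the same underlying operation: copy the selected logs into a dataset and stamp each new point with its origin. As a rough sketch in Python (the function and field names here are illustrative, not BEval's actual internals):

```python
from datetime import datetime, timezone

def add_logs_to_dataset(dataset: dict, logs: list[dict]) -> None:
    """Copy selected production logs into a dataset as new data points.

    Each point is stamped with provenance pointing back to its source log,
    which is what the FROM LOG badge renders in the dataset view.
    """
    for log in logs:
        dataset["points"].append({
            "input": log["input"],
            "expected_output": None,  # logs carry real outputs, not ground truth
            "metadata": {"kind": log["kind"], "status": log["status"]},
            "provenance": {
                "source": "log",
                "log_id": log["id"],
                "added_at": datetime.now(timezone.utc).isoformat(),
            },
        })
```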
Uploading external test sets
Not all evaluation data comes from production. Some comes from manual annotation, research collaborations, or domain experts who maintain curated test sets in spreadsheets.
BEval Studio accepts three formats: JSON, JSONL, and CSV.
The upload flow groups files into batches — a named container for related uploads. Within a batch, each file is tracked individually with its own status: pending, processing, done, or failed. If a file fails (bad format, schema mismatch), you can retry it without re-uploading the entire batch.
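You can picture a batch as a container of per-file records, each advancing through those statuses independently. A minimal sketch, with assumed names:

```python
from enum import Enum

class FileStatus(Enum):
    """Per-file states named in the upload flow."""
    PENDING = "pending"
    PROCESSING = "processing"
    DONE = "done"
    FAILED = "failed"

# A batch tracks every file independently, so retrying one failure
# never re-uploads files that already processed cleanly.
batch = {
    "name": "legal-annotations-q3",   # illustrative batch name
    "files": [
        {"name": "leases.jsonl", "status": FileStatus.DONE, "rows": 412},
        {"name": "ndas.csv", "status": FileStatus.FAILED,
         "error": "schema mismatch: no input column mapped"},
    ],
}

def retry_file(file: dict) -> None:
    """Hypothetical retry: only the failed file goes back to the queue."""
    if file["status"] is FileStatus.FAILED:
        file["status"] = FileStatus.PENDING
```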
During upload, you map your file's fields to BEval's schema; a worked example follows the list:
- Which column or key is the input (the prompt or source text)?
- Which column or key is the expected output (ground truth, if you have it)?
- Everything else folds into metadata automatically — tags, version labels, annotator notes, whatever your file contains.
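Here's what that mapping might look like for a JSONL test set (the file contents and mapping keys are made up for illustration):

```python
import json

# One line of a hypothetical JSONL test set from a domain expert:
row = json.loads('{"question": "What is the notice period?", '
                 '"answer": "30 days", "doc_type": "lease", "annotator": "mk"}')

# The mapping you choose during upload (field names are illustrative):
mapping = {"input": "question", "expected_output": "answer"}

point = {
    "input": row.pop(mapping["input"]),
    "expected_output": row.pop(mapping["expected_output"]),
    "metadata": row,  # everything left over: doc_type, annotator
}
```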
Uploaded data points carry an UPLOAD badge showing the batch name and source file. Same provenance story as log-sourced points — you always know the origin.
What a dataset looks like
A dataset in BEval Studio is a named container of data points. Each point has the following fields; a concrete sketch follows the list:
- Input — the prompt, query, or source document (stored as structured JSON, supports multimodal via attachments)
- Expected output — the ground-truth answer (optional; you can run evals without it)
- Metadata — free-form key-value pairs rendered as collapsible chips, not raw JSON
- Provenance — exactly where this point came from and when it was added
- Attachments — images, PDFs, or audio files for multimodal inputs
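Concretely, a single log-sourced point might serialize to something like this (field names are illustrative, not BEval's internal schema):

```python
point = {
    "input": {"query": "Summarize the indemnity clause", "doc_id": "doc_91"},
    "expected_output": None,           # optional; evals can run without it
    "metadata": {"kind": "summarization", "review": "approved"},
    "provenance": {
        "source": "log",               # rendered as the FROM LOG badge
        "log_id": "log_5521",
        "added_at": "2024-11-02T14:03:00Z",
    },
    "attachments": [{"type": "pdf", "name": "lease.pdf"}],
}
```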
The dataset detail page has three tabs:
Points shows every data point with its provenance badge, input preview, expected output indicator, metadata chips, and attachment count. You can filter by source type and search across metadata.
Uploads shows the history of file uploads — batches, individual files within each batch, their processing status, row counts, and any errors. Failed files have a retry button.
Runs shows every eval run that's been executed against this dataset. Which model, which rubric, what score, when it ran.
Running evaluations
Once you have a dataset, you can run it against any model you've connected through BEval Studio's model integrations.
Click "Run Eval" on a dataset. Pick a connected model. Optionally attach a rubric for structured scoring. Confirm.
The eval run dispatches asynchronously: each data point's input is sent to the selected model, the output is captured, and if a rubric is attached, judges score the result. Results stream in per data point, so you see them arrive without waiting for the entire run to complete.
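A minimal sketch of that dispatch pattern, assuming a call_model coroutine and an optional judge (both hypothetical stand-ins for BEval's internals):

```python
import asyncio
import time

async def run_eval(points, call_model, judge=None):
    """Sketch of the async dispatch: one task per data point, results
    yielded as each completes rather than after the whole run."""

    async def run_point(point):
        start = time.perf_counter()
        output = await call_model(point["input"])    # send input to the model
        latency = time.perf_counter() - start
        score = await judge(point, output) if judge else None
        return {"point": point, "output": output,
                "score": score, "latency": latency}

    tasks = [asyncio.create_task(run_point(p)) for p in points]
    for finished in asyncio.as_completed(tasks):     # per-point streaming
        yield await finished
```

A caller consumes this with async for, which is what lets the run detail page fill in per-point rows while slower points are still in flight.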
The eval run detail page shows:
- Overall progress (completed points out of total)
- Aggregate scores
- Per-data-point results: input, model output, score bar, latency, judge dimension breakdowns
- Any failures with error messages
This is where the dataset pipeline pays off. You're not testing against synthetic benchmarks. You're testing against the actual queries your system handles in production, curated by the people who understand your domain, evaluated with your rubrics and your judge configurations.
The feedback loop
The real power is the loop this creates:
- Your system runs in production. BEval captures every log and evaluates it.
- You spot a failure class — maybe hallucination on a specific document type, or missing fields on a new input format.
- You cherry-pick those failing logs into an eval dataset.
- You fix your prompt, swap your model, or adjust your pipeline.
- You run the eval dataset against the new configuration.
- You see, per data point, whether the fix actually worked — not on synthetic data, but on the exact production cases that failed before.
This is regression testing for LLMs. Not abstract benchmarks. Not vibes. Structured, repeatable evaluation against real production data, with full provenance and scoring at every step.
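Sketched as a script against a hypothetical client (none of these names are BEval's real API; the product workflow is the UI described above):

```python
from beval_client import BEval  # hypothetical SDK; every name below is illustrative

client = BEval()

# 1. Cherry-pick the failing production logs into a dataset.
failing = client.logs.list(kind="extraction", status="failed")
dataset = client.datasets.create("missing-fields-regressions")
dataset.add_logs(failing)

# 2. Re-run that dataset against the new configuration.
run = dataset.run_eval(model="pipeline-v2", rubric="field-completeness")

# 3. Check, per data point, whether the fix held on the exact cases
#    that failed before.
for result in run.results():
    print(result.point_id, result.score)
```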
If you want to stop guessing whether your LLM changes actually improved things, book a call. We'll show you what the dataset-to-eval pipeline looks like on your production data.
