bolder-ai 0.1: The Python SDK for BEval Studio is Live on PyPI

Instrument any LLM, VLM, or agent pipeline in three lines of Python. Fire-and-forget, non-blocking, and ships with auto-wrappers for OpenAI and Anthropic.

Bolder Team
What shipped

bolder-ai 0.1.0 is live on PyPI:

pip install bolder-ai

It's the official Python SDK for BEval Studio — the hosted evaluation platform for production LLM systems. Install it, set an API key, and every LLM call your app makes shows up in your BEval dashboard alongside latency, token counts, cost, and status.

The distribution name is bolder-ai. The import name is beval.

import beval

beval.init()  # reads BEVAL_API_KEY from env

The three ways to use it

1. Log anything directly

beval.log(
    kind="llm",
    model_id="gpt-4o-mini",
    input="What is the capital of France?",
    output="Paris.",
    latency_ms=312,
    tokens_in=7,
    tokens_out=2,
)

No schema to learn. The fields map 1:1 to what the dashboard displays. Logging is fire-and-forget — the call returns in microseconds and the network POST happens on a background thread. If the gateway is down, your app doesn't care.
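To make the fire-and-forget idea concrete, here is a minimal sketch of the general pattern using only the standard library. It is not the SDK's source: BackgroundLogger, send, and flush are hypothetical names chosen for illustration, and the 10,000-entry bounded queue simply mirrors the capacity described later in this post.

```python
import queue
import threading


class BackgroundLogger:
    """Hypothetical stand-in for a fire-and-forget logger (not the SDK's code)."""

    def __init__(self, send, capacity=10_000):
        self._send = send  # callable that performs the slow network POST
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, **entry):
        """Enqueue and return immediately; drop the entry if the queue is full."""
        try:
            self._queue.put_nowait(entry)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        # Runs on the background thread for the life of the process.
        while True:
            entry = self._queue.get()
            try:
                self._send(entry)
            except Exception:
                pass  # delivery failures never reach the caller
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until everything queued so far has been handed to send()."""
        self._queue.join()
```

The caller only ever pays for a put_nowait on an in-memory queue, which is why a slow or unreachable gateway does not slow down the request path.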

2. Auto-wrap your OpenAI or Anthropic client

import beval
from openai import OpenAI

beval.init()
client = beval.wrap(OpenAI())

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello."}],
)

Every non-streaming chat.completions.create call is captured: input messages, output, token usage, latency, and errors (calls with stream=True are not logged yet). Image parts are detected automatically and logged as kind="vlm" with the image inlined. The same applies to beval.wrap(Anthropic()).
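For readers curious how this kind of wrapping can work, the sketch below shows one plausible approach: replace the create function with a thin proxy that times the call and reports it. This is an illustration, not the SDK's implementation. wrap_create, has_image, fake_create, and the status="success" value are assumptions; the response fields (choices[0].message.content, usage.prompt_tokens, usage.completion_tokens) follow the shape of OpenAI chat completion responses, imitated here with a stub so the example runs without a network call.

```python
import time
from types import SimpleNamespace


def has_image(messages):
    """True if any message carries an OpenAI-style image_url content part."""
    for message in messages or []:
        content = message.get("content")
        if isinstance(content, list):
            if any(isinstance(p, dict) and p.get("type") == "image_url" for p in content):
                return True
    return False


def wrap_create(create, log):
    """Return a drop-in replacement for `create` that reports each call to `log`."""
    def wrapped(*args, **kwargs):
        messages = kwargs.get("messages")
        entry = {
            "kind": "vlm" if has_image(messages) else "llm",
            "model_id": kwargs.get("model"),
            "input": messages,
        }
        start = time.perf_counter()
        try:
            response = create(*args, **kwargs)
        except Exception as exc:
            entry.update(status="failure", error=f"{type(exc).__name__}: {exc}")
            raise  # the caller still sees the original exception
        else:
            entry.update(
                status="success",  # assumed value; the post only names "failure"
                output=response.choices[0].message.content,
                tokens_in=response.usage.prompt_tokens,
                tokens_out=response.usage.completion_tokens,
            )
            return response
        finally:
            entry["latency_ms"] = (time.perf_counter() - start) * 1000
            log(**entry)
    return wrapped


def fake_create(model, messages):
    """Stub that mimics the shape of a chat completion response."""
    return SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content="Hi."))],
        usage=SimpleNamespace(prompt_tokens=2, completion_tokens=1),
    )
```

Because the proxy keeps the original signature and return value, existing call sites do not need to change.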

Zero changes to your call sites.

3. Decorate agent functions

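import beval

@beval.trace
def run_agent(query: str) -> str:
    answer = f"echo: {query}"  # placeholder for your agent logic
    return answer

@beval.trace(name="tool:search", kind="agent")
async def search(q: str) -> list[str]:
    return []  # placeholder for your tool logic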

Captures arguments, return value, latency, and exceptions. Works for both sync and async functions. Exceptions are logged with status="failure" and the exception class + message — then re-raised.
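The behaviour described above (bare or parameterised use, sync and async support, log then re-raise) can be sketched with the standard library alone. What follows is an illustrative decorator, not the SDK's code: the log parameter, the default kind, the status="success" value, and the entry layout are all assumptions made for the example.

```python
import asyncio
import functools
import inspect
import time


def trace(func=None, *, name=None, kind="agent", log=print):
    """Hypothetical decorator usable as @trace or @trace(name=..., kind=...)."""
    if func is None:
        # Called with arguments: return a decorator that remembers them.
        return functools.partial(trace, name=name, kind=kind, log=log)

    label = name or func.__name__

    def record(start, args, kwargs, **fields):
        log(dict(name=label, kind=kind,
                 input={"args": args, "kwargs": kwargs},
                 latency_ms=(time.perf_counter() - start) * 1000,
                 **fields))

    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
            except Exception as exc:
                record(start, args, kwargs, status="failure",
                       error=f"{type(exc).__name__}: {exc}")
                raise  # log first, then let the caller handle it
            record(start, args, kwargs, status="success", output=result)
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            record(start, args, kwargs, status="failure",
                   error=f"{type(exc).__name__}: {exc}")
            raise
        record(start, args, kwargs, status="success", output=result)
        return result
    return sync_wrapper
```

Checking inspect.iscoroutinefunction at decoration time is what lets one decorator serve both kinds of function without the caller choosing a variant.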

What's under the hood

The SDK is deliberately small and boring:

  • One hard dependency: httpx. Optional extras for openai and anthropic integrations.
  • One background thread draining an in-memory queue. 10,000-entry capacity, drop-on-overflow.
  • Retries on 408 / 429 / 5xx with exponential backoff. Network errors never raise to your code.
  • Graceful shutdown: an atexit hook drains the queue with a 5-second timeout, so logs buffered during the last request still make it out.
  • Redaction hook at init() for scrubbing PII before send.
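As an illustration of the retry bullet, here is a small sketch of exponential backoff over the status codes listed above. It is not the SDK's code: send_with_retry, post, max_attempts, and base_delay are hypothetical names and values, and treating network exceptions as retryable is an assumption (the post only says they never raise to your code).

```python
import time

RETRYABLE_STATUSES = {408, 429}


def is_retryable(status):
    """408, 429, and any 5xx response are worth another attempt."""
    return status in RETRYABLE_STATUSES or 500 <= status <= 599


def send_with_retry(post, payload, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call post(payload) -> HTTP status code, retrying transient failures.

    Returns True on a 2xx response and False otherwise. Never raises.
    """
    for attempt in range(max_attempts):
        try:
            status = post(payload)
        except Exception:
            status = None  # assumed: network errors are retried like a 5xx
        if status is not None and not is_retryable(status):
            return 200 <= status < 300
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return False
```

Passing sleep as a parameter keeps the backoff schedule testable without real waiting; a production version would usually add jitter as well.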

Installed footprint: ~22 KB wheel.

Why it exists

BEval Studio has run for months on direct API ingest from the VLM mock app and the bolder-fit-agent production workload. Every team integrating BEval wrote essentially the same HTTP client, retry loop, background thread, and OpenAI wrapper. We consolidated it.

The SDK talks to a single endpoint (POST /api/v1/logs/ingest) that's been stable since BEval launched. That means:

  • Existing custom integrations keep working
  • The SDK is backward compatible across BEval dashboard upgrades
  • New SDK versions won't force backend changes

Already running in production

The same day we published bolder-ai 0.1.0 to PyPI, we integrated it into bolder-fit-agent — our coach-facing agentic service that drives multi-turn LiteLLM + MCP agent runs. Every agent turn now shows up in the BEval dashboard with session ID, plan type, turn index, tool calls, and cost.

Total integration diff: 7 files, 159 insertions. One of those files is the SDK init helper. Most of the rest is a single helper function that builds a log entry around the existing LLM call site.

What's next

This is 0.1. The short list for 0.2 and beyond:

  • Batch ingest — today the SDK posts one log per call. Fine for most workloads, wasteful for high-throughput ones. A batch endpoint is the top priority.
  • Streaming support — wrappers currently don't log when stream=True. We'll accumulate chunks and log on stream end, with time-to-first-token as a first-class metric.
  • Traces and spans — flat logs work for single LLM calls; agent runs want a parent/child tree. The SDK will grow a context manager and nested @trace support once the backend trace model lands.
  • LangChain, LlamaIndex, LiteLLM callback integrations.
  • Direct-to-S3 image upload for VLM payloads over ~64 KB.
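The streaming item on this list describes accumulating chunks and logging once the stream ends. A rough sketch of that technique, written as a generator that passes chunks through untouched, might look like the following. This is a guess at the shape, not planned SDK code: log_stream, the ttft_ms field name, and the use of plain strings as chunks are all assumptions.

```python
import time


def log_stream(chunks, log, clock=time.perf_counter):
    """Yield chunks unchanged; log the joined text and timings when the stream ends."""
    start = clock()
    first_token_at = None
    parts = []
    try:
        for chunk in chunks:
            if first_token_at is None:
                first_token_at = clock()  # time-to-first-token marker
            parts.append(chunk)
            yield chunk
    finally:
        # Runs on normal completion, on error, or if the consumer stops early.
        end = clock()
        log({
            "output": "".join(parts),
            "ttft_ms": None if first_token_at is None else (first_token_at - start) * 1000,
            "latency_ms": (end - start) * 1000,
        })
```

The consumer iterates exactly as before, and a single log entry is emitted after the last chunk, so per-chunk overhead stays at one list append.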

Get started

  1. Install: pip install bolder-ai
  2. Create an API key in your BEval dashboard → Settings → API Keys
  3. Set BEVAL_API_KEY in your environment
  4. Add three lines to your app — beval.init() at startup, and either beval.log(...), beval.wrap(client), or @beval.trace

Full docs: docs.bolder.services. Source lives on PyPI at pypi.org/project/bolder-ai.


If you're running LLMs in production and want structured visibility without standing up your own observability stack, book a call. We'll get you on the dashboard same-day.
