Sentient - Evaluation and Self-Improving Layer for Agents

SENTIENT

███████╗███████╗███╗   ██╗████████╗██╗███████╗███╗   ██╗████████╗
██╔════╝██╔════╝████╗  ██║╚══██╔══╝██║██╔════╝████╗  ██║╚══██╔══╝
███████╗█████╗  ██╔██╗ ██║   ██║   ██║█████╗  ██╔██╗ ██║   ██║   
╚════██║██╔══╝  ██║╚██╗██║   ██║   ██║██╔══╝  ██║╚██╗██║   ██║   
███████║███████╗██║ ╚████║   ██║   ██║███████╗██║ ╚████║   ██║   
╚══════╝╚══════╝╚═╝  ╚═══╝   ╚═╝   ╚═╝╚══════╝╚═╝  ╚═══╝   ╚═╝

CLI + SDK

The evaluation and self-improving layer for the internet of Agents.

Sentient gives teams the infrastructure to create evals from traces and datasets, run agents in managed sandboxes, inspect trajectories, compare regressions, and turn real failures into signal for self-improving agents.
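As an illustration of creating evals from traces, here is a minimal Python sketch. It does not use the Sentient SDK: the trace fields (`prompt`, `status`, `corrected_output`), the `EvalCase` class, and the `evals_from_traces` helper are all hypothetical names chosen for this example.

```python
from dataclasses import asdict, dataclass
import hashlib
import json


@dataclass(frozen=True)
class EvalCase:
    """One eval task derived from a recorded trace (hypothetical schema)."""
    case_id: str
    prompt: str
    expected: str


def evals_from_traces(traces):
    """Turn failed traces that carry a corrected answer into eval cases.

    Traces with the same prompt collapse into a single case.
    """
    cases = {}
    for trace in traces:
        if trace["status"] != "failed" or not trace.get("corrected_output"):
            continue
        # Stable id derived from the prompt, so re-importing does not duplicate cases.
        case_id = hashlib.sha256(trace["prompt"].encode("utf-8")).hexdigest()[:12]
        cases.setdefault(
            case_id, EvalCase(case_id, trace["prompt"], trace["corrected_output"])
        )
    return list(cases.values())


# Assumed trace shape; real traces will carry far more detail.
traces = [
    {"prompt": "Rename foo to bar", "status": "failed", "corrected_output": "renamed"},
    {"prompt": "Rename foo to bar", "status": "failed", "corrected_output": "renamed"},
    {"prompt": "List files", "status": "ok"},
]
suite = evals_from_traces(traces)
print(json.dumps([asdict(case) for case in suite], indent=2))
```

Only the failed trace with a known-good answer becomes a case here; the duplicate and the successful trace are skipped.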

START NOW

→Read Eval Docs


Community

Discover shared RL environments, eval benchmarks, and regression suites built by agent teams.

community.sentient.dev

RL ENVS + BENCHMARKS

Featured Community Assets

●browser-tool-rl-env by sentient-labs (rl-env)

Browser tool-use environment for training and evaluating web agents

128 tasks · Runtime: E2B + judge · Access: open

▲42 · ▼2 · 8 discussions

●swe-bench-lite-fork by community (benchmark)

Forkable software-engineering benchmark with verifier scripts

300 tasks · Runtime: Verifier script · Access: forkable

▲127 · ▼5 · 23 discussions

●tool-use-safety-suite by agent-team (regression)

Regression suite for tool-call safety, permissions, and recovery behavior

18 tasks · Runtime: LLM judge · Access: forkable

▲89 · ▼3 · 15 discussions

Community Assets:

• Shared RL environments

• Forkable eval benchmarks

• Public regression suites

3 Featured

446 Tasks

Open network

●RL environments • Eval benchmarks • Regression suites

Run any agent harness against the same eval suite

Evaluate CLI agents and deployed artifacts with shared datasets, sandboxes, graders, and run settings.
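The idea of holding the dataset and grader fixed while swapping the harness can be sketched in plain Python. This is an assumption-laden toy, not Sentient's adapter interface: `Harness`, `exact_match_grader`, `run_suite`, and the two toy harnesses are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a harness is any callable from task prompt to output.
Harness = Callable[[str], str]


def exact_match_grader(output: str, expected: str) -> bool:
    """Simplest possible grader: pass only on an exact (trimmed) match."""
    return output.strip() == expected.strip()


def run_suite(
    harnesses: Dict[str, Harness],
    tasks: List[dict],
    grader: Callable[[str, str], bool] = exact_match_grader,
) -> Dict[str, float]:
    """Run every harness on the same tasks with the same grader.

    Returns the pass rate per harness as a fraction in [0, 1].
    """
    results = {}
    for name, harness in harnesses.items():
        passed = sum(grader(harness(t["prompt"]), t["expected"]) for t in tasks)
        results[name] = passed / len(tasks)
    return results


tasks = [
    {"prompt": "2+2", "expected": "4"},
    {"prompt": "3+3", "expected": "6"},
]
harnesses = {
    "always-four": lambda prompt: "4",
    "adder": lambda prompt: str(sum(int(x) for x in prompt.split("+"))),
}
pass_rates = run_suite(harnesses, tasks)
print(pass_rates)
```

Because the tasks and grader are shared, any difference in pass rate comes from the harness itself.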

agent-harnesses

EVAL READY

$ Supported Agent Runners

✓cursor-cli: Adapter mode

✓claude-code: Adapter mode

✓codex: API key or auth.json

✓opencode: Adapter mode

✓goose: Adapter mode

✓swe-agent: Hosted evals

$ Fork dataset # Edit tasks in Playground

$ sen tracing doctor # Check tracing readiness

6 Runners

E2B · Daytona · Modal

●Adapter mode • Deployed artifact mode • Managed sandboxes

Evaluation cockpit for agent teams

Launch benchmark runs, inspect trajectories, compare regressions, and tune graders before rollout.

dashboard.sentient.dev

2 EVAL RUNS

Evaluation Runs

●prod-failures-regression (suite)

cursor-cli · claude-sonnet

Trials: 64 · Failed: 9 · Regression delta: -6.4%

●swe-bench-lite-fork (benchmark)

codex · gpt-4.1

Trials: 300 · Failed: 42 · Regression delta: +3.1%

Pass Rate (Last 7 Days)

Benchmark pass rate trend (%), on a 0 to 100 scale:

Mon 40% · Tue 45% · Wed 38% · Thu 50% · Fri 55% · Sat 48% · Sun 58%

Pass rate: 58%

Recovered: 12 tasks

Regressions: 3 tasks
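A run comparison of this kind can be sketched as follows. The page does not define how the dashboard computes its regression delta, so this example assumes a simple percentage-point difference in pass rate over the tasks both runs share; `compare_runs` and the task ids are hypothetical.

```python
def compare_runs(baseline, candidate):
    """Compare two runs given as {task_id: passed} mappings.

    Assumption: the delta is the candidate pass rate minus the baseline
    pass rate, in percentage points, over tasks present in both runs.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no tasks, nothing to compare")

    base_rate = sum(baseline[t] for t in shared) / len(shared)
    cand_rate = sum(candidate[t] for t in shared) / len(shared)

    return {
        "pass_rate": cand_rate,
        "delta_points": (cand_rate - base_rate) * 100,
        # Failed before, passes now.
        "recovered": [t for t in shared if not baseline[t] and candidate[t]],
        # Passed before, fails now.
        "regressed": [t for t in shared if baseline[t] and not candidate[t]],
    }


baseline = {"t1": True, "t2": False, "t3": True, "t4": False}
candidate = {"t1": True, "t2": True, "t3": False, "t4": True}
report = compare_runs(baseline, candidate)
print(report)
```

Tracking recovered and regressed tasks separately matters because a higher overall pass rate can still hide tasks that newly fail.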

Features:

• Benchmark and regression runs

• Trajectory inspection

• Grader results and run comparison

2 Running

51 Passing

Live grading

●Trajectories • Grader results • Regression deltas

From evals to self-improving agents

Evals are the foundation. Sentient is building toward RL and post-training meta-harnesses that use trajectories, grader feedback, benchmark results, and regression history as the signal for improving agents.

●Evaluation datasets become training signal

●Failed trajectories become repair tasks

●Grader feedback becomes reward signal

●Regression suites become safety rails

●Human and automated review feed one loop

●RL post-training harnesses are the long-term goal
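To make the loop above concrete, here is a rough Python sketch of how grader scores might become rewards and how failed trajectories might be queued as repair tasks. Everything in it is an assumption for illustration: the `GradedTrajectory` fields, the reward scaling, the pass threshold, and the extra penalty for regression-suite failures do not come from Sentient.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GradedTrajectory:
    """A finished agent run plus its grader score (hypothetical schema)."""
    task_id: str
    steps: List[str]
    grader_score: float  # assumed to lie in [0, 1]
    in_regression_suite: bool = False


def to_reward(traj: GradedTrajectory, pass_threshold: float = 0.5,
              regression_penalty: float = 1.0) -> float:
    """Map a grader score to a reward in [-1, 1], with an extra penalty
    when a regression-suite task fails (the "safety rail" idea)."""
    reward = 2.0 * traj.grader_score - 1.0
    if traj.in_regression_suite and traj.grader_score < pass_threshold:
        reward -= regression_penalty
    return reward


def split_signal(trajectories: List[GradedTrajectory],
                 pass_threshold: float = 0.5) -> Tuple[Dict[str, float], List[str]]:
    """Return per-task rewards and the ids of tasks that need repair."""
    rewards = {t.task_id: to_reward(t, pass_threshold) for t in trajectories}
    repair_queue = [t.task_id for t in trajectories if t.grader_score < pass_threshold]
    return rewards, repair_queue


runs = [
    GradedTrajectory("a", ["open page", "click"], grader_score=1.0),
    GradedTrajectory("b", ["open page"], grader_score=0.25),
    GradedTrajectory("c", ["call tool"], grader_score=0.0, in_regression_suite=True),
]
rewards, repair_queue = split_signal(runs)
print(rewards)
print(repair_queue)
```

In this sketch a failed regression task is punished harder than an ordinary failure, which is one simple way a regression suite could act as a guardrail during training.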

Ready to evaluate and improve your agents?

Start with a benchmark, fork a dataset, or turn production failures into regression suites. Sentient gives you the eval infrastructure to measure behavior and make agents better.

Initialize

Start a Sentient project with agent and eval-ready defaults

$ sen init

Deploy Agent

Ship an agent artifact that can be traced, logged, and evaluated

$ sen push

Inspect Failures

Pull live failure logs that feed debugging and regression creation

$ sen logs --live

▶Start Evaluating

$ pip install sentient-cli