SENTIENT

CLI + SDK

The evaluation and self-improvement layer for the internet of agents.

Sentient gives teams the infrastructure to create evals from traces and datasets, run agents in managed sandboxes, inspect trajectories, compare regressions, and turn real failures into signal for self-improving agents.


Community

Discover shared RL environments, eval benchmarks, and regression suites built by agent teams.

community.sentient.dev
RL ENVS + BENCHMARKS
Featured Community Assets
browser-tool-rl-env by sentient-labs (rl-env)

Browser tool-use environment for training and evaluating web agents

128 tasks · Runtime: E2B + judge · Access: open
42
2
8 discussions
swe-bench-lite-fork by community (benchmark)

Forkable software-engineering benchmark with verifier scripts

300 tasks · Runtime: Verifier script · Access: forkable
127
5
23 discussions
tool-use-safety-suite by agent-team (regression)

Regression suite for tool-call safety, permissions, and recovery behavior

18 tasks · Runtime: LLM judge · Access: forkable
89
3
15 discussions
Community Assets:
• Shared RL environments
• Forkable eval benchmarks
• Public regression suites
3 Featured
446 Tasks
Open network
RL environments • Eval benchmarks • Regression suites

Run any agent harness against the same eval suite

Evaluate CLI agents and deployed artifacts with shared datasets, sandboxes, graders, and run settings.

agent-harnesses
EVAL READY
$ Supported Agent Runners
cursor-cli: Adapter mode
claude-code: Adapter mode
codex: API key or auth.json
opencode: Adapter mode
goose: Adapter mode
swe-agent: Hosted evals
$ Fork dataset # Edit tasks in Playground
$ sen tracing doctor # Check tracing readiness
6 Runners
E2B · Daytona · Modal
Adapter mode • Deployed artifact mode • Managed sandboxes
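
One way to picture "any harness, same suite" is an adapter interface: each runner is wrapped so it takes the same task and returns the same output shape, and one shared grader scores every runner. The sketch below is illustrative only. `Task`, `RunResult`, `AgentAdapter`, `run_suite`, and the two toy adapters are hypothetical names invented for this example, not part of the Sentient CLI or SDK.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class RunResult:
    task_id: str
    runner: str
    output: str
    passed: bool


class AgentAdapter(Protocol):
    """Hypothetical interface every wrapped runner would satisfy."""

    name: str

    def run(self, task: Task) -> str: ...


class EchoAdapter:
    """Toy runner: returns the prompt unchanged."""

    name = "echo"

    def run(self, task: Task) -> str:
        return task.prompt


class UpperAdapter:
    """Toy runner: returns the prompt upper-cased."""

    name = "upper"

    def run(self, task: Task) -> str:
        return task.prompt.upper()


def exact_match(output: str, task: Task) -> bool:
    """Simplest possible grader: exact string equality."""
    return output == task.expected


def run_suite(
    adapters: Iterable[AgentAdapter],
    tasks: list[Task],
    grader: Callable[[str, Task], bool] = exact_match,
) -> list[RunResult]:
    """Run every adapter on every task and grade with one shared grader."""
    results = []
    for adapter in adapters:
        for task in tasks:
            output = adapter.run(task)
            results.append(
                RunResult(task.task_id, adapter.name, output, grader(output, task))
            )
    return results


def pass_rate_by_runner(results: list[RunResult]) -> dict[str, float]:
    """Fraction of passed tasks per runner, in [0, 1]."""
    totals: dict[str, int] = {}
    passed: dict[str, int] = {}
    for r in results:
        totals[r.runner] = totals.get(r.runner, 0) + 1
        passed[r.runner] = passed.get(r.runner, 0) + int(r.passed)
    return {runner: passed[runner] / totals[runner] for runner in totals}
```

Because the tasks and the grader are fixed outside the adapters, differences in pass rate can be attributed to the runner rather than to the evaluation setup. A real adapter would launch the runner inside a sandbox instead of transforming a string.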

Evaluation cockpit for agent teams

Launch benchmark runs, inspect trajectories, compare regressions, and tune graders before rollout.

dashboard.sentient.dev
2 EVAL RUNS
Evaluation Runs
prod-failures-regression(suite)
cursor-cli · claude-sonnet
Trials: 64 · Failed: 9 · Regression delta: -6.4%
swe-bench-lite-fork(benchmark)
codex · gpt-4.1
Trials: 300 · Failed: 42 · Regression delta: +3.1%

Pass Rate (Last 7 Days)

Benchmark pass rate trend (%): Mon 40 · Tue 45 · Wed 38 · Thu 50 · Fri 55 · Sat 48 · Sun 58
Pass rate: 58%
Recovered: 12 tasks
Regressions: 3 tasks
Features:
• Benchmark and regression runs
• Trajectory inspection
• Grader results and run comparison
2 Running
51 Passing
Live grading
Trajectories • Grader results • Regression deltas
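
The page does not define how pass rate, recovered tasks, regressions, or regression delta are computed. As a rough sketch, assume a task is "recovered" when it failed in a baseline run and passes in the current one, a "regression" when the reverse happens, and the delta is the current pass rate minus the baseline pass rate in percentage points. All function names here are hypothetical.

```python
from __future__ import annotations


def pass_rate_from_counts(trials: int, failed: int) -> float:
    """Pass rate in percent from the trial and failure counts of one run."""
    if trials <= 0 or not 0 <= failed <= trials:
        raise ValueError("need trials > 0 and 0 <= failed <= trials")
    return 100.0 * (trials - failed) / trials


def pass_rate(outcomes: dict[str, bool]) -> float:
    """Pass rate in percent from per-task outcomes (True means passed)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes.values()) / len(outcomes)


def compare_runs(baseline: dict[str, bool], current: dict[str, bool]) -> dict:
    """Compare two runs on the tasks they share.

    Assumed definitions (not taken from the product):
      recovered = failed in baseline, passed in current
      regressed = passed in baseline, failed in current
      delta     = current pass rate - baseline pass rate, in points
    """
    shared = baseline.keys() & current.keys()
    recovered = sorted(t for t in shared if not baseline[t] and current[t])
    regressed = sorted(t for t in shared if baseline[t] and not current[t])
    delta = pass_rate({t: current[t] for t in shared}) - pass_rate(
        {t: baseline[t] for t in shared}
    )
    return {"recovered": recovered, "regressed": regressed, "delta": delta}
```

Under these assumptions, a run with 64 trials and 9 failures has a pass rate of about 85.9%, and one with 300 trials and 42 failures has a pass rate of 86%. Restricting the comparison to shared tasks keeps the delta from shifting just because tasks were added to or removed from the suite between runs.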

From evals to self-improving agents

Evals are the foundation. Sentient is building toward RL and post-training meta-harnesses that use trajectories, grader feedback, benchmark results, and regression history as the signal for improving agents.

Evaluation datasets become training signal
Failed trajectories become repair tasks
Grader feedback becomes reward signal
Regression suites become safety rails
Human and automated review feed one loop
RL post-training harnesses are the long-term goal
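
To make two of these mappings concrete, here is a minimal sketch of how grader feedback could become a scalar reward and how failed trajectories could become repair tasks. It assumes a grader score in [0, 1] and a pass threshold of 0.5. Both are assumptions for illustration, and `GradedTrajectory`, `to_reward`, and `to_repair_tasks` are hypothetical names, not Sentient APIs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GradedTrajectory:
    task_id: str
    steps: tuple[str, ...]
    grader_score: float  # assumed to lie in [0, 1]
    grader_note: str


def to_reward(score: float) -> float:
    """Clamp a grader score to [0, 1], then rescale it to [-1, 1]."""
    clamped = min(1.0, max(0.0, score))
    return 2.0 * clamped - 1.0


def to_repair_tasks(
    trajectories: list[GradedTrajectory], pass_threshold: float = 0.5
) -> list[dict]:
    """Turn every trajectory scored below the threshold into a repair task."""
    return [
        {
            "task_id": t.task_id,
            "failed_steps": list(t.steps),
            "hint": t.grader_note,
        }
        for t in trajectories
        if t.grader_score < pass_threshold
    ]
```

A symmetric reward range is only one choice. A training harness might prefer raw scores, sparse pass/fail rewards, or per-step credit instead.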

Ready to evaluate and improve your agents?

Start with a benchmark, fork a dataset, or turn production failures into regression suites. Sentient gives you the eval infrastructure to measure behavior and make agents better.

01

Initialize

Start a Sentient project with agent and eval-ready defaults

$ sen init
02

Deploy Agent

Ship an agent artifact that can be traced, logged, and evaluated

$ sen push
03

Inspect Failures

Pull live failure logs that feed debugging and regression creation

$ sen logs --live
Start Evaluating
$ pip install sentient-cli