Benchmarks & Arena
Benchmark local AI models on real tool execution and pit models against each other in Elo-rated arena battles
Overview
Lite Suite includes two benchmarking systems: LiteBench (automated tool-execution benchmarks) and the Arena (head-to-head model battles with Elo ratings). Both run as panels inside the workspace.
LiteBench is also available as a standalone open-source app for users who want benchmarking without the full Lite Suite.
LiteBench — Tool Execution Benchmarks
LiteBench is the first benchmark that actually executes tools with local AI models, rather than only checking that tool calls are well-formed JSON. Models navigate websites in an embedded browser, search the web, execute code in a sandbox, and fetch URLs.
Agent Tools
8 tools with real execution:
| Tool | Description |
|------|-------------|
| browser_go | Navigate to a URL |
| browser_elements | List page elements |
| browser_click | Click an element |
| browser_type | Type into a field |
| web_search | Search the web |
| web_fetch | Fetch a URL |
| sandbox | Execute code |
| youtube | Extract video info |
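For readers wiring these tools to an OpenAI-compatible endpoint themselves, the sketch below shows how two of the tools in the table might be declared in the OpenAI function-calling format, and how a returned tool call can be decoded. The tool names and descriptions come from the table; the parameter names (`url`, `query`) and the helpers `make_tool` and `parse_tool_call` are assumptions for illustration, not LiteBench's actual schema.

```python
import json


def make_tool(name, description, properties, required):
    """Build a tool declaration in the OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Names and descriptions are from the table above.
# The parameter names ("url", "query") are assumed, not documented.
TOOLS = [
    make_tool("browser_go", "Navigate to a URL",
              {"url": {"type": "string"}}, ["url"]),
    make_tool("web_search", "Search the web",
              {"query": {"type": "string"}}, ["query"]),
]


def parse_tool_call(call):
    """Return (tool name, decoded arguments) from one tool call.

    In the OpenAI chat format, `arguments` arrives as a JSON string,
    so it has to be decoded before the tool can be executed.
    """
    fn = call["function"]
    return fn["name"], json.loads(fn["arguments"])
```

A benchmark that executes tools would pass the decoded arguments to a real implementation and return the result to the model, instead of stopping once the JSON parses.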
Model Leaderboard
Scores from real tool execution — 5 tests covering browser navigation, web search, page reading, code sandbox, and URL fetching.
| Model | Params | Score |
|-------|--------|-------|
| Jackrong 0.8B Opus Distill | 0.8B | 100% |
| Qwen 3 4B | 4B | 100% |
| Devstral Small 2 | 24B | 100% |
| Gemma 4 31B Opus Distill | 31B | 100% |
| Gemma 4 E4B | ~4B | 100% |
Text Benchmarks
6 additional benchmark suites: Creator, Standard, Speed, Stress, Judgment, and Multimodal.
Arena — Model Battles
Pit two AI models against each other on the same prompt:
- Select two models and a prompt
- Both generate responses simultaneously
- Vote on which output is better
- Elo ratings update based on the outcome
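The rating update in the last step can be sketched with the standard Elo formula. The K-factor of 32 and the 1500 starting rating used here are common defaults and are assumptions; the values the Arena actually uses are not documented in this section.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple:
    """Return the new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k (assumed 32) controls how far a single result moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models at an assumed starting rating of 1500; A wins the vote.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

With equal starting ratings the expected score is 0.5, so a win moves each rating by half of K. An upset win over a higher-rated model moves the ratings further than an expected win does.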
Results persist in a local SQLite database. Track leaderboard standings over time with a gallery of past battles.
Compatible Endpoints
Any endpoint that exposes an OpenAI-compatible API.
Requirements
- Python 3.10+ on PATH (for sandbox and web tools)
- An OpenAI-compatible endpoint with a loaded model
- No GPU required by the benchmark itself — GPU needed only for model inference
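One way to confirm the endpoint requirement before running a benchmark is to query the `/models` route that OpenAI-compatible servers generally expose. This is a minimal sketch using only the standard library. The base URL in the usage note is a placeholder, and some servers also require an `Authorization` header, which is not handled here.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a base such as '.../v1'."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style model list response."""
    return [m["id"] for m in payload.get("data", [])]


def list_models(base_url: str, timeout: float = 5.0) -> list:
    """Fetch the ids of the models the endpoint reports.

    Raises urllib.error.URLError if the server is unreachable.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_models("http://localhost:8080/v1")` (a hypothetical address; substitute your server's) should return a list of model ids. An empty list or a connection error means the endpoint is not ready.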
Troubleshooting
Tools return errors. Open the tool call card to see the full error. The most common causes are Python not being found on PATH and required pip packages not being installed.
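Both causes can be checked from a short script. The sketch below looks for a Python executable on PATH and reports which modules from a given list cannot be imported. The package names the tools actually depend on are not listed in this section, so the caller supplies them. Note that the import check applies to the interpreter running the script, which may differ from the one found on PATH.

```python
import importlib.util
import shutil
from typing import Optional


def find_python() -> Optional[str]:
    """Return the path of the first Python executable on PATH, if any."""
    for name in ("python", "python3"):
        path = shutil.which(name)
        if path:
            return path
    return None


def missing_packages(names: list) -> list:
    """Return the top-level module names that cannot be imported.

    This checks the interpreter running this script, not necessarily
    the one that find_python() located.
    """
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    print("Python on PATH:", find_python() or "not found")
    # Replace this list with the packages named in the tool error.
    print("Missing:", missing_packages(["json"]))
```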
Browser panel is blank. The embedded browser requires the Browser panel to be open before you ask the agent to browse.
