Benchmarks & Arena

Benchmark local AI models on real tool execution and pit models against each other in Elo-rated arena battles

Overview

Lite Suite includes two benchmarking systems: LiteBench (automated tool-execution benchmarks) and the Arena (head-to-head model battles with Elo ratings). Both run as panels inside the workspace.

LiteBench is also available as a standalone open-source app for users who want benchmarking without the full Lite Suite.

LiteBench — Tool Execution Benchmarks

The first benchmark that actually executes tools with local AI models. Models navigate websites in an embedded browser, search the web, execute code in a sandbox, and fetch URLs. Not just JSON format checking.

Agent Tools

8 tools with real execution:

| Tool | Description | |------|-------------| | browser_go | Navigate to a URL | | browser_elements | List page elements | | browser_click | Click an element | | browser_type | Type into a field | | web_search | Search the web | | web_fetch | Fetch a URL | | sandbox | Execute code | | youtube | Extract video info |

Model Leaderboard

Scores from real tool execution — 5 tests covering browser navigation, web search, page reading, code sandbox, and URL fetching.

| Model | Params | Score | |-------|--------|-------| | Jackrong 0.8B Opus Distill | 0.8B | 100% | | Qwen 3 4B | 4B | 100% | | Devstral Small 2 | 24B | 100% | | Gemma 4 31B Opus Distill | 31B | 100% | | Gemma 4 E4B | ~4B | 100% |

Text Benchmarks

6 additional benchmark suites: Creator, Standard, Speed, Stress, Judgment, and Multimodal.

Arena — Model Battles

Pit two AI models against each other on the same prompt:

Select two models and a prompt
Both generate responses simultaneously
Vote on which output is better
Elo ratings update based on the outcome

Results persist in a local SQLite database. Track leaderboard standings over time with a gallery of past battles.

Compatible Endpoints

Any OpenAI-compatible API:

LM Studio (recommended)
Ollama
llama.cpp
vLLM

Requirements

Python 3.10+ on PATH (for sandbox and web tools)
An OpenAI-compatible endpoint with a loaded model
No GPU required by the benchmark itself — GPU needed only for model inference

Troubleshooting

Tools return errors. Open the tool call card to see the full error. Most common: Python not found, or pip packages not installed.

Browser panel is blank. The embedded browser requires the Browser panel to be open before asking the agent to browse.