These videos show agents tackling real-world economic tasks from EconWebArena on the web. Playback is shown at 5× speed.
The EconWebArena benchmark covers a wide range of economic domains and task types. Each task is grounded in a real website and designed to test agents' abilities in navigation, retrieval, and reasoning.
The benchmark draws from 82 authoritative websites, including government portals, financial databases, and statistical agencies. Below are representative examples of the diverse web interfaces agents interact with.
Each EconWebArena task starts from a real website and is framed as a precise question with an explicit answer format. The examples below span a range of domains and data formats, from inflation statistics to market prices and interest rates.
Category | Task Description | Start URL | Answer | Domain |
---|---|---|---|---|
Government | As published by the Office for National Statistics, what was the CPIH annual inflation rate for all items (2015=100) in the United Kingdom in March 2025? Provide only the number as a decimal with one digit after the decimal point, without percent symbols or other units. | https://www.ons.gov.uk | 3.4 | ons.gov.uk |
Energy | As reported by the U.S. Energy Information Administration, what was the average retail price of regular gasoline in California during the week of March 24, 2025, in dollars per gallon? Provide only the number as a decimal with three digits after the decimal point, without currency symbols, commas, or other units. | https://www.eia.gov | 4.418 | eia.gov |
Markets | As reported by Cox Automotive, what was the total number of unsold used vehicles in the United States as of March 31, 2025? Provide only the number as a decimal with two digits, in millions, without commas or other units. | https://www.coxautoinc.com | 2.14 | coxautoinc.com |
Banking | As reported by the Federal Reserve Bank of New York, what was the effective federal funds rate on January 10, 2025? Provide only the number as a decimal with two digits, without percent symbols or other units. | https://www.newyorkfed.org | 4.33 | newyorkfed.org |
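The formatting instructions embedded in each question (for example, "one digit after the decimal point, without percent symbols") mean that a response can be checked mechanically. The sketch below is an illustration only, not the benchmark's evaluator: `has_decimal_places` is a hypothetical helper that tests whether a response string is a bare number with the requested number of decimal places, applied here to the four example answers from the table above.

```python
import re


def has_decimal_places(answer: str, digits: int) -> bool:
    """Return True if `answer` is a bare non-negative number with exactly `digits` decimals.

    Hypothetical helper for illustration; it rejects units, percent signs,
    currency symbols, and thousands separators.
    """
    pattern = rf"\d+\.\d{{{digits}}}"
    return re.fullmatch(pattern, answer.strip()) is not None


# (answer, required decimal places) pairs taken from the example tasks above.
EXAMPLES = [("3.4", 1), ("4.418", 3), ("2.14", 2), ("4.33", 2)]

for answer, digits in EXAMPLES:
    print(answer, has_decimal_places(answer, digits))
```

A response such as `3.4%` or `$4.418` would fail this check, which is the kind of deviation the task wording is meant to rule out.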
We evaluate six state-of-the-art LLM agents on EconWebArena across major economic domains. The table below reports the task success rate (SR) for each category and overall, along with the average number of steps taken on successful tasks. Human performance is shown for comparison.
Category | Tasks | o4-mini | GPT-4.1 | GPT-4o | Claude-4† | Gemini-2.5‡ | Llama-4§ | Human |
---|---|---|---|---|---|---|---|---|
Banking | 60 | 41.7% | 23.3% | 18.3% | 38.3% | 28.3% | 21.7% | 95.0% |
Finance | 21 | 33.3% | 14.3% | 14.3% | 23.8% | 33.3% | 9.5% | 95.2% |
Government | 138 | 57.2% | 45.7% | 35.5% | 47.1% | 39.1% | 26.1% | 91.3% |
Labor | 24 | 20.8% | 0.0% | 8.3% | 12.5% | 4.2% | 4.2% | 91.7% |
Markets | 60 | 48.3% | 35.0% | 33.3% | 41.7% | 33.3% | 15.0% | 96.7% |
Other* | 57 | 42.1% | 24.6% | 21.1% | 31.6% | 22.8% | 12.3% | 93.0% |
All SR | 360 | 46.9% | 31.9% | 26.9% | 38.6% | 31.1% | 18.9% | 93.3% |
Steps | – | 8.99 | 7.23 | 7.77 | 11.77 | 9.29 | 9.54 | – |
†Claude Sonnet 4. ‡Gemini 2.5 Flash. §Llama 4 Maverick. *Other: Energy, RealEstate, Trade, Education, and Health.
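The "All SR" row is consistent with a task-weighted average of the per-category rates, not a simple mean of the six percentages. The short sketch below recomputes it for o4-mini from the numbers in the table; the variable and function names are illustrative, and because the per-category rates are already rounded to one decimal place, the result matches the reported value only up to rounding.

```python
# Task counts and o4-mini success rates (%) copied from the results table.
TASKS = {"Banking": 60, "Finance": 21, "Government": 138,
         "Labor": 24, "Markets": 60, "Other": 57}
O4_MINI_SR = {"Banking": 41.7, "Finance": 33.3, "Government": 57.2,
              "Labor": 20.8, "Markets": 48.3, "Other": 42.1}


def overall_sr(tasks: dict, rates: dict) -> float:
    """Weight each category's success rate (%) by its number of tasks."""
    total = sum(tasks.values())
    solved = sum(tasks[c] * rates[c] / 100 for c in tasks)
    return 100 * solved / total


print(sum(TASKS.values()))                       # 360
print(round(overall_sr(TASKS, O4_MINI_SR), 1))   # 46.9
```

The weighting matters because Government alone accounts for 138 of the 360 tasks, so it pulls the overall rate toward its own value far more than Finance (21 tasks) or Labor (24 tasks).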
A short video summary of EconWebArena, its task design, and evaluation results.
EconWebArena provides a challenging benchmark for evaluating agents on complex economic tasks in real-world web environments, revealing current models’ limitations in grounding, navigation, and multimodal reasoning.
@article{liu2025econwebarena,
  title={EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments},
  author={Liu, Zefang and Quan, Yinzhu},
  journal={arXiv preprint arXiv:2506.08136},
  year={2025}
}