EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu¹*, Yinzhu Quan¹*
¹Georgia Institute of Technology
*These authors contributed equally to this work.

EconWebArena is a benchmark for evaluating agents on realistic economic tasks grounded in live websites. It includes 360 curated tasks from 82 authoritative sources across ten domains such as finance, labor, and trade. Tasks require agents to navigate webpages, interpret visual and structured data, and extract accurate, time-sensitive answers. Our analysis highlights persistent challenges in grounding, reasoning, and interaction for current models.

Demo Videos

These videos show agents tackling real-world economic tasks from EconWebArena on the web. Playback is shown at 5× speed.

EconWebArena

The EconWebArena benchmark covers a wide range of economic domains and task types. Each task is grounded in a real website and designed to test agents' abilities in navigation, retrieval, and reasoning.

EconWebArena overview

Website Examples

The benchmark draws from 82 authoritative websites, including government portals, financial databases, and statistical agencies. Below are representative examples of the diverse web interfaces agents interact with.

Examples of websites used in EconWebArena tasks

Task Examples

Each EconWebArena task is grounded in a real website and framed as a precise, answerable question. The examples below illustrate the diversity of domains and data formats, from inflation statistics to market prices and interest rates.

| Category | Task Description | Start URL | Answer | Domain |
| --- | --- | --- | --- | --- |
| Government | As published by the Office for National Statistics, what was the CPIH annual inflation rate for all items (2015=100) in the United Kingdom in March 2025? Provide only the number as a decimal with one digit after the decimal point, without percent symbols or other units. | https://www.ons.gov.uk | 3.4 | ons.gov.uk |
| Energy | As reported by the U.S. Energy Information Administration, what was the average retail price of regular gasoline in California during the week of March 24, 2025, in dollars per gallon? Provide only the number as a decimal with three digits after the decimal point, without currency symbols, commas, or other units. | https://www.eia.gov | 4.418 | eia.gov |
| Markets | As reported by Cox Automotive, what was the total number of unsold used vehicles in the United States as of March 31, 2025? Provide only the number as a decimal with two digits, in millions, without commas or other units. | https://www.coxautoinc.com | 2.14 | coxautoinc.com |
| Banking | As reported by the Federal Reserve Bank of New York, what was the effective federal funds rate on January 10, 2025? Provide only the number as a decimal with two digits, without percent symbols or other units. | https://www.newyorkfed.org | 4.33 | newyorkfed.org |
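Each task description pins down the answer format (a plain number with a fixed count of decimal places and no units). This page does not say how answers are scored, so the following is only a minimal sketch of one plausible check, assuming a strict comparison against the reference answer. The names `matches_expected` and `decimal_places` are hypothetical and are not part of the benchmark.

```python
import re
from decimal import Decimal

# A plain number: optional sign, digits, optional fractional part.
# Percent signs, currency symbols, commas, and units all fail this pattern.
_PLAIN_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def decimal_places(text: str) -> int:
    """Count the digits after the decimal point in a numeric string."""
    return len(text.split(".")[1]) if "." in text else 0


def matches_expected(answer: str, expected: str) -> bool:
    """Hypothetical strict check: same format and same value as the reference."""
    candidate = answer.strip()
    if not _PLAIN_NUMBER.fullmatch(candidate):
        return False  # contains units, symbols, or extra text
    if decimal_places(candidate) != decimal_places(expected):
        return False  # wrong precision for this task
    return Decimal(candidate) == Decimal(expected)


print(matches_expected("3.4", "3.4"))      # True
print(matches_expected("3.4%", "3.4"))     # False: percent symbol
print(matches_expected("4.42", "4.418"))   # False: wrong precision
```

Comparing with `Decimal` rather than `float` avoids binary rounding surprises when the reference answers are short decimal strings.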

Model Performance

We evaluate several state-of-the-art LLM agents on EconWebArena across major economic domains. The table below reports task success rates (SR) and average number of steps on successful tasks. Human performance is shown for comparison.

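| Category | Tasks | o4-mini | GPT-4.1 | GPT-4o | Claude-4 | Gemini-2.5 | Llama-4§ | Human |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Banking | 60 | 41.7% | 23.3% | 18.3% | 38.3% | 28.3% | 21.7% | 95.0% |
| Finance | 21 | 33.3% | 14.3% | 14.3% | 23.8% | 33.3% | 9.5% | 95.2% |
| Government | 138 | 57.2% | 45.7% | 35.5% | 47.1% | 39.1% | 26.1% | 91.3% |
| Labor | 24 | 20.8% | 0.0% | 8.3% | 12.5% | 4.2% | 4.2% | 91.7% |
| Markets | 60 | 48.3% | 35.0% | 33.3% | 41.7% | 33.3% | 15.0% | 96.7% |
| Other* | 57 | 42.1% | 24.6% | 21.1% | 31.6% | 22.8% | 12.3% | 93.0% |
| All (SR) | 360 | 46.9% | 31.9% | 26.9% | 38.6% | 31.1% | 18.9% | 93.3% |
| All (Steps) | | 8.99 | 7.23 | 7.77 | 11.77 | 9.29 | 9.54 | |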

Claude-4: Claude Sonnet 4. Gemini-2.5: Gemini 2.5 Flash. §Llama-4: Llama 4 Maverick. *Other: Energy, Real Estate, Trade, Education, and Health.
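The overall success rate is a task-weighted figure, not a plain average of the six category percentages (for o4-mini, an unweighted mean of the category rates would give about 40.6%, not 46.9%). The sketch below reproduces the "All" row from the per-category numbers reported above. It assumes each category rate is a rounded ratio of solved tasks to total tasks, so it first recovers whole-task counts; the function name `overall_success_rate` is illustrative only.

```python
# Task counts per category, taken from the table above.
TASKS = {"Banking": 60, "Finance": 21, "Government": 138,
         "Labor": 24, "Markets": 60, "Other": 57}

# Reported success rates (percent) for two of the columns.
O4_MINI_SR = {"Banking": 41.7, "Finance": 33.3, "Government": 57.2,
              "Labor": 20.8, "Markets": 48.3, "Other": 42.1}
HUMAN_SR = {"Banking": 95.0, "Finance": 95.2, "Government": 91.3,
            "Labor": 91.7, "Markets": 96.7, "Other": 93.0}


def overall_success_rate(tasks: dict, rates: dict) -> float:
    """Recover solved-task counts per category, then pool them."""
    # Assumption: rounding rate * tasks to the nearest integer recovers
    # the number of solved tasks behind each rounded percentage.
    solved = sum(round(tasks[c] * rates[c] / 100) for c in tasks)
    return 100 * solved / sum(tasks.values())


print(f"{overall_success_rate(TASKS, O4_MINI_SR):.1f}")  # 46.9
print(f"{overall_success_rate(TASKS, HUMAN_SR):.1f}")    # 93.3
```

Because Government accounts for 138 of the 360 tasks, its rate carries more weight in the overall figure than any other category.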

Summary Video

A short video summary of EconWebArena, its task design, and evaluation results.

Takeaway

EconWebArena provides a challenging benchmark for evaluating agents on complex economic tasks in real-world web environments, revealing current models’ limitations in grounding, navigation, and multimodal reasoning.

BibTeX

@article{liu2025econwebarena,
  title={EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments},
  author={Liu, Zefang and Quan, Yinzhu},
  journal={arXiv preprint arXiv:2506.08136},
  year={2025}
}