APIEval-20
A Benchmark for Black-Box API Test Suite Generation
Motivation
Testing APIs thoroughly is one of the most critical, yet consistently underserved, activities in software engineering. Despite a rich ecosystem of API testing tools — Postman, RestAssured, Schemathesis, Dredd, and others — we found ourselves asking a deceptively simple question:
Given only the schema and an example payload of an API request — no source code, no documentation, no prior knowledge — how well can an AI agent generate a test suite that actually finds bugs?
We searched for an existing benchmark that captured this black-box scenario and came up empty. Every evaluation we found either required access to the implementation, relied on rich API documentation, or measured properties like schema compliance rather than actual bug-finding capability. The practitioner reality is different: teams frequently receive API payloads with little context and need to construct meaningful tests quickly.
That gap is the reason APIEval-20 exists.
APIEval-20 is not a model benchmark. It is a task benchmark for AI agents. It evaluates end-to-end agent behavior — the ability to reason about an API surface, design targeted tests, and uncover real bugs — not just the quality of generated text.
Benchmark Overview
APIEval-20 consists of 20 carefully designed API scenarios drawn from real-world application domains. Each scenario presents the agent with an API request schema and a sample payload, then challenges it to produce a test suite that exposes bugs hidden within a live reference implementation.
Domains Covered
The 20 scenarios span application domains chosen to reflect a broad range of validation patterns, business logic complexity, and security sensitivity.
Bug Spectrum
Each scenario contains between 3 and 8 planted bugs. Rather than categorising bugs by severity, APIEval-20 classifies them by complexity — reflecting how much reasoning is required to discover them. Bugs range along a continuum from simple to complex.
At the simple end, bugs can be triggered with basic structural probes such as empty values ("", null, []) and wrong data types; at the complex end, discovery requires deeper reasoning about the API's behaviour. A strong test suite should span the full complexity spectrum; simple structural checks alone will not surface the bugs that matter most in production.
Agent I/O
What the Agent Receives
For each scenario, the agent is given exactly two inputs. Nothing else — no response schema, no implementation details, no error messages, no changelog. This deliberate constraint reflects the black-box testing reality and prevents agents from trivially exploiting documentation.
Input 1: the request schema.

{
"user_id": { "type": "string", "required": true },
"items": { "type": "array", "required": true,
"items": { "product_id": "string",
"quantity": "integer",
"unit_price": "number" } },
"coupon_code": { "type": "string", "required": false },
"currency": { "type": "string", "required": true,
"description": "ISO 4217 currency code" },
"shipping": { "type": "object", "required": true,
"properties": { "address": "string",
"method": "string" } }
}

Input 2: a sample valid payload.

{
"user_id": "usr_4821",
"items": [
{
"product_id": "prod_991",
"quantity": 2,
"unit_price": 29.99
}
],
"coupon_code": "SAVE10",
"currency": "USD",
"shipping": {
"address": "123 Main St, Springfield",
"method": "standard"
}
}

What the Agent Produces
The agent must output a test suite — a list of test cases, where each test case contains a short human-readable test name and the complete request payload as a valid JSON object. No expected outcome is required. Evaluation is performed by running each test case against the live reference implementation and observing what actually happens.
{
"test_name": "Order with zero quantity item",
"payload": {
"user_id": "usr_4821",
"items": [{ "product_id": "prod_991", "quantity": 0, "unit_price": 29.99 }],
"currency": "USD",
"shipping": { "address": "123 Main St, Springfield", "method": "standard" }
}
}

Evaluation Methodology
All 20 reference API implementations are deployed and running. Evaluation is fully automated: each test case in the agent's output is executed against the live API, and the responses are analysed to determine which planted bugs were triggered.
A bug is considered detected if at least one test case in the suite produces a response that deviates from the correct behaviour in a way that corresponds to the planted bug — for example, a 200 OK where a 400 should have been returned, or a silently incorrect computed value in the response body.
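The deviation check described above can be sketched as a pure classification function. The bug-descriptor shape used here (`kind`, `correct_status`, `correct_value`, and so on) is a hypothetical structure for illustration only, not the harness's actual internal format.

```python
# Hedged sketch: classifying one observed API response against one planted bug.
# The bug-descriptor fields below are illustrative assumptions.

def triggers_bug(observed_status, observed_body, bug):
    """Return True if the observed response deviates in the way this planted bug predicts."""
    if bug["kind"] == "wrong_status":
        # e.g. a 200 OK where the correct behaviour is a 400 rejection
        return (observed_status == bug["buggy_status"]
                and observed_status != bug["correct_status"])
    if bug["kind"] == "wrong_value":
        # e.g. a silently incorrect computed value in the response body
        return observed_body.get(bug["field"]) != bug["correct_value"]
    return False

# A payload that should be rejected (400) but was accepted (200) triggers the bug:
status_bug = {"kind": "wrong_status", "buggy_status": 200, "correct_status": 400}
print(triggers_bug(200, {}, status_bug))  # True
```

A bug counts as detected as soon as any single test case in the suite makes this check pass, which is why one well-aimed test is worth more than many redundant ones.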
Scoring
The final score combines three factors, weighted to emphasise real-world value: finding bugs matters most, systematic coverage rewards thoroughness, and efficiency discourages noise.
Bug Detection Score — Primary (70%)
Measures how many of the planted bugs were successfully triggered. This is the core metric of the benchmark — an agent that finds more bugs scores higher, regardless of how it gets there.
Bug Detection Rate = bugs_found / total_bugs
Range: 0 – 1. A score of 1 means every planted bug was triggered; 0 means none were. Scores below 0.3 indicate the agent is missing most bugs; above 0.7 is considered strong performance on a scenario.
Coverage Score — 20%
Measures how well the test suite explores the API surface across three independently computed dimensions. Each dimension produces a value between 0 and 1; the three are averaged to produce the final Coverage Score.
Coverage Score = (param_coverage + edge_coverage + variation_score) / 3
Range: 0 – 1. All three sub-dimensions are individually bounded [0, 1], so the average is too. A score of 1 requires full field coverage, edge tests on every field, and completely non-overlapping payloads — a high bar that rewards comprehensive, systematic suites.
param_coverage = fields_exercised / total_schema_fields

The fraction of schema fields exercised by at least one test case.

edge_coverage = fields_with_edge_test / total_schema_fields

A field counts as edge-tested when at least one test probes it with an edge value: null, "", [], wrong type, zero or negative number, or out-of-range value.

variation_score = 1 − mean(Jaccard(tᵢ, tⱼ)) ∀ i ≠ j

Pairwise overlap is measured with the Jaccard similarity between tests represented as sets of (field, value) pairs; the less the payloads overlap, the higher the variation score.
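The three sub-metrics can be sketched as follows. This assumes each test payload is a flat dict of field to value; nested-field handling and the precise edge-value rules are simplifying assumptions, not the harness's exact logic.

```python
# Hedged sketch of the Coverage Score: average of param_coverage,
# edge_coverage, and variation_score, each bounded [0, 1].
from itertools import combinations

def is_edge(value):
    """Illustrative edge-value detector: null, empty string/list, non-positive number."""
    if value is None or value == "" or value == []:
        return True
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float)) and value <= 0

def coverage_score(tests, schema_fields):
    schema = set(schema_fields)
    exercised, edge_tested = set(), set()
    for t in tests:
        for field, value in t.items():
            exercised.add(field)
            if is_edge(value):
                edge_tested.add(field)
    param_cov = len(exercised & schema) / len(schema)
    edge_cov = len(edge_tested & schema) / len(schema)
    # variation: 1 - mean pairwise Jaccard similarity over (field, value) pair sets
    # (repr makes unhashable values like lists usable as set members)
    sets = [frozenset((f, repr(v)) for f, v in t.items()) for t in tests]
    pairs = list(combinations(sets, 2))
    mean_jaccard = (
        sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs) if pairs else 0.0
    )
    variation = 1.0 - mean_jaccard
    return (param_cov + edge_cov + variation) / 3

tests = [
    {"user_id": "usr_4821", "currency": "USD"},
    {"user_id": None, "currency": "USD"},  # null edge probe on user_id
]
schema = ["user_id", "items", "coupon_code", "currency", "shipping"]
print(round(coverage_score(tests, schema), 3))  # 0.422
```

Note how the metric rewards breadth: this tiny suite touches only two of five schema fields and repeats a payload value, so all three sub-scores stay well below 1.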
Efficiency Score — 10%
Penalises unnecessarily large test suites. A suite that finds 6 bugs with 10 tests is more valuable than one that finds the same 6 bugs with 80 tests.
Efficiency = min(1, bugs_found / number_of_tests)
Range: 0 – 1. The raw ratio is capped at 1 to keep the metric bounded. A score of 1 means the suite finds at least one bug per test — the theoretical ideal. The score degrades linearly as redundant tests accumulate: a suite with 5× more tests than bugs found scores 0.2. An agent that finds no bugs scores 0 regardless of suite size.
Final Score Formula
Final Score = 0.7 × Bug Detection Rate + 0.2 × Coverage Score + 0.1 × Efficiency Score
The final benchmark score for an agent is the average Final Score across all 20 scenarios. Since all three components are bounded [0, 1], the Final Score is also bounded [0, 1].
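Putting the weights together, a per-scenario score might be computed as below. The component values in the example are illustrative, not real benchmark results; the Coverage Score is assumed to be precomputed as described earlier.

```python
# Hedged sketch of the published per-scenario weighting:
# 0.7 * detection + 0.2 * coverage + 0.1 * efficiency, all bounded [0, 1].

def final_score(bugs_found, total_bugs, coverage, n_tests):
    detection = bugs_found / total_bugs
    # efficiency: at least one bug per test is the capped ideal
    efficiency = min(1.0, bugs_found / n_tests) if n_tests else 0.0
    return 0.7 * detection + 0.2 * coverage + 0.1 * efficiency

# A suite of 10 tests finding 6 of 8 planted bugs with coverage 0.5:
print(round(final_score(6, 8, 0.5, 10), 3))  # 0.685
```

The benchmark-level score would then be the mean of this value across the 20 scenarios.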
Why This Benchmark Matters
APIEval-20 evaluates a capability that is largely unmeasured today. It goes beyond simple code generation or factual reasoning — it measures something more practically valuable.
How well can an AI agent think like a QA engineer? Most existing benchmarks evaluate whether a model can produce syntactically correct output. APIEval-20 evaluates whether an agent can do useful work — work that directly maps to a real engineering task with measurable outcomes.
Explore on Hugging Face
The complete APIEval-20 dataset — all 20 scenarios, request schemas, sample payloads, domain metadata, and the evaluation harness — is hosted on Hugging Face. All scenarios are versioned; subsequent releases carry a version suffix (e.g. APIEval-20-v2) to enable longitudinal comparison.
Head over to the dataset page to browse the scenarios, run evaluations through the hosted harness, and see how various AI agents stack up across all 20 scenarios.
What Comes Next
APIEval-20 is a functional testing benchmark. Every scenario, every planted bug, and every scoring dimension is scoped to functional correctness — how well an agent validates that an API behaves as intended given valid and invalid inputs. Security vulnerabilities, authentication bypasses, injection attacks, and authorization failures are explicitly out of scope here.
This is the first entry in what we plan to be a growing family of API testing benchmarks, each targeting a distinct testing discipline. Coming soon: