APIEval-20
A Benchmark for Black-Box API Test Suite Generation
Motivation
Testing APIs thoroughly is one of the most critical, yet consistently underserved, activities in software engineering. Despite a rich ecosystem of API testing tools — Postman, RestAssured, Schemathesis, Dredd, and others — we found ourselves asking a deceptively simple question:
Given only the schema and an example payload of an API request — no source code, no documentation, no prior knowledge — how well can an AI agent generate a test suite that actually finds bugs?
We searched for an existing benchmark that captured this black-box scenario and came up empty. Every evaluation we found either required access to the implementation, relied on rich API documentation, or measured properties like schema compliance rather than actual bug-finding capability. The practitioner reality is different: teams frequently receive API payloads with little context and need to construct meaningful tests quickly.
That gap is the reason APIEval-20 exists.
APIEval-20 is not a model benchmark. It is a task benchmark for AI agents. It evaluates end-to-end agent behavior — the ability to reason about an API surface, design targeted tests, and uncover real bugs — not just the quality of generated text.
Benchmark Overview
APIEval-20 consists of 20 carefully designed API scenarios drawn from real-world application domains. Each scenario presents the agent with an API request schema and a sample payload, then challenges it to produce a test suite that exposes bugs hidden within a live reference implementation.
Domains Covered
The 20 scenarios span application domains chosen to reflect a broad range of validation patterns, business logic complexity, and security sensitivity.
Bug Spectrum
Each scenario contains between 3 and 8 planted bugs. Rather than categorising bugs by severity, APIEval-20 classifies them by complexity — reflecting how much reasoning is required to discover them. Bugs range along a continuum from simple to complex.
At the simple end, bugs can be triggered with basic structural probes such as empty values ("", null, []) and wrong data types; at the complex end, discovery requires deeper reasoning about the API's behaviour. A strong test suite should span the full complexity spectrum; simple structural checks alone will not surface the bugs that matter most in production.
Agent I/O
What the Agent Receives
For each scenario, the agent is given exactly two inputs. Nothing else — no response schema, no implementation details, no error messages, no changelog. This deliberate constraint reflects the black-box testing reality and prevents agents from trivially exploiting documentation.
Input 1: the request schema.

{
"user_id": { "type": "string", "required": true },
"items": { "type": "array", "required": true,
"items": { "product_id": "string",
"quantity": "integer",
"unit_price": "number" } },
"coupon_code": { "type": "string", "required": false },
"currency": { "type": "string", "required": true,
"description": "ISO 4217 currency code" },
"shipping": { "type": "object", "required": true,
"properties": { "address": "string",
"method": "string" } }
}

Input 2: a sample valid payload.

{
"user_id": "usr_4821",
"items": [
{
"product_id": "prod_991",
"quantity": 2,
"unit_price": 29.99
}
],
"coupon_code": "SAVE10",
"currency": "USD",
"shipping": {
"address": "123 Main St, Springfield",
"method": "standard"
}
}

What the Agent Produces
The agent must output a test suite — a list of test cases, where each test case contains a short human-readable test name and the complete request payload as a valid JSON object. No expected outcome is required. Evaluation is performed by running each test case against the live reference implementation and observing what actually happens.
{
"test_name": "Order with zero quantity item",
"payload": {
"user_id": "usr_4821",
"items": [{ "product_id": "prod_991", "quantity": 0, "unit_price": 29.99 }],
"currency": "USD",
"shipping": { "address": "123 Main St, Springfield", "method": "standard" }
}
}

Evaluation Methodology
All 20 reference API implementations are deployed and running. Evaluation is fully automated: each test case in the agent's output is executed against the live API, and the responses are analysed to determine which planted bugs were triggered.
A bug is considered detected if at least one test case in the suite produces a response that deviates from the correct behaviour in a way that corresponds to the planted bug — for example, a 200 OK where a 400 should have been returned, or a silently incorrect computed value in the response body.
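The deviation check described above can be sketched as a pure classification function. The bug-descriptor shape used here (`kind`, `correct_status`, `correct_value`, and so on) is a hypothetical structure for illustration only, not the harness's actual internal format.

```python
# Hedged sketch: classifying one observed API response against one planted bug.
# The bug-descriptor fields below are illustrative assumptions.

def triggers_bug(observed_status, observed_body, bug):
    """Return True if the observed response deviates in the way this planted bug predicts."""
    if bug["kind"] == "wrong_status":
        # e.g. a 200 OK where the correct behaviour is a 400 rejection
        return (observed_status == bug["buggy_status"]
                and observed_status != bug["correct_status"])
    if bug["kind"] == "wrong_value":
        # e.g. a silently incorrect computed value in the response body
        return observed_body.get(bug["field"]) != bug["correct_value"]
    return False

# A payload that should be rejected (400) but was accepted (200) triggers the bug:
status_bug = {"kind": "wrong_status", "buggy_status": 200, "correct_status": 400}
print(triggers_bug(200, {}, status_bug))  # True
```

A bug counts as detected as soon as any single test case in the suite makes this check pass, which is why one well-aimed test is worth more than many redundant ones.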
Scoring
The final score combines three factors, weighted to emphasise real-world value: finding bugs matters most, systematic coverage rewards thoroughness, and efficiency discourages noise.
Bug Detection Score — Primary (70%)
Measures how many of the planted bugs were successfully triggered. This is the core metric of the benchmark — an agent that finds more bugs scores higher, regardless of how it gets there.
Bug Detection Rate = bugs_found / total_bugs
Range: 0 – 1. A score of 1 means every planted bug was triggered; 0 means none were. Scores below 0.3 indicate the agent is missing most bugs; above 0.7 is considered strong performance on a scenario.
Coverage Score — 20%
Measures how well the test suite explores the API surface across three independently computed dimensions. Each dimension produces a value between 0 and 1; the three are averaged to produce the final Coverage Score.
Coverage Score = (param_coverage + edge_coverage + variation_score) / 3
Range: 0 – 1. All three sub-dimensions are individually bounded [0, 1], so the average is too. A score of 1 requires full field coverage, edge tests on every field, and completely non-overlapping payloads — a high bar that rewards comprehensive, systematic suites.
param_coverage = fields_exercised / total_schema_fields

The fraction of schema fields exercised by at least one test case.

edge_coverage = fields_with_edge_test / total_schema_fields

A field counts as edge-tested when at least one test probes it with an edge value: null, "", [], wrong type, zero or negative number, or out-of-range value.

variation_score = 1 − mean(Jaccard(tᵢ, tⱼ)) ∀ i ≠ j

Pairwise overlap is measured with the Jaccard similarity between tests represented as sets of (field, value) pairs; the less the payloads overlap, the higher the variation score.
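The three sub-metrics can be sketched as follows. This assumes each test payload is a flat dict of field to value; nested-field handling and the precise edge-value rules are simplifying assumptions, not the harness's exact logic.

```python
# Hedged sketch of the Coverage Score: average of param_coverage,
# edge_coverage, and variation_score, each bounded [0, 1].
from itertools import combinations

def is_edge(value):
    """Illustrative edge-value detector: null, empty string/list, non-positive number."""
    if value is None or value == "" or value == []:
        return True
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float)) and value <= 0

def coverage_score(tests, schema_fields):
    schema = set(schema_fields)
    exercised, edge_tested = set(), set()
    for t in tests:
        for field, value in t.items():
            exercised.add(field)
            if is_edge(value):
                edge_tested.add(field)
    param_cov = len(exercised & schema) / len(schema)
    edge_cov = len(edge_tested & schema) / len(schema)
    # variation: 1 - mean pairwise Jaccard similarity over (field, value) pair sets
    # (repr makes unhashable values like lists usable as set members)
    sets = [frozenset((f, repr(v)) for f, v in t.items()) for t in tests]
    pairs = list(combinations(sets, 2))
    mean_jaccard = (
        sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs) if pairs else 0.0
    )
    variation = 1.0 - mean_jaccard
    return (param_cov + edge_cov + variation) / 3

tests = [
    {"user_id": "usr_4821", "currency": "USD"},
    {"user_id": None, "currency": "USD"},  # null edge probe on user_id
]
schema = ["user_id", "items", "coupon_code", "currency", "shipping"]
print(round(coverage_score(tests, schema), 3))  # 0.422
```

Note how the metric rewards breadth: this tiny suite touches only two of five schema fields and repeats a payload value, so all three sub-scores stay well below 1.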
Efficiency Score — 10%
Penalises unnecessarily large test suites. A suite that finds 6 bugs with 10 tests is more valuable than one that finds the same 6 bugs with 80 tests.
Efficiency = min(1, bugs_found / number_of_tests)
Range: 0 – 1. The raw ratio is capped at 1 to keep the metric bounded. A score of 1 means the suite finds at least one bug per test — the theoretical ideal. The score degrades linearly as redundant tests accumulate: a suite with 5× more tests than bugs found scores 0.2. An agent that finds no bugs scores 0 regardless of suite size.
Final Score Formula
Final Score = 0.7 × Bug Detection Rate + 0.2 × Coverage Score + 0.1 × Efficiency Score
The final benchmark score for an agent is the average Final Score across all 20 scenarios. Since all three components are bounded [0, 1], the Final Score is also bounded [0, 1].
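Putting the weights together, a per-scenario score might be computed as below. The component values in the example are illustrative, not real benchmark results; the Coverage Score is assumed to be precomputed as described earlier.

```python
# Hedged sketch of the published per-scenario weighting:
# 0.7 * detection + 0.2 * coverage + 0.1 * efficiency, all bounded [0, 1].

def final_score(bugs_found, total_bugs, coverage, n_tests):
    detection = bugs_found / total_bugs
    # efficiency: at least one bug per test is the capped ideal
    efficiency = min(1.0, bugs_found / n_tests) if n_tests else 0.0
    return 0.7 * detection + 0.2 * coverage + 0.1 * efficiency

# A suite of 10 tests finding 6 of 8 planted bugs with coverage 0.5:
print(round(final_score(6, 8, 0.5, 10), 3))  # 0.685
```

The benchmark-level score would then be the mean of this value across the 20 scenarios.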
Why This Benchmark Matters
APIEval-20 evaluates a capability that is largely unmeasured today. It goes beyond simple code generation or factual reasoning — it measures something more practically valuable.
How well can an AI agent think like a QA engineer? Most existing benchmarks evaluate whether a model can produce syntactically correct output. APIEval-20 evaluates whether an agent can do useful work — work that directly maps to a real engineering task with measurable outcomes.
Explore on Hugging Face
The complete APIEval-20 dataset — all 20 scenarios, request schemas, sample payloads, domain metadata, and the evaluation harness — is hosted on Hugging Face. All scenarios are versioned; subsequent releases carry a version suffix (e.g. APIEval-20-v2) to enable longitudinal comparison.
Head over to the dataset page to browse the scenarios, run evaluations through the hosted harness, and see how various AI agents stack up across all 20 scenarios.
What Comes Next
APIEval-20 is a functional testing benchmark. Every scenario, every planted bug, and every scoring dimension is scoped to functional correctness — how well an agent validates that an API behaves as intended given valid and invalid inputs. Security vulnerabilities, authentication bypasses, injection attacks, and authorization failures are explicitly out of scope here.
This is the first entry in what we plan to be a growing family of API testing benchmarks, each targeting a distinct testing discipline. Coming soon: