AI Agent Benchmark: API Bug Detection

A black-box evaluation of how well AI-generated tests find functional bugs in live APIs.

20 live API scenarios, 7 application domains, 97 planted functional bugs, 7 agents and models, 5 workflow modes.

Evaluated using APIEval-20 v1.0, an open benchmark contributed by KushoAI.

Executive Summary

AI coding tools can generate API tests quickly. The harder question is whether those tests find bugs.

This report evaluates that question across 20 live API scenarios with 97 planted functional bugs. Each system receives only a JSON schema and one valid sample payload, then must generate API test cases that expose failures in a live reference API. No source code. No documentation beyond the schema. No hints about where failures are planted.

The evaluation uses APIEval-20 v1.0, a black-box benchmark contributed by KushoAI. Because KushoAI is also one of the evaluated systems, this report includes the methodology, prompts, workflow definitions, repeated-run setup, per-scenario results, and robustness checks so readers can inspect where the performance difference comes from.

Seven systems were compared across three groups: general-purpose LLMs, coding agents, and KushoAI. The report also compares workflow modes engineering teams commonly try in practice: one-shot prompting, structured test-strategy prompting, prompt chaining, native coding-agent workflows, and native API test generation.

Simple structural bugs are no longer a meaningful differentiator. Most systems can generate missing-field, null, empty-array, and wrong-type tests. These tests are useful, but they are also the easiest class of failures to discover from the schema alone.
Prompt engineering improves coverage, but only modestly improves bug finding. It makes suites broader, more explicit, and easier to parse, but it does not consistently make general coding tools reason about cross-field business states.
The gap opens on complex bugs. KushoAI detects 76% of complex planted bugs in this benchmark, compared with 53% for the strongest coding-agent workflow and 34% for the strongest general-purpose LLM.

KushoAI ranks first on the primary score and across all bug complexity tiers. The largest margin appears on the metric most tied to production risk: cross-field and business-logic bug detection.

Key Findings

1. Test generation is easy to fake. Bug finding is harder.

Several systems generated plausible suites with readable names and valid payloads. The difference appeared only after running those suites against live APIs with known planted failures.

2. Simple schema-level tests are now table stakes.

Most systems can generate tests for missing fields, null values, empty arrays, and wrong types.

3. Prompting helps breadth more than depth.

Structured prompts improved coverage, JSON validity, and field-level negative tests. They did not consistently produce cross-field business-logic tests.

4. Complex bugs separate field mutation from API test design.

The hardest bugs required combining individually valid fields into invalid states, such as contradictory billing and shipping logic, invalid refund state, role hierarchy violations, or conflicting recurrence rules.

5. Test composition matters more than test volume.

Coding-agent workflows often generated many tests. The gap came from whether those tests explored meaningful field interactions.

Why API Bug Detection Needs a Different Evaluation

Most API test generation comparisons ask whether a tool can produce tests. That is too low a bar. Any current LLM can generate a list of plausible tests from an API schema.

The test names may sound comprehensive, and the payloads may be syntactically valid, but that does not tell an engineering team whether the suite actually reduces risk. Traditional coding benchmarks usually measure properties like code correctness, task completion, or whether generated tests execute. API testing has a different core objective: finding behavior that violates the intended contract of a live service.

A more useful evaluation question is narrower: Given only the request schema and one valid sample payload, with no source code, no documentation beyond the schema, and no hints about planted failures, can an AI system generate tests that trigger planted functional bugs in a live API?

That is the task evaluated in this report. The evaluation is end to end: the agent reads the schema and sample and constructs a test suite, the suite is executed against live reference implementations, and scoring is determined by which planted bugs are triggered.

This black-box constraint reflects a common practitioner reality. Teams often receive an OpenAPI schema or request payload examples before they have complete documentation, test data, or implementation context. In that setting, a useful testing agent has to infer likely constraints from field names, data types, descriptions, nested structure, and the operation being performed.

The benchmark contains 20 scenarios across e-commerce, payments, authentication, user management, scheduling, notifications, and search/filtering. Across those scenarios, the benchmark contains 97 planted functional bugs: 28 simple, 35 moderate, and 34 complex.

The benchmark does not try to reproduce every production condition. It isolates one capability that matters in production: whether an AI system can generate high-signal API tests from limited information. That makes the comparison controlled, repeatable, and easier to interpret.

Methodology

Every system received the same two inputs per scenario: a JSON schema and one valid sample payload. No implementation code, response schema, logs, changelog, production examples, or planted-bug hints were provided.

Each system had to produce a JSON test suite. Each case included a test_name and a complete request payload. No expected outcomes were required; the evaluator determines whether a test triggers a planted bug by running it against the live API.
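
As a rough illustration of the suite format described above, the following Python sketch checks that a generated suite is a JSON array whose cases each carry a test_name and a payload. The function name validate_suite and the exact error messages are hypothetical; the benchmark's own format validator is not shown in this report and may apply different or stricter rules.

```python
import json


def validate_suite(raw: str) -> list[dict]:
    """Parse a generated suite and check the two fields every case needs.

    Hypothetical helper: it only mirrors the format described in the text
    (a JSON array of cases, each with a test_name and a request payload).
    """
    suite = json.loads(raw)
    if not isinstance(suite, list):
        raise ValueError("suite must be a JSON array of test cases")
    for index, case in enumerate(suite):
        if not isinstance(case, dict):
            raise ValueError(f"case {index} is not a JSON object")
        name = case.get("test_name")
        if not isinstance(name, str) or not name.strip():
            raise ValueError(f"case {index} has no test_name")
        # Only presence is checked: a deliberately malformed payload is
        # still a legitimate negative test.
        if "payload" not in case:
            raise ValueError(f"case {index} has no payload")
    return suite


example = '[{"test_name": "missing amount", "payload": {"currency": "USD"}}]'
print(len(validate_suite(example)), "case(s) passed the format check")
```

A check like this only confirms that a suite can be executed. It says nothing about whether any case reaches a planted bug, which is what the evaluator measures.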

The schema and sample payload together represent the minimum useful context a tester might have. The schema tells the agent what fields exist and what constraints are explicit. The sample payload shows how the API is normally used. The benchmark intentionally withholds everything else so that systems cannot rely on implementation leakage or hand-written documentation that points directly at the failure modes.

This keeps the task focused on test generation rather than assertion writing. A system is not rewarded for writing a confident expected outcome unless the request payload actually reaches a planted bug.

Category	Systems	How they were used
General-purpose LLMs	GPT-5, Claude Sonnet 4.6, Gemini 2.5 Pro	API/chat mode with a structured JSON-output prompt.
Coding agents	Claude Code, Cursor, GitHub Copilot	Native agentic workflow with schema files and prompt instructions.
API testing agent	KushoAI	Native API test generation workflow.

Workflow Modes Compared

The workflow comparison is included because teams rarely stop after a single prompt. In practice, engineers try a one-shot prompt, then make the prompt more explicit, then ask the tool to review its own gaps, then build local scripts around the process.

Workflow mode	Description	Systems included
One-shot prompt	Generate tests from the schema and sample payload in a single pass.	General LLMs and coding-agent baseline runs
Structured strategy prompt	Adds explicit instructions for required fields, invalid types, formats, enums, boundaries, and negative cases.	General LLMs and coding agents
Per-scenario prompt chain	One prompt to infer the strategy, one to generate tests, one to review gaps, and one to emit final JSON.	Coding agents
Native coding-agent workflow	Agent reads local scenario files, writes suites to disk, and revises after format validation.	Claude Code, Cursor, Copilot
KushoAI native workflow	Purpose-built API testing generation with internal field analysis and cross-field candidate construction.	KushoAI

Bug Complexity Tiers

Tier	Definition	Examples
Simple	No semantic domain understanding required.	Missing required field, `null`, wrong type, empty array.
Moderate	Requires understanding field meaning or documented constraints.	Invalid currency code, malformed email, out-of-range rating, invalid enum.
Complex	Requires reasoning about relationships between fields or operation semantics.	Mutually exclusive fields, refund amount greater than original transaction, date range where end precedes start.

Scoring Formula

Final Score =
  0.70 x Bug Detection Rate
+ 0.20 x Coverage Score
+ 0.10 x Efficiency Score

Bug Detection Rate = bugs_triggered / total_planted_bugs
Coverage Score = param_coverage
Efficiency Score = min(1, bugs_triggered / number_of_tests)

Bug detection is weighted most heavily because tests that do not find bugs have limited engineering value, even if they look broad. Coverage rewards suites that exercise the API surface by touching each top-level schema field. Efficiency penalizes suites that bury a few useful cases inside a large amount of redundant noise.
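
The formula can be sketched in Python as follows. This is a minimal illustration, not the APIEval-20 evaluator: the function name final_score and the example numbers are hypothetical, the bug count in the efficiency term is assumed to be the same triggered-bug count used for detection, and scoring an empty suite with zero efficiency is an assumption.

```python
def final_score(bugs_triggered: int, total_planted_bugs: int,
                param_coverage: float, number_of_tests: int) -> dict[str, float]:
    """Combine the three components with the 0.70 / 0.20 / 0.10 weights."""
    if total_planted_bugs <= 0:
        raise ValueError("total_planted_bugs must be positive")
    detection = bugs_triggered / total_planted_bugs
    # Assumption: an empty suite gets zero efficiency instead of dividing by zero.
    if number_of_tests > 0:
        efficiency = min(1.0, bugs_triggered / number_of_tests)
    else:
        efficiency = 0.0
    final = 0.70 * detection + 0.20 * param_coverage + 0.10 * efficiency
    return {
        "bug_detection_rate": detection,
        "coverage_score": param_coverage,
        "efficiency_score": efficiency,
        "final_score": final,
    }


# Hypothetical scenario: 5 planted bugs, 4 triggered, full coverage, 20 tests.
scores = final_score(bugs_triggered=4, total_planted_bugs=5,
                     param_coverage=1.0, number_of_tests=20)
print(round(scores["final_score"], 2))  # 0.70*0.8 + 0.20*1.0 + 0.10*0.2 = 0.78
```

In this example the suite triggers most bugs but spends 20 tests to do so, so efficiency contributes only 0.02 to the final score. The weighting makes that a small penalty next to a missed bug.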

Overall Results

Rank	System	Category	Best workflow	Bug detect	Coverage	Efficiency	Final	Std dev
1	KushoAI	API testing agent	Native KushoAI	0.89	1.00	0.14	0.83	+/-0.03
2	Claude Code	Coding agent	Prompt chain	0.76	0.98	0.18	0.76	+/-0.05
3	Cursor	Coding agent	Prompt chain	0.70	0.95	0.16	0.70	+/-0.07
4	GitHub Copilot	Coding agent	Structured prompt	0.64	0.92	0.14	0.64	+/-0.08
5	Claude Sonnet 4.6	General LLM	Structured prompt	0.60	0.90	0.20	0.62	+/-0.09
6	GPT-5	General LLM	Structured prompt	0.56	0.88	0.18	0.58	+/-0.08
7	Gemini 2.5 Pro	General LLM	Structured prompt	0.49	0.82	0.17	0.51	+/-0.10

Mean final score by system: KushoAI 0.83, Claude Code 0.76, Cursor 0.70, Copilot 0.64, Sonnet 4.6 0.62, GPT-5 0.58, Gemini 2.5 Pro 0.51.

KushoAI's advantage is not just full parameter coverage. It has the strongest bug detection rate, the strongest complex-bug rate, and the lowest run-to-run variance.

Coverage and bug detection diverge. A model can touch many fields and still miss the failure. For example, a suite may test currency with an empty string and amount with zero, but never test the combination where each field is individually valid and the overall payment state is invalid. APIEval-20 rewards the latter because that is what reveals the planted bug.

Coding agents outperform raw chat models because they handle local files, format correction, and iterative generation better. Their scores reflect real workflow advantages. The remaining gap is between general software engineering agents and a system built specifically for API testing.

For engineering teams, consistency matters as much as peak score. A tool that produces a strong suite in one run and a weak suite in the next creates review overhead. Lower variance means fewer manual retries and a more reliable path into CI.

Prompting Helps, But It Does Not Close the Gap

The practical question for engineering teams is whether better prompting can make general AI coding tools competitive with a purpose-built API testing agent. It helps. It does not close the gap.

This section matters because "just write a better prompt" is the default reaction to weak AI-generated tests. Better prompts do improve output. They make the model enumerate more fields, include more boundary values, and return cleaner JSON. But in this benchmark, the main weakness is not lack of instruction-following. It is a limited ability to infer meaningful invalid states from the shape of the API.

Workflow	Avg bug detect	Avg coverage	Avg final	Human setup and review
Chat LLM, one-shot	0.48	0.82	0.52	5-10 min per scenario, manual copy-paste
Chat LLM, structured prompt	0.58	0.90	0.61	15-25 min per scenario, manual cleanup
Coding agent, one-shot	0.62	0.89	0.63	10-15 min per scenario
Coding agent, structured prompt	0.68	0.93	0.68	20-35 min per scenario
Coding agent, prompt chain	0.71	0.95	0.71	35-50 min per scenario
KushoAI native	0.89	1.00	0.83	Single upload/run

Structured prompting increases required-field coverage, invalid type tests, basic boundary tests, and format tests for emails, currency codes, phone numbers, dates, and enums. Prompt chaining adds another improvement because the agent can first infer a test strategy, then generate tests, then review its own gaps.

The remaining gap is concentrated in complex tests: fields that are individually valid but invalid together, optional fields whose validity changes depending on another field, and business states that require combining arrays, nested objects, and operation semantics.

In the coding-agent workflow, the prompt chain generally produced the best non-Kusho result. The first pass identified fields and likely risk areas. The second generated tests. The third asked the agent to find missing categories such as boundary values and nested object combinations, and a final pass emitted the suite as JSON. The chain improved coverage, but it also increased human setup time and review effort.

The reason the gap remains is visible in the generated suites. Prompted coding agents tend to become more exhaustive along a field-by-field axis. They produce more missing-field tests, null tests, wrong-type tests, and boundary tests. Those are good additions, but they are still mostly independent mutations. The harder API bugs live in relationships: start and end time, role and permission, refund and original transaction, gift flag and price visibility, channel activation and verification state.
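
The difference between independent mutations and relationship tests can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a description of how any evaluated system works: the helper names, the start and end fields, and all values are hypothetical.

```python
import copy
from itertools import combinations, product


def single_field_mutations(base: dict, invalid: dict[str, list]) -> list[dict]:
    """Field-by-field axis: one invalid value at a time, everything else as sampled."""
    cases = []
    for field, values in invalid.items():
        for value in values:
            payload = copy.deepcopy(base)
            payload[field] = value
            cases.append({"test_name": f"{field} set to {value!r}", "payload": payload})
    return cases


def pairwise_valid_combinations(base: dict, valid: dict[str, list]) -> list[dict]:
    """Relationship axis: combine individually valid values for each pair of fields."""
    cases = []
    for first, second in combinations(sorted(valid), 2):
        for v1, v2 in product(valid[first], valid[second]):
            payload = copy.deepcopy(base)
            payload[first], payload[second] = v1, v2
            cases.append({"test_name": f"{first}={v1!r} with {second}={v2!r}",
                          "payload": payload})
    return cases


# Hypothetical scheduling payload; every date below is valid on its own.
base = {"start": "2025-01-10", "end": "2025-01-12"}
singles = single_field_mutations(base, {"start": [None, ""], "end": [None, 12345]})
pairs = pairwise_valid_combinations(
    base, {"start": ["2025-01-10", "2025-01-20"], "end": ["2025-01-12", "2025-01-15"]})

# ISO dates compare correctly as strings, so this finds end-before-start cases.
inverted = [c for c in pairs if c["payload"]["end"] < c["payload"]["start"]]
print(len(singles), len(pairs), len(inverted))
```

No single-field mutation in this sketch can produce an end date that precedes the start date, because each one changes only one field to an invalid value. The inverted ranges appear only in the pairwise set, where every value would pass field-level validation.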

The Complexity Cliff

The overall leaderboard shows who wins, but the complexity split shows why. Simple, moderate, and complex bugs are not just severity labels. They represent different kinds of reasoning. Simple bugs can be discovered by mechanically mutating the schema. Moderate bugs require understanding the meaning of a single field. Complex bugs require understanding how fields interact.

0.93: KushoAI simple bug detection. The floor is high across most systems.
0.97: KushoAI moderate bug detection. Field meaning starts to matter.
0.76: KushoAI complex bug detection. This is where the benchmark separates tools.

System	Simple	Moderate	Complex	Simple-to-complex drop
KushoAI	0.93	0.97	0.76	0.16
Claude Code	0.82	0.94	0.53	0.29
Cursor	0.80	0.88	0.47	0.33
GitHub Copilot	0.76	0.80	0.39	0.37
Claude Sonnet 4.6	0.73	0.76	0.34	0.39
GPT-5	0.70	0.72	0.30	0.40
Gemini 2.5 Pro	0.63	0.64	0.23	0.40

Simple bugs are schema-mutation tests: omit required fields, send null, use the wrong type, empty an array. The weakest system still detects 63% of simple bugs.

This is why simple-bug performance should not be overinterpreted. A high simple score means the system can systematically walk the schema and produce basic negative inputs. That is necessary, but it is no longer sufficient. Many validators, contract tests, and schema-driven fuzzers can already cover much of this territory.

Moderate bugs require the agent to understand what a field means. A currency string is not just any string. A rating has a valid range. A locale should look like a locale. A page_size field may have a documented maximum. Systems that read descriptions and infer common standards do better here; systems that treat every string as interchangeable miss many of these cases.

Complex bugs require combining fields. This is where most systems still fall sharply. Four of seven systems detect fewer than 40% of complex bugs. The strongest coding-agent workflow detects 53%. KushoAI detects 76%.

This is the production-relevant cliff. Many serious API failures occur after basic validation has passed. The payload is well-formed. Required fields are present. Types are correct. The failure appears because the combination is invalid: a refund exceeds the captured amount, a recurring event rule conflicts with an exception date, a role assignment violates hierarchy, or a notification channel is enabled before verification has completed.

Example: Notification Preferences

The PUT /api/v1/users/{user_id}/notification-preferences scenario contains nested channel settings for email, SMS, and push, plus frequency, category preferences, quiet hours, digest scheduling, and localization fields. It has five planted bugs: one simple, two moderate, and two complex.

This scenario is useful because the request is not just a flat validation problem. Many fields are individually valid booleans or strings, but the meaning changes when they are combined. KushoAI generated a complex SMS-channel test that Claude Code missed: enabling SMS delivery while the SMS address has not been verified.

Simple test generated by most systems

{
  "test_name": "invalid channel enabled flag type",
  "payload": {
    "user_id": "usr_4821",
    "channels": {
      "email": { "enabled": "yes", "verified": true, "address": "alice@example.com" },
      "sms": { "enabled": false, "verified": false, "address": "+14155550101" },
      "push": { "enabled": true, "verified": true, "address": "device_token_abc123" }
    },
    "frequency": "realtime",
    "categories": { "transactional": true, "security": true }
  }
}

This catches a structural validation issue: channels.email.enabled is supposed to be a boolean, not a truthy string. It is useful, but it is still a field-local test.

Moderate test generated by stronger structured prompts

{
  "test_name": "invalid notification frequency",
  "payload": {
    "user_id": "usr_4821",
    "channels": {
      "email": { "enabled": true, "verified": true, "address": "alice@example.com" },
      "sms": { "enabled": false, "verified": false, "address": "+14155550101" },
      "push": { "enabled": true, "verified": true, "address": "device_token_abc123" }
    },
    "frequency": "hourly",
    "categories": { "transactional": true, "security": true }
  }
}

This requires understanding the allowed frequency values. The payload is structurally valid JSON, but the semantic value is outside the supported notification cadence.

Complex test caught by KushoAI and missed by Claude Code

{
  "test_name": "sms enabled while sms channel is unverified",
  "payload": {
    "user_id": "usr_4821",
    "channels": {
      "email": { "enabled": true, "verified": true, "address": "alice@example.com" },
      "sms": { "enabled": true, "verified": false, "address": "+14155550101" },
      "push": { "enabled": true, "verified": true, "address": "device_token_abc123" }
    },
    "frequency": "realtime",
    "categories": {
      "marketing": false,
      "transactional": true,
      "security": true
    },
    "quiet_hours": { "enabled": false, "start": "22:00", "end": "07:00" },
    "language": "en-US"
  }
}

Every individual value is plausible. The invalid state emerges from the relationship between channel activation and verification state. If SMS is enabled while the SMS channel is unverified, the API should reject the configuration or force verification before accepting it.

This is the kind of case that distinguishes a test generator from a testing agent. The test is not created by making a field empty or assigning the wrong type. It is created by noticing that one valid setting should be conditional on another valid setting.
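
One way to picture that conditional relationship is a small Python sketch that starts from a valid payload and, for each channel, constructs the state where the channel is enabled but unverified. The helper name enabled_while_unverified is hypothetical, and the sketch assumes the nested channels layout shown in the payloads above. It is not a description of KushoAI's internal method.

```python
import copy


def enabled_while_unverified(base_payload: dict) -> list[dict]:
    """Emit one case per channel with enabled=true and verified=false.

    Both values are valid booleans on their own; only the combination is suspect.
    """
    cases = []
    for channel in base_payload.get("channels", {}):
        payload = copy.deepcopy(base_payload)  # leave the sample untouched
        payload["channels"][channel]["enabled"] = True
        payload["channels"][channel]["verified"] = False
        cases.append({
            "test_name": f"{channel} enabled while {channel} channel is unverified",
            "payload": payload,
        })
    return cases


# Valid starting payload, modelled on the examples above.
sample = {
    "user_id": "usr_4821",
    "channels": {
        "email": {"enabled": True, "verified": True, "address": "alice@example.com"},
        "sms": {"enabled": False, "verified": False, "address": "+14155550101"},
        "push": {"enabled": True, "verified": True, "address": "device_token_abc123"},
    },
    "frequency": "realtime",
    "categories": {"transactional": True, "security": True},
}

for case in enabled_while_unverified(sample):
    print(case["test_name"])
```

The sketch hard-codes the enabled and verified relationship, which a human already had to notice. The harder part of the task, and the part the benchmark measures, is inferring that such a relationship exists from the schema and sample alone.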

Implications for Engineering Teams

For teams using chat LLMs: chat LLMs are a reasonable starting point for simple structural checks. They are not enough when the suite needs to catch business-logic failures.

The practical limitation is not only quality; it is workflow. Chat interfaces require manual copy-paste, manual file creation, and manual cleanup when JSON is malformed or when a suite mixes explanation with data. For a single endpoint this may be acceptable. Across 20 scenarios, or across a production spec with hundreds of endpoints, the friction compounds quickly.

For teams using coding agents: coding agents are meaningfully better than chat LLMs because they can operate on files, fix invalid JSON, and run iterative passes. The tradeoff is ownership: scenario splitting, prompt templates, validation, retries, deduplication, review, and reruns when the API changes.

This can be a reasonable path for teams that want to build their own internal testing workflow. The benchmark suggests that coding agents can produce useful suites when given structured prompts and enough iteration. But the engineering team owns the surrounding system: how scenarios are chunked, how prompts are versioned, how outputs are validated, and how generated tests are reviewed before entering CI.

For teams evaluating API testing agents: do not evaluate tools on a trivial CRUD endpoint. Use endpoints where correctness depends on combinations: payments, refunds, subscriptions, permissions, scheduling, inventory, onboarding, or state transitions.

A useful vendor or internal-tool evaluation should include at least one endpoint with nested objects, optional fields, conditional behavior, and business-state constraints. Count tests only after deduplication. Separate field coverage from bug detection. And inspect the complex tests manually: they are often the easiest way to tell whether the system is reasoning about the API or just expanding a checklist.

The inspection question is simple: does the generated suite only mutate fields one at a time, or does it construct invalid business states from individually valid fields?
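
For the advice to count tests only after deduplication, a minimal Python sketch is shown below. It assumes that two cases are duplicates when their payloads are identical regardless of key order or test name. That definition and the helper name deduplicate are assumptions for illustration; an evaluation may reasonably use a looser notion of equivalence.

```python
import json


def deduplicate(suite: list[dict]) -> list[dict]:
    """Keep the first case for each distinct payload, ignoring key order and names."""
    seen = set()
    unique = []
    for case in suite:
        # Canonical form: sorted keys make key order irrelevant.
        key = json.dumps(case["payload"], sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(case)
    return unique


# Hypothetical suite: the first two cases send the same request.
suite = [
    {"test_name": "empty currency", "payload": {"currency": "", "amount": 10}},
    {"test_name": "currency is blank", "payload": {"amount": 10, "currency": ""}},
    {"test_name": "zero amount", "payload": {"currency": "USD", "amount": 0}},
]
unique = deduplicate(suite)
print(f"{len(suite)} generated, {len(unique)} distinct")
```

Counting distinct payloads keeps a tool from looking thorough only because it renamed the same request several times.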

Scope

APIEval-20 is a functional API testing benchmark. It is not a security benchmark. Authentication scenarios in this benchmark test functional correctness of request handling, not vulnerability discovery.

The benchmark input constraint is intentionally narrow: schema plus sample payload only. This makes the task comparable across systems and prevents tools from exploiting implementation details.

The benchmark also does not claim to represent every production API condition. It does not evaluate authentication bypass, authorization flaws, injection attacks, performance, concurrency, multi-step workflows, or stateful end-to-end journeys. Those are important testing domains, but they are outside this report.

To replicate the evaluation: provide each scenario schema and sample payload to the system under test, collect a JSON array of test cases, save one suite file per scenario, run the APIEval-20 evaluator against the hosted reference API and grader, and aggregate per-scenario scores across repeated runs.
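
The final aggregation step can be sketched in Python as follows. The report does not specify how repeated runs are combined, so this sketch makes two assumptions: each run is summarized by the mean of its per-scenario final scores, and the reported spread is the sample standard deviation of those run means. The function name, scenario ids, and scores are hypothetical.

```python
from statistics import mean, stdev


def aggregate_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Summarize repeated runs, each mapping scenario id to final score.

    Assumption: per-run mean over scenarios, then mean and sample standard
    deviation across runs.
    """
    if not runs:
        raise ValueError("at least one run is required")
    run_means = [mean(list(run.values())) for run in runs]
    spread = stdev(run_means) if len(run_means) > 1 else 0.0
    return {"mean_final": mean(run_means), "std_dev": spread}


# Hypothetical scores for two scenarios over three repeated runs.
runs = [
    {"scenario_01": 0.80, "scenario_02": 0.60},
    {"scenario_01": 0.90, "scenario_02": 0.70},
    {"scenario_01": 0.85, "scenario_02": 0.65},
]
summary = aggregate_runs(runs)
print(round(summary["mean_final"], 2), round(summary["std_dev"], 2))
```

Reporting the spread alongside the mean matters for the consistency argument made earlier: two tools with the same mean can differ in how often a single run needs to be retried.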
