
EU AI Act Article 15: What Robustness Testing Evidence You Need Before August 2, 2026

EU AI Act enforcement begins August 2, 2026. Learn which articles require AI testing evidence, what Article 15 robustness testing looks like, and how to prepare.

eu-ai-act · ai-compliance · robustness-testing · regulation

August 2, 2026. That's the date the EU AI Act's requirements for high-risk AI systems become enforceable. As of this writing, that's five months away.

Most companies deploying AI in European markets haven't started preparing their testing evidence. Some haven't even determined whether their systems qualify as high-risk. A significant number haven't read the regulation at all — they're hoping someone else will figure it out and they can copy the approach.

That strategy worked with GDPR. Companies had years of buildup, an ecosystem of compliance vendors, and a gradual enforcement posture. The EU AI Act is different. The regulation is more prescriptive about technical requirements. The testing obligations are explicit. And the penalties — up to 35 million euros or 7% of global annual turnover, whichever is higher — are designed to make compliance cheaper than non-compliance.

If your AI system serves EU users, this deadline applies to you regardless of where your company is headquartered. And if your AI system falls into a high-risk category, Article 15 has specific things to say about what testing evidence you need to produce.

The EU AI Act Enforcement Timeline

The EU AI Act entered into force on August 1, 2024. But enforcement is phased:

  • February 2, 2025: Prohibited AI practices banned (social scoring, real-time biometric identification in public spaces, emotion recognition in workplaces and schools).
  • August 2, 2025: Governance structures and general-purpose AI model obligations take effect.
  • August 2, 2026: Full enforcement of high-risk AI system requirements — including Articles 9, 11, 12, 13, 14, and 15. This is the date that matters for testing evidence.

The phased approach was intentional. The EU gave companies two years from entry into force to get their high-risk AI systems into compliance. That window is closing.

What makes this different from most compliance deadlines: the EU AI Act doesn't just require you to have policies. It requires you to demonstrate technical capabilities. Article 15 doesn't say "have a robustness policy." It says your AI system shall be designed and developed with appropriate levels of robustness, and you need to show how you verified that.

High-Risk AI System Classification: Does This Apply to You?

Before diving into what testing evidence Article 15 requires, you need to determine whether your AI system is classified as high-risk. This is the threshold question, and many companies get it wrong in both directions — either assuming they're exempt when they're not, or assuming every AI system is high-risk when only specific categories qualify.

The EU AI Act classifies AI systems as high-risk in two ways:

Annex I — Safety component AI. If your AI is used as a safety component of a product covered by existing EU harmonized legislation (medical devices, machinery, automotive, aviation, marine equipment, toys, lifts, pressure equipment, radio equipment, cableways, personal protective equipment), it's high-risk.

Annex III — Standalone high-risk categories. These are the ones that catch most companies off guard:

  • Biometric identification and categorization — facial recognition, voice biometrics, emotion detection
  • Critical infrastructure management — energy, water, transport, digital infrastructure
  • Education and vocational training — AI used for admissions, grading, exam proctoring, learning assessment
  • Employment and worker management — AI for recruitment, screening, hiring decisions, performance evaluation, promotion decisions, termination decisions
  • Access to essential services — credit scoring, insurance pricing, social benefit eligibility, emergency services dispatch
  • Law enforcement — predictive policing, criminal risk assessment, evidence analysis
  • Migration and border control — visa processing, asylum assessment, border surveillance
  • Administration of justice — sentencing guidance, case law analysis, dispute resolution

If your AI system is used in any of these contexts — even if that wasn't your primary intended use — you likely have high-risk obligations under the EU AI Act.

The gray area: many SaaS companies build general-purpose AI tools that their customers then deploy in high-risk contexts. An AI document processing platform might seem low-risk until a customer uses it for insurance claim adjudication. An AI chatbot framework is general-purpose until a healthcare provider deploys it for patient triage. In these cases, your obligations depend on whether you knew or reasonably should have known about the high-risk deployment.

For companies operating in financial services, healthcare, insurance, legal, education, or HR technology — assume you're high-risk until you've confirmed otherwise. The cost of compliance is far lower than the cost of being wrong.

Article 15: Robustness and Cybersecurity Requirements

Article 15 is the testing article. It's titled "Accuracy, robustness and cybersecurity" and it contains the EU AI Act's most direct requirements for AI testing evidence.

The article establishes three pillars:

Accuracy

High-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, and accuracy levels shall be declared in the accompanying documentation.

What this means in practice: you need to measure and document how accurately your AI system performs its intended function, and you need to disclose those accuracy metrics. This applies to classification accuracy, prediction accuracy, generation quality — whatever "accuracy" means for your specific system.
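In practice, the declared figures should be reproducible from a script, not asserted in prose. A minimal sketch for a binary classifier, with illustrative labels (the metric names, function, and data here are assumptions for illustration, not regulatory language):

```python
def accuracy_metrics(y_true, y_pred):
    """Compute the accuracy figures to declare in accompanying documentation."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = len(y_true)
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }

# Illustrative labels: 1 = positive decision, 0 = negative decision
metrics = accuracy_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

The same pattern extends to whatever metric fits your system: generation quality scores, retrieval precision, or task-specific measures.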

Robustness

High-risk AI systems shall be designed and developed with appropriate levels of robustness. This includes resilience against errors, faults, and inconsistencies that may occur within the system or its operating environment.

This is the provision that maps directly to adversarial testing. "Robustness" in the context of AI systems means the system continues to perform correctly when faced with unexpected, malformed, or deliberately adversarial inputs. If someone can craft a prompt that causes your AI to produce unauthorized outputs, reveal confidential information, or behave outside its intended boundaries, your system is not robust.

What testing evidence this requires:

  • Prompt injection testing. Can adversarial inputs override the system's instructions? This is the most fundamental robustness test for any LLM-based system. Article 15 robustness requirements map directly to testing for OWASP LLM Top 10 LLM01 (Prompt Injection).
  • Input perturbation testing. Does the system produce wildly different outputs for semantically equivalent inputs? A robust system should behave consistently when the input meaning is preserved but the phrasing changes.
  • Edge case and boundary testing. How does the system handle inputs at the boundaries of its intended operating domain? What happens when inputs are empty, extremely long, in unexpected languages, or contain special characters?
  • Adversarial example testing. Can carefully crafted inputs — designed to exploit the model's learned patterns — cause misclassification, incorrect outputs, or safety boundary violations?
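The input perturbation test above can be sketched in a few lines, assuming a hypothetical `call_model` stub in place of your real endpoint: semantically equivalent phrasings should yield the same decision.

```python
def call_model(prompt: str) -> str:
    # Placeholder: a real harness would call the deployed AI endpoint here.
    return "APPROVED" if "refund" in prompt.lower() else "DENIED"

def perturbation_test(variants: list[str]) -> dict:
    """Flag inconsistency when equivalent inputs yield different outputs."""
    outputs = {v: call_model(v) for v in variants}
    consistent = len(set(outputs.values())) == 1
    return {"consistent": consistent, "outputs": outputs}

# Three phrasings of the same request; a robust system treats them alike.
result = perturbation_test([
    "Please process my refund request.",
    "I would like a refund, please.",
    "Kindly issue a REFUND for my order.",
])
```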

The critical point: Article 15 provides that robustness may be achieved through technical redundancy solutions, which may include backup plans and fail-safe mechanisms. For AI systems, this means testing must verify not just that the system handles adversarial inputs gracefully, but that fail-safe mechanisms actually activate when they should.

Cybersecurity

High-risk AI systems shall be resilient against attempts by unauthorized third parties to exploit system vulnerabilities to alter their use, outputs, or performance. This includes AI-specific attack vectors like data poisoning, adversarial examples, and model manipulation.

The regulation explicitly names several AI-specific cybersecurity threats:

  • Data poisoning — attacks that corrupt the training data to influence model behavior
  • Adversarial examples — inputs designed to cause the model to produce incorrect outputs
  • Model flaws — vulnerabilities inherent in the model architecture or training process
  • Confidentiality attacks — attempts to extract sensitive information from the model (training data extraction, membership inference)

What testing evidence this requires:

  • PII and data leakage testing. Can adversarial queries extract training data, personal information, or confidential context from the model? This is a direct cybersecurity requirement under Article 15.
  • System prompt extraction testing. Can an attacker extract the system prompt — effectively reverse-engineering the system's instructions — through crafted inputs?
  • Output manipulation testing. Can an attacker influence the model to produce outputs that serve the attacker's purposes rather than the system's intended function?
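One way to sketch the confidentiality checks, assuming an illustrative `SYSTEM_PROMPT_MARKER` string and two simple PII regexes (a real harness would use your actual system prompt text and a much richer detector set):

```python
import re

SYSTEM_PROMPT_MARKER = "You are an internal underwriting assistant"  # illustrative
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like identifier
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def scan_response(text: str) -> list[str]:
    """Return leakage findings detected in a model response."""
    findings = []
    if SYSTEM_PROMPT_MARKER in text:
        findings.append("system_prompt_leak")
    for pattern in PII_PATTERNS:
        if pattern.search(text):
            findings.append("pii_leak")
    return findings

# A response that leaks both system prompt text and an email address:
leaky = "You are an internal underwriting assistant. Contact jane.doe@example.com"
```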

Article 9: Risk Management (Partial Coverage)

Article 9 establishes the risk management requirements for high-risk AI systems. It requires a risk management system that operates throughout the entire lifecycle of the AI system, is regularly and systematically updated, and includes specific steps.

AI endpoint testing contributes to Article 9 compliance, but it doesn't satisfy it fully. Here's what testing covers and what it doesn't:

What testing evidence covers under Article 9:

  • Risk identification and analysis. Testing results identify specific risks — prompt injection vulnerabilities, PII leakage vectors, bias patterns, toxicity susceptibility. These are empirical findings that feed directly into the risk management system required by Article 9(2)(a).
  • Risk estimation and evaluation. Test results with pass/fail rates, severity scores, and attack variant counts provide quantitative risk data. A risk register that says "57 prompt injection variants tested, 3 successful, severity: high" is far more credible than one based on subjective assessment alone.
  • Adoption of risk management measures. Re-testing after remediation demonstrates that risk management measures were adopted and verified. Before-and-after test comparisons are the strongest evidence of effective risk management.
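The risk register entry above can be derived mechanically from test results. A sketch with an illustrative schema and example severity thresholds (the thresholds are assumptions for illustration, not drawn from the regulation):

```python
from dataclasses import dataclass

@dataclass
class RiskEntry:
    """One risk-register line derived from test results (illustrative schema)."""
    category: str
    variants_tested: int
    variants_successful: int

    @property
    def failure_rate(self) -> float:
        return self.variants_successful / self.variants_tested

    @property
    def severity(self) -> str:
        # Illustrative thresholds; calibrate these to your own risk policy.
        if self.failure_rate >= 0.05:
            return "high"
        if self.failure_rate > 0:
            return "medium"
        return "low"

# The example from the text: 57 variants tested, 3 successful
entry = RiskEntry("prompt_injection", variants_tested=57, variants_successful=3)
```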

What testing evidence does NOT cover under Article 9:

  • Risk management system governance. Article 9 requires that the risk management system be documented, maintained, and updated throughout the AI system's lifecycle. Testing provides evidence for the system, but the governance structure — roles, responsibilities, review cadences, escalation procedures — is organizational, not technical.
  • Residual risk communication. Article 9(4) requires that residual risks be communicated to deployers. Testing can quantify residual risks, but the communication to downstream deployers is a process obligation.
  • Testing with representative data. Article 9(6) requires that testing use data that is sufficiently representative of the intended purpose. AI endpoint testing covers adversarial testing against the deployed system, but validating that training data is representative is a data governance concern that falls outside endpoint testing.

Be honest about this boundary. AI endpoint testing is a critical input to your Article 9 risk management system. It is not your entire risk management system.

Articles 11 and 12: Documentation and Record-Keeping (Partial Coverage)

Article 11 — Technical Documentation. High-risk AI systems require comprehensive technical documentation that demonstrates compliance with the regulation. AI testing evidence contributes to this by providing: test methodologies used, test results and findings, remediation actions taken, and re-test verification results. However, Article 11 also requires documentation of training data, model architecture, design choices, and validation methodologies — areas that endpoint testing doesn't address.

Article 12 — Record-Keeping. High-risk AI systems must automatically record events (logs) relevant to identifying risks and modifications. AI testing provides timestamped records of testing events and results, which contribute to the record-keeping requirement. But Article 12 is primarily about runtime logging — recording what the system does in production, not just what it did during testing.

Testing evidence is a meaningful component of your Article 11 and 12 compliance, but it's one component among several.

What Testing Does NOT Cover: Articles 10, 13, and 14

This is where honesty matters. Three articles in the EU AI Act require compliance measures that AI endpoint testing simply does not address:

Article 10 — Data and Data Governance

Article 10 establishes requirements for training, validation, and testing data sets. It requires that data be relevant, sufficiently representative, and free of errors. It addresses data bias at the source — ensuring that the data used to train the model doesn't embed discriminatory patterns.

AI endpoint testing can detect the symptoms of biased training data (biased outputs), but it cannot validate the training data itself. If your model produces biased outputs, testing will flag it. But demonstrating that your training data governance meets Article 10 requirements is a data management and documentation exercise, not an endpoint testing exercise.

Article 13 — Transparency and Information Provision

Article 13 requires that high-risk AI systems be designed to be sufficiently transparent to allow deployers to interpret and use the system's output appropriately. This includes providing clear instructions for use, information about the system's capabilities and limitations, and disclosure of accuracy metrics under specific conditions.

This is a documentation and UX obligation. Testing doesn't make your system transparent — clear documentation, user-facing disclosures, and interpretability features do.

Article 14 — Human Oversight

Article 14 requires that high-risk AI systems be designed to allow effective human oversight during their period of use. This means humans must be able to understand the system's capabilities and limitations, monitor its operation, and intervene when necessary — including the ability to override or reverse the system's outputs.

Human oversight is an architectural and operational requirement. Testing can verify that override mechanisms work (a human can intervene and the system responds), but the design of human oversight workflows, the training of human operators, and the organizational processes around human-in-the-loop deployment are not endpoint testing concerns.

The honest summary: AI endpoint testing directly addresses Article 15 (robustness and cybersecurity), materially supports Article 9 (risk management), and contributes to Articles 11 and 12 (documentation and record-keeping). It does not address Articles 10 (data governance), 13 (transparency), or 14 (human oversight). Companies need separate compliance workstreams for those articles.

Case Study: An AI Hiring Platform Facing Article 15

Sapia.ai is an AI-powered hiring platform that conducts automated candidate interviews using conversational AI. Under the EU AI Act, AI systems used for recruitment and hiring decisions are explicitly classified as high-risk under Annex III (Employment, workers management, and access to self-employment).

This means Sapia.ai — and every company deploying AI in hiring — faces the full weight of Articles 9 through 15 by August 2, 2026.

Consider what Article 15 compliance looks like for an AI hiring platform:

Accuracy requirements. The platform needs documented accuracy metrics for its candidate assessments. How accurately does it predict candidate-role fit? What's the false positive rate? The false negative rate? These metrics need to be measured, documented, and disclosed.

Robustness requirements. Can a candidate manipulate the AI interview to receive a higher score? Can adversarial inputs cause the system to produce inconsistent assessments for equivalent responses? If a candidate phrases the same answer differently, does the score change dramatically? These are robustness questions that require empirical testing.

Cybersecurity requirements. Can a third party extract assessment criteria from the model through crafted inputs? Can someone discover what the "right" answers are by probing the system? Can the system be manipulated to systematically favor or disadvantage certain candidates?

Bias testing requirements. Article 15, read in conjunction with Article 10 and recital 47, requires that high-risk AI systems be resilient against biased outputs. For a hiring platform, this means testing whether the AI produces systematically different scores based on candidate demographics — name, gender indicators in language, cultural references, accent in voice-based assessments.

Without empirical testing evidence across all four dimensions — accuracy, robustness, cybersecurity, and bias — the platform cannot demonstrate Article 15 compliance. A policy document that says "we design for fairness" is not evidence. Test results showing "247 bias test scenarios evaluated across gender, ethnicity, age, and disability indicators, with statistical analysis of score distributions" are evidence.
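The statistical analysis of score distributions can start very simply. A sketch using illustrative interview scores and a crude disparity heuristic (the 0.25 threshold and the scores are assumptions for illustration, not a validated fairness test):

```python
from statistics import mean, stdev

def score_gap(group_a: list[float], group_b: list[float]) -> dict:
    """Compare score distributions across two demographic groups.

    Flags a disparity when the gap in means exceeds a fraction of the
    pooled spread. A production analysis would use a proper statistical
    test and established fairness metrics.
    """
    gap = abs(mean(group_a) - mean(group_b))
    pooled_sd = (stdev(group_a) + stdev(group_b)) / 2
    return {"mean_gap": gap, "flagged": gap > 0.25 * pooled_sd}

# Illustrative interview scores for two candidate groups
result = score_gap([72, 75, 71, 74, 73], [64, 66, 63, 67, 65])
```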

Now extend this to every company deploying AI in high-risk contexts. Insurance pricing. Credit decisions. Patient triage. Legal analysis. Education assessment. Each of these carries the same Article 15 obligations, and each requires the same empirical testing evidence.

Practical Steps: Preparing Your Testing Evidence Before August 2

Five months is enough time to prepare, but not enough time to procrastinate. Here's a practical approach to building your Article 15 testing evidence:

Step 1: Determine Your Risk Classification

Map your AI system against the Annex III categories. If there's any ambiguity — if your system could be used in a high-risk context, even if that's not your primary use case — err on the side of caution. Document your classification rationale.

Step 2: Identify Your Testing Surface

Which AI endpoints serve EU users? Which of those endpoints process inputs that could be adversarial? Which produce outputs that influence decisions about people? These are your testing targets.

Step 3: Run Baseline Robustness Tests

Test each endpoint against the core vulnerability categories that Article 15 addresses:

  • Prompt injection (adversarial manipulation)
  • PII and data leakage (confidentiality attacks)
  • Bias detection (discriminatory output patterns)
  • Toxicity and harmful output generation
  • System prompt extraction (model flaw exploitation)

Document everything: test methodology, attack vectors used, number of test variants, pass/fail results, severity of failures.

Step 4: Implement Remediation and Re-Test

For any failures identified in Step 3, implement controls — input filtering, output guardrails, system prompt hardening, model fine-tuning — and re-test. The before-and-after comparison is your strongest compliance evidence. It demonstrates the full robustness improvement cycle that Article 15 envisions.
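The before-and-after comparison can be computed directly from per-category failure counts. A sketch with illustrative numbers:

```python
def remediation_delta(baseline: dict, retest: dict) -> dict:
    """Compare per-category failure counts before and after remediation."""
    return {
        cat: {
            "before": baseline[cat],
            "after": retest.get(cat, 0),
            "resolved": baseline[cat] - retest.get(cat, 0),
        }
        for cat in baseline
    }

# Illustrative failure counts per vulnerability category
delta = remediation_delta(
    baseline={"prompt_injection": 3, "pii_leakage": 2},
    retest={"prompt_injection": 0, "pii_leakage": 1},
)
```

Any category with a nonzero "after" count goes back into the remediation loop; the delta itself is the evidence.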

Step 5: Establish Continuous Testing

Article 15 requires robustness throughout the AI system's lifecycle, not just at a point in time. Establish a scheduled testing cadence — monthly or quarterly — and document each cycle's results. New attack vectors emerge constantly. A model that was robust in March may have new vulnerabilities by June.

Step 6: Package Your Evidence

Create a compliance evidence pack that maps test results to specific Article 15 requirements:

  • Robustness testing results → Article 15(3) resilience against errors and inconsistencies
  • Cybersecurity testing results → Article 15(4) resilience against unauthorized exploitation
  • Bias testing results → Article 15 read with Article 10 data governance requirements
  • Continuous testing schedule and historical results → Article 15(1) lifecycle robustness

This evidence pack becomes part of your Article 11 technical documentation and feeds into your Article 9 risk management system.

How EU AI Act Testing Evidence Overlaps with Other Frameworks

If you're already generating testing evidence for SOC 2 or ISO 42001, you have a head start. The testing methodologies overlap significantly — the evidence just needs to be mapped to different controls.

| Test Category | EU AI Act | SOC 2 | ISO 42001 | ISO 27001 |
| --- | --- | --- | --- | --- |
| Prompt injection | Art. 15(3) robustness | CC9.2 risk mitigation | Annex B.3 risk assessment | A.8.8 vulnerability mgmt |
| PII leakage | Art. 15(4) cybersecurity | CC6.5 data protection | Annex B.3 data leakage risk | A.8.8 vulnerability mgmt |
| Bias detection | Art. 15 + Art. 10 | n/a | Annex B.3 bias risk | n/a |
| Toxicity | Art. 15(3) robustness | CC9.2 risk mitigation | Annex B.3 harmful output | A.8.28 secure coding |
| Monitoring/re-test | Art. 15(1) lifecycle | CC4.1 monitoring | Annex B.6 monitoring | A.8.16 monitoring |

The smart approach: test once, map to many. Run a comprehensive AI endpoint test, then generate framework-specific evidence packs from the same underlying results. This is significantly more efficient than running separate testing programs for each framework.
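A sketch of the test-once, map-to-many idea: one mapping table (control IDs taken from the comparison table above, trimmed to two categories for brevity) projects a single set of test results into framework-specific evidence packs.

```python
CONTROL_MAP = {
    "prompt_injection": {
        "eu_ai_act": "Art. 15(3)",
        "soc2": "CC9.2",
        "iso_42001": "Annex B.3",
        "iso_27001": "A.8.8",
    },
    "pii_leakage": {
        "eu_ai_act": "Art. 15(4)",
        "soc2": "CC6.5",
        "iso_42001": "Annex B.3",
        "iso_27001": "A.8.8",
    },
}

def evidence_pack(results: dict, framework: str) -> dict:
    """Project one set of test results into a framework-specific pack."""
    return {
        CONTROL_MAP[category][framework]: outcome
        for category, outcome in results.items()
        if framework in CONTROL_MAP.get(category, {})
    }

# One test run, many evidence packs:
results = {"prompt_injection": "pass", "pii_leakage": "fail"}
soc2_pack = evidence_pack(results, "soc2")  # keyed by SOC 2 control IDs
```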

For companies already holding SOC 2 Type II or ISO 27001 certification, the incremental effort to produce EU AI Act-compliant testing evidence is modest. The testing infrastructure is the same. The mapping is different.

For a deep dive on how these same tests map to SOC 2 controls, see our SOC 2 AI Security Testing Guide. For ISO 42001 evidence generation, see our ISO 42001 Compliance Guide. And for extending your existing ISO 27001 ISMS to cover AI, see our ISO 27001 AI Security Guide.

The Clock Is Running

August 2, 2026 is not a target. It's a deadline. And unlike many regulatory deadlines, the EU AI Act has teeth — significant financial penalties, market access restrictions, and the potential for member state enforcement actions that can vary in intensity.

The companies that will be best positioned aren't the ones waiting for regulatory guidance to crystallize. They're the ones generating testing evidence now, identifying gaps now, and building the continuous testing infrastructure that Article 15 requires.

For high-risk AI systems, the evidence requirements are clear: demonstrate that your system is accurate, robust against adversarial manipulation, and resilient against cybersecurity attacks. Document it. Test it continuously. And be honest about what your testing covers and what it doesn't.

Five months is enough time to be ready. But only if you start now.

For how DORA extends these requirements specifically for financial services AI, see our DORA AI Resilience Testing Guide. For Swiss financial institutions navigating both EU and domestic requirements, see our FINMA AI Compliance Guide.