Adversarial AI

Types of adversarial attack

Adversarial attacks fall into distinct categories based on what the attacker targets and when they act:

Evasion attacks (inference time)
The attacker modifies inputs to fool a deployed model without changing the model itself. Examples include adding imperceptible noise to an image so a classifier misidentifies it, or altering a malware binary just enough to bypass an AI-based detection system.

Data poisoning (training time)
The attacker corrupts the training data used to build the model. If a fraud detection model is trained on data that includes carefully crafted false negatives, the deployed model will fail to flag those specific fraud patterns. Poisoning is harder to detect than evasion because the model appears to function normally until it encounters the specific scenarios the attacker designed.

Prompt injection (LLM-specific)
Attackers craft inputs that override an LLM’s instructions, causing it to ignore its safety guidelines, leak system prompts, exfiltrate data, or perform unintended actions. Prompt injection is the dominant attack vector against LLM-based applications and chatbots. It is particularly concerning for organisations deploying customer-facing AI assistants or using LLMs to process sensitive documents.

Model extraction
The attacker queries a deployed model repeatedly to reconstruct a copy of it. Once extracted, the attacker can study the model offline to find evasion techniques, steal proprietary intellectual property, or build a competing product. API-accessible models with no rate limiting are most vulnerable.

Model inversion
The attacker uses the model’s outputs to infer details about the training data. If a model was trained on medical records, customer data, or classified documents, a model inversion attack can reconstruct elements of that data. A privacy breach achieved through the model’s own responses.

Why adversarial AI matters for security teams

AI systems fail differently from traditional software. A firewall either blocks traffic or it doesn’t. An AI classifier operates on probabilities, and adversarial inputs exploit the gap between ‘correct most of the time’ and ‘correct when an attacker is actively trying to fool it.’

Three factors make this urgent in 2026:

AI in security tooling – SIEM, EDR, and fraud detection systems rely on ML classifiers. Adversarial evasion means malware, phishing, and fraud can be designed to bypass these tools specifically
LLMs processing sensitive data – organisations deploying GPT-based tools, Copilot, or internal chatbots are exposed to prompt injection unless guardrails are tested

Regulatory pressure – the EU AI Act and ISO 42001 both require organisations to assess and mitigate AI-specific risks, including adversarial robustness

Defending against adversarial AI

Defence is layered, matching each attack type:

Adversarial testing – red-team AI systems before deployment using techniques from the attacker’s playbook. Test classifiers with adversarial inputs, test LLMs with prompt injection attempts, test training pipelines for poisoning vulnerabilities
Input validation – sanitise and validate all inputs to AI systems, particularly LLMs. Treat every user input as potentially adversarial, the same principle as input validation in web application security
Model monitoring – track model behaviour in production for drift, unexpected output distributions, or anomalous query patterns that might indicate extraction or evasion attempts
Training data integrity – secure the data pipeline. Verify data provenance, audit for anomalies, and maintain clean reference datasets for comparison
Robustness training – train models with adversarial examples included in the dataset so the model learns to handle them.

Cyberfort and adversarial AI

We test AI systems for adversarial vulnerabilities. Our penetration testing includes AI/LLM testing. Assessing whether your AI tools, LLM models, and data pipelines can withstand adversarial inputs, prompt injection, and data exfiltration attempts. Our ISO 42001 consultancy helps organisations build the governance framework that regulators now require for AI systems. Assess your AI security posture →

Related glossary terms

ISO 42001 – the international standard for AI management systems, requiring adversarial risk assessment
EU AI Act – EU regulation mandating AI risk management, including adversarial robustness for high-risk systems
LLM Security Testing – testing large language models specifically for prompt injection and data leakage
MITRE ATT&CK – the adversary tactics framework, increasingly incorporating AI-specific attack techniques

External references

Wikipedia: Adversarial machine learning – technical background and research history
Wikidata: Q20312394 – canonical entity identifier
NIST AI Risk Management Framework – US framework for AI risk management including adversarial threats
OWASP Top 10 for LLM Applications – the industry standard vulnerability list for LLM systems

Frequently asked questions

What is the difference between adversarial AI and traditional hacking?

Traditional hacking exploits software vulnerabilities – buffer overflows, SQL injection, misconfigured permissions. Adversarial AI exploits the mathematical properties of machine learning models. The target is the model’s decision boundary, not the server it runs on. An adversarial attack can succeed even when every traditional security control is in place and working correctly.

Can adversarial attacks affect any AI system?

In principle, yes. Any machine learning model that accepts external input is potentially vulnerable to adversarial manipulation. In practice, the risk varies by system type. Computer vision classifiers are vulnerable to evasion attacks. LLMs are vulnerable to prompt injection. Recommendation systems are vulnerable to data poisoning. The attack surface depends on the model architecture, deployment context, and input controls.

How do you test AI systems for adversarial vulnerabilities?

AI red teaming applies the same principles as traditional penetration testing, adapted for AI: attempt evasion attacks against classifiers, run prompt injection campaigns against LLMs, test training pipelines for poisoning vectors, and attempt model extraction via API queries. The goal is to find vulnerabilities before attackers do, then implement defences – input validation, robustness training, monitoring, and guardrails.

Awards and Accreditations

Contact Us

Cyberfort Ltd
Venture West,
Greenham Business Park, Thatcham,
Berkshire,
RG19 6HX

+44 (0)1304 814800

[email protected]