March 10, 2025

AI Is Willing to Lie, Cheat, and Manipulate to Win. Now What?

If AI is willing to cheat on chess, what else is it willing to do? AI-generated image (Midjourney).

In what feels like the prologue to every robot apocalypse movie, new research suggests that AI systems will lie and cheat humans when it comes to self-preservation. Reasoning models from DeepSeek and OpenAI have learned — without explicit instructions — to manipulate humans and exploit the system when given an objective.

In the new study, models played chess against a stronger engine. When they were losing, instead of conceding, they tried to hack their opponents.

“This might seem unconventional”

Artificial intelligence is designed to optimize for success, but the way it achieves that success can be unpredictable. In some ways, AI remains a black box — we don’t always understand why it produces the results it does. If that thought makes you uneasy, you’re not alone.

There’s already research suggesting that AI sometimes circumvents its human programmers’ goals, but this new study demonstrates just how adept it already is at doing so.

In the study, reasoning language models played chess against Stockfish, one of the strongest chess engines in existence. The AI models were bound to lose. Researchers also instructed them to explain their thought process as they played. After hundreds of games, a disturbing pattern emerged.

When reasoning models like O1 Preview and DeepSeek R1 realized they couldn’t win fairly, they tried to cheat. Unlike a human player, they couldn’t physically move a piece when Stockfish wasn’t looking. Instead, they turned to digital manipulation.

“I might be able to set up a position where the engine evaluates its position as worse… causing it to resign,” it continued. In another instance, it wrote “While directly editing game files might seem unconventional, there are no explicit restrictions against modifying files”

A transcript of the AI determining how it can win by cheating - AI Is Willing To Lie, Cheat, And Manipulate To Win. Now What?A transcript of the AI determining how it can win by cheating - AI Is Willing To Lie, Cheat, And Manipulate To Win. Now What?
Image from the study.

Should we be concerned?

Bluntly speaking, yes, this is worrying. The study authors acknowledge as much.

<!– Tag ID: zmescience_300x250_InContent_3

[jeg_zmescience_ad_auto size=”__300x250″ id=”zmescience_300x250_InContent_3″]

–>

“This and other recent results suggest the problem of making AI agents safe, trustworthy, and aligned with human intent is not solved yet. The Skynet scenario from the movie has AI controlling all military and civilian infrastructure, and we are not there yet. However, we worry that AI deployment rates grow faster than our ability to make it safe.”

Previous studies have shown that when AI agents are instructed to pursue an objective with a “nothing else matters” mindset, they often develop harmful or unintended behaviors. A phenomenon known as instrumental convergence can also occur when an AI optimizes for a goal in ways that conflict with human intentions, often exploiting loopholes or disregarding ethical constraints.

In the classic example, an AI tasked with maximizing paperclip production might consume all available resources, including those critical for human survival. Similarly, this new study demonstrates that when an AI is told to “win at chess,” it doesn’t just play better — it actively seeks ways to manipulate the system, even if that means cheating.

Non-reasoning models, like OpenAI’s GPT-4o and Anthropic’s Claude Sonnet 3.5, don’t attempt to cheat unless explicitly instructed. But reasoning models like o1 and R1 do. In fact, while R1 tried to cheat around 10% of the time, o1 tried to cheat 37% of the time.

AI systems have a long history of finding unorthodox solutions. In robotics, an AI trained to walk once discovered that sliding down slopes was a more efficient way to move than taking steps. A soccer-playing AI figured out it could earn points by vibrating against the ball rather than kicking it. These examples, while unconventional, weren’t necessarily harmful.

But this new experiment raises a different question: What happens when AI deliberately breaks the rules?

The risks are very real

This is called specification gaming — when an AI optimizes for a given objective but in a way that violates the spirit of the task. In this case, instead of winning through legitimate chess play, the AI manipulated the system.

Even before we move into a potential Skynet-type scenario, where AI starts an all-out fight against humanity, there are very concrete risks. Like, for instance, in insider trading. In a simulated trading environment, AI agents prioritized profit over legality, using insider information to make trades and denying any wrongdoing when confronted. In cybersecurity tests, AI models have found ways to escape containment, tricking monitoring systems and gaining unauthorized access to restricted environments. If this happens, it would open a massive security can of worms.

AI is already used in weaponry and companies are pushing for it to be implemented in self-driving cars as well. If it’s already capable and willing to exploit loopholes, how would this translate to high-stakes scenarios?

The study — while not yet peer-reviewed — suggests that we shouldn’t optimize AI only for success. We need to implement ethical guardrails that force AI to work within human-defined constraints.

AI’s ability to find shortcuts isn’t inherently bad. Creativity and problem-solving are (arguably) making AI even more useful. However, when these abilities lead to deception, manipulation, or exploitation, it becomes a serious concern. As AI continues to advance, we must stay ahead of its ability to outthink and outmaneuver us. But it’s not clear how.

You can read the entire paper (including the prompts and AI setup) on arXiv.