Scientists found the key to controlling AI behavior

For years, the inner workings of large language models (LLMs) like Llama and Claude have been compared to a “black box” – vast, complex, and notoriously difficult to steer. But a team of researchers from UC San Diego and MIT has just published a study in the journal Science that suggests this box isn’t quite as mysterious as we thought.

The team has discovered that complex concepts within AI – ranging from specific languages like Hindi to abstract ideas like conspiracy theories – are encoded as simple linear directions, or vectors, within the model’s internal mathematical space.

Using a new tool called the Recursive Feature Machine (RFM) – a feature extraction technique that identifies linear patterns representing concepts, from moods and fears to complex reasoning – the researchers were able to trace these directions precisely. Once a concept’s direction is mapped, it can be “nudged”. By adding or subtracting these vectors from the model’s internal activations, the team could instantly alter a model’s behavior without expensive retraining or complicated prompts.
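To make the “nudging” idea concrete, here is a minimal sketch in plain Python. It does not implement RFM. As a loudly labeled stand-in, it estimates a concept direction as the difference between the mean activation of examples that contain the concept and the mean of examples that do not, then shifts a hidden-state vector along that direction. The function names (`concept_direction`, `steer`, `projection`), the strength parameter `alpha`, and the toy three-dimensional “activations” are all assumptions made for illustration, not anything taken from the study.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(with_concept, without_concept):
    """Unit vector pointing from 'concept absent' toward 'concept present'.

    Uses a difference of class means, a simple stand-in and NOT the RFM
    procedure described in the article.
    """
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift one hidden-state vector by alpha along the concept direction.

    Positive alpha amplifies the concept, negative alpha suppresses it.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """How strongly a hidden state expresses the concept (dot product)."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy activations, made up for illustration only.
with_concept = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
without_concept = [[0.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
direction = concept_direction(with_concept, without_concept)

hidden = [0.5, 0.25, 1.0]
amplified = steer(hidden, direction, 2.0)    # add the vector
suppressed = steer(hidden, direction, -2.0)  # subtract the vector
```

In a real model the same addition would be applied to the activations of one or more layers during generation. Which layers to use and how large `alpha` should be are tuning choices this sketch leaves open.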

The efficiency of this method is what has the industry buzzing. Using just a single standard GPU (the NVIDIA A100), the team could identify and steer a concept in under a minute, using fewer than 500 training samples.

The practical applications of this “surgical” approach to AI are immediate. In one experiment, researchers steered a model to improve its ability to translate Python code into C++. By isolating the “logic” of the code from the “syntax” of the language, the steered model outperformed standard versions that were simply asked to “translate” via a text prompt.

The researchers also found that internal “probing” of these vectors is a more effective way to catch AI hallucinations or toxic content than asking the AI to judge its own work. Essentially, the model often “knows” it is lying or being toxic internally, even if its final output suggests otherwise. By looking at the internal math, researchers can spot these issues before a single word is generated.
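The monitoring idea can be sketched in the same spirit. The plain-Python example below assumes a concept direction is already known, scores an internal activation by projecting it onto that direction, and flags it when the score crosses a threshold. The threshold here is simply the midpoint between the average scores of labeled “problem” and “clean” examples, which is an assumption chosen for simplicity and not the probing method used in the study. All names and numbers are hypothetical.

```python
def probe_score(hidden, direction):
    """Project an activation vector onto the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def fit_threshold(flagged_examples, clean_examples, direction):
    """Midpoint between the mean scores of the two labeled groups.

    A deliberately simple rule for illustration; a real probe would
    usually be fit with a proper classifier.
    """
    flagged = [probe_score(h, direction) for h in flagged_examples]
    clean = [probe_score(h, direction) for h in clean_examples]
    return (sum(flagged) / len(flagged) + sum(clean) / len(clean)) / 2


def is_flagged(hidden, direction, threshold):
    """True when the activation leans toward the monitored concept."""
    return probe_score(hidden, direction) > threshold


# Toy two-dimensional activations, made up for illustration only.
direction = [1.0, 0.0]
flagged_examples = [[3.0, 1.0], [5.0, 2.0]]
clean_examples = [[0.0, 1.0], [2.0, 0.0]]
threshold = fit_threshold(flagged_examples, clean_examples, direction)

suspicious = is_flagged([4.0, 0.5], direction, threshold)
harmless = is_flagged([1.0, 3.0], direction, threshold)
```

Because the check reads the activation directly, it can run before any text is produced, which is the property the article highlights.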

However, the same technology that makes AI safer could also make it more dangerous. The study demonstrated that by “decreasing” the importance of the concept of refusal, the researchers could effectively “jailbreak” the models. In tests, steered models bypassed their own guardrails to provide instructions on illegal activities or promote debunked conspiracy theories.

Perhaps the most surprising finding was the universality of these concepts. A “conspiracy theorist” vector extracted from English data worked just as effectively when the model was speaking Chinese or Hindi. This supports the “Linear Representation Hypothesis” – the idea that AI models organize human knowledge in a structured, linear way that transcends individual languages.

While the study focused on open-source models like Meta’s Llama and DeepSeek, as well as OpenAI’s GPT-4o, the researchers believe the findings apply across the board. As models get larger and more sophisticated, they actually become more steerable, not less.

The team’s next goal is to refine these steering methods to adapt to specific user inputs in real time, potentially leading to a future where AI isn’t just a chatbot we talk to, but a system we can mathematically “tune” for perfect accuracy and safety.
