Teaching AI agents to ask better questions by playing “Battleship”

In 2026, the hype for artificial intelligence agents is louder than ever before. These semi-autonomous programs can “think” and execute well-defined tasks in areas like customer service and software development, typically using language models (LMs). But fields like medical diagnosis and scientific discovery require them to inquire about a vast range of solutions in uncertain environments, which LMs struggle with.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences (SEAS) peered deeper into LMs to understand their main issues in high-stakes settings. Their test: “Battleship,” a classic guessing game that’s helped cognitive scientists study how humans seek information.

CSAIL and SEAS scholars added a twist by reframing the game around asking and answering natural language questions. In their “Collaborative Battleship” game, one participant is a “captain” who inquires about where hidden ships are, while their teammate plays the “spotter” by responding to those questions in real-time.

The researchers first had over 40 humans play the game together, collecting their questions and yes-no answers to build the “BattleshipQA” dataset. These results were a helpful point of comparison when the team tested state-of-the-art LMs (like GPT-5) and smaller models (like Llama 4 Scout) on their game. Without training the models beforehand, they found that top LMs can “beat” humans at “Battleship” — that is, complete the game in fewer turns — but smaller systems are far less rational.

The chief issue was that many models are simply not adept at coming up with useful questions. To get LMs to inquire in ways that reveal more information about hidden ships, the researchers gave each model a Monte Carlo inference strategy, which carefully measures the likelihood of different options being correct with each response. The result: AI models that can beat regular players at “Battleship,” regardless of scale.

Perhaps the most striking results were Llama 4 Scout’s gains. As a relatively small LM, it only beat humans 8 percent of the time. But with refinements to its inference strategy, the model reached a “Battleship” win rate of 82 percent versus humans. This careful and efficient style of asking questions also enabled the model to outpace a frontier model (GPT-5), while operating at around 1 percent of its cost.

On top of this improvement, the researchers shrank the gap between humans and LMs in answering questions. While GPT-5 was a reliable spotter that helped models finish games faster, smaller systems had a bad habit of giving the wrong answers about where ships were hidden. The models saw an accuracy boost of 15 percent on average when they began converting questions into code that explicitly tells them how to verify their answers (for example, having the model run a quick search of an area when asked if a ship was there).

“Today’s language models are primarily optimized to answer complex queries, but it’s less clear whether they learn to ask good questions for themselves,” says MIT PhD student and CSAIL researcher Gabriel Grand SM ’23, who is a lead author on a paper about the work. “Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a ‘world model,’ they ask better questions and make discoveries more efficiently.”

A sea change for LMs

The team’s first focus was getting LMs to ask better questions. By implementing Monte Carlo inference strategies, the LMs reason about potential guesses as individual particles. The ones that appear more valid with each answer from the spotter would be weighted more heavily, sort of like game balls that inflate or deflate each turn. With this more calculated, adaptive approach, the captain could make inquiries that extracted considerably more info from the spotter.

The scientists then turned to the widely used programming language Python to help out AI spotters. Each question the captain asked was automatically converted into an encoded command. For example, a question like, “Is there a ship in column one that spans two rows?” turns into instructions for the spotter LM to search the area in question and assess how wide the digital game piece is. By giving the model clear directions in a language it understands particularly well, each system gave correct answers considerably more often. The lightweight system GPT-4o-mini saw a nearly 30 percent performance bump, for instance, and even the large model Claude 4 Opus jumped about eight points.

“The field has seen a lot of success from ‘auto-formalization’ strategies, in which LMs generate code to verify their solutions,” says senior author Jacob Andreas, an MIT electrical engineering and computer science associate professor and CSAIL principal investigator. “What I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs’ exploration and information gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem-solving.”

Let’s play something else

But how would this approach fare in other board games? The team tested their newly equipped LMs at “Guess Who?”, where large and small models skillfully whittled down 100 options to correctly guess which hidden character had been chosen. Llama 4 Scout was successful 30 percent of the time, but after Grand and his colleagues’ tweaks, it completed the task on over 72 percent of its runs. Meanwhile, GPT-4o leapt from 62 percent to 90 percent. GPT-5 was the spotter in each game to ensure questions were answered as accurately as possible.

While LMs have made promising progress in both games, there’s room for improvement. For instance, the models still struggle to answer complex questions, compared to humans. OpenAI researcher, recent Harvard graduate, and coauthor Valerio Pepe adds that “GPT-5 can beat your average ‘Battleship’ player, and gets a hair better with our methods. However, expert players are still hard to beat for all models, unlike in chess, where even top players don’t succeed against AI systems.”

The researchers’ findings show that AI agents have untapped potential in “needle-in-a-haystack” discovery — navigating a massive space of options to find a rare solution to scientific challenges. While improved information-seeking skills would make them excellent research assistants with, say, identifying a compound’s molecular structure, the researchers caution that “Collaborative Battleship” is a somewhat simple test bed. They’d like to test LMs in more complex settings, where the systems have to consider far more options.

Grand also plans to have humans and AI models collaborate to study whether they work better together. The models might also benefit from a bit of fine-tuning on game simulations, and with more computing power, LMs would have more advanced inference capabilities to predict how a game will evolve.

“As AI systems become more agentic, the hardest problems turn out to be social ones: tracking common ground, resolving misunderstandings, and adapting to different partners over time,” says Robert Hawkins, assistant professor of linguistics at Stanford University, who wasn’t involved in the paper. “This work elegantly captures these phenomena in a controlled collaborative setting, and makes a compelling case that the real bottleneck for AI agents isn’t just the calculation of optimal questions, but the pragmatic reasoning needed to make the most of their answers.”

Grand and Pepe wrote the paper with two CSAIL principal investigators: MIT Associate Professor Jacob Andreas and MIT Professor Joshua Tenenbaum. Their work was supported, in part, by the MIT Siegel Family Quest for Intelligence, the MIT-IBM Watson AI Lab, the FinTechAI@CSAIL initiative, a Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation. They showcased their paper as an oral presentation at the International Conference on Learning Representations (ICLR) in April.

What's Hot

Home Batteries: How They’re Installed and How Much They Cost

NASA Is Testing A Rover That Can Drive Faster And Lift Its Wheels To Climb Obstacles

AI, user data and the asymmetry of understanding

Teaching AI agents to ask better questions by playing “Battleship” | MIT News

A better way to model the behavior of metal alloys | MIT News

This Is What B2B Marketers Need to Know About the Future of Work

MIT in the media: For the future of tech, “Massachusetts can absolutely lead” | MIT News

In game theory, generalists sometimes win out over specialists | MIT News

The Best EDB to PST Conversion

Could AI tell you where you left your keys? | MIT News

DoJ Disrupts 3 Million-Device IoT Botnets Behind Record 31.4 Tbps Global DDoS Attacks

Microsoft is bringing an AI helper to Xbox consoles

This is the tech that makes Volvo’s latest EV a major step forward

Why Security Validation Is Becoming Agentic

Most Popular

DoJ Disrupts 3 Million-Device IoT Botnets Behind Record 31.4 Tbps Global DDoS Attacks

Microsoft is bringing an AI helper to Xbox consoles

This is the tech that makes Volvo’s latest EV a major step forward

Subscribe to Updates

What's Hot

Teaching AI agents to ask better questions by playing “Battleship” | MIT News

Related Posts