    A better method for planning complex visual tasks | MIT News

    By InfoForTech, March 14, 2026

    MIT researchers have developed a generative artificial intelligence-driven approach for planning long-term visual tasks, like robot navigation, that is about twice as effective as some existing techniques.

    Their method uses a specialized vision-language model to perceive the scenario in an image and simulate actions needed to reach a goal. Then a second model translates those simulations into a standard programming language for planning problems, and refines the solution.

    In the end, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two-step system generated plans with an average success rate of about 70 percent, outperforming the best baseline methods, which reached only about 30 percent.

    Importantly, the system can solve new problems it hasn’t encountered before, making it well-suited for real environments where conditions can change at a moment’s notice.

    “Our framework combines the advantages of vision-language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open-access paper on this technique. “It can take a single image and move it through simulation and then to a reliable, long-horizon plan that could be useful in many real-life applications.”

    She is joined on the paper by Yongchao Chen, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and a principal investigator in LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.

    Tackling visual tasks

    For the past few years, Fan and her colleagues have studied the use of generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs.

    Many real-world planning problems, like robotic assembly and autonomous driving, have visual inputs that an LLM can’t handle well on its own. The researchers sought to expand into the visual domain by utilizing vision-language models (VLMs), powerful AI systems that can process images and text.

    But VLMs struggle to understand spatial relationships between objects in a scene and often fail to reason correctly over many steps. This makes it difficult to use VLMs for long-range planning.

    On the other hand, scientists have developed robust, formal planners that can generate effective long-horizon plans for complex situations. However, these software systems can’t process visual inputs and require expert knowledge to encode a problem into language the solver can understand.
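
    To make the idea of a formal planner concrete, the sketch below runs a breadth-first search over STRIPS-style actions, the kind of problem that planning languages encode. It is a toy illustration only, not the solver used in the research: the `Action` type, the `plan` function, and the three-room key-fetching domain are all assumptions made up for this example.

```python
from collections import deque
from typing import NamedTuple, Optional


class Action(NamedTuple):
    name: str
    preconditions: frozenset  # facts that must hold before the action
    add_effects: frozenset    # facts the action makes true
    del_effects: frozenset    # facts the action makes false


def make_action(name, pre, add, delete):
    """Convenience constructor that freezes the fact sets."""
    return Action(name, frozenset(pre), frozenset(add), frozenset(delete))


def plan(initial: frozenset, goal: frozenset, actions: list) -> Optional[list]:
    """Breadth-first forward search; returns a shortest plan or None."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for act in actions:
            if act.preconditions <= state:
                nxt = (state - act.del_effects) | act.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [act.name]))
    return None


# Hypothetical toy domain: rooms a-b-c in a row, with a key lying in room b.
ACTIONS = [
    make_action("move-a-b", {"at-a"}, {"at-b"}, {"at-a"}),
    make_action("move-b-a", {"at-b"}, {"at-a"}, {"at-b"}),
    make_action("move-b-c", {"at-b"}, {"at-c"}, {"at-b"}),
    make_action("move-c-b", {"at-c"}, {"at-b"}, {"at-c"}),
    make_action("pick-key", {"at-b", "key-at-b"}, {"has-key"}, {"key-at-b"}),
]
INITIAL = frozenset({"at-a", "key-at-b"})
GOAL = frozenset({"at-c", "has-key"})

print(plan(INITIAL, GOAL, ACTIONS))  # ['move-a-b', 'pick-key', 'move-b-c']
```

    Note that every fact, precondition, and effect here had to be written out by hand. That encoding step is the expert work the paragraph above refers to.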

    Fan and her team built an automatic planning system that takes the best of both methods. The system, called VLM-guided formal planning (VLMFP), utilizes two specialized VLMs that work together to turn visual planning problems into ready-to-use files for formal planning software.

    The researchers first carefully trained a small model they call SimVLM to specialize in describing the scenario in an image using natural language and simulating a sequence of actions in that scenario. Then a much larger model, which they call GenVLM, uses the description from SimVLM to generate a set of initial files in a formal planning language known as the Planning Domain Definition Language (PDDL).

    The files are ready to be fed into a classical PDDL solver, which computes a step-by-step plan to solve the task. GenVLM compares the results of the solver with those of the simulator and iteratively refines the PDDL files.
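
    The generate, solve, compare, refine cycle described above can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the article does not specify how the comparison is made, what feedback is passed back, or how many rounds are allowed, so `refine_until_consistent`, its parameters, and the three toy stand-ins for the generator, solver, and simulator are all hypothetical.

```python
from typing import Callable, Optional


def refine_until_consistent(
    generate: Callable[[str, Optional[str]], str],  # stands in for the generator model
    solve: Callable[[str], Optional[list]],         # stands in for a classical solver
    simulate: Callable[[list], bool],               # stands in for the simulator model
    description: str,
    max_rounds: int = 5,                            # assumed limit, not from the article
) -> Optional[list]:
    """Regenerate planning files until the solver's plan passes simulation."""
    feedback = None
    for _ in range(max_rounds):
        files = generate(description, feedback)
        candidate = solve(files)
        if candidate is None:
            feedback = "solver found no plan"
            continue
        if simulate(candidate):
            return candidate  # solver and simulator agree the goal is reached
        feedback = f"simulator rejected plan: {candidate}"
    return None


# Toy stand-ins: the first draft forgets that a door must be opened.
def toy_generate(description, feedback):
    return "draft-with-door" if feedback else "draft-no-door"


def toy_solve(files):
    if files == "draft-with-door":
        return ["open-door", "move-through"]
    return ["move-through"]


def toy_simulate(candidate):
    return candidate == ["open-door", "move-through"]


result = refine_until_consistent(
    toy_generate, toy_solve, toy_simulate, "robot facing a closed door"
)
print(result)  # ['open-door', 'move-through']
```

    In this toy run the first draft produces a plan the simulator rejects, the rejection is passed back as feedback, and the second draft yields a plan both components accept.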

    “The generator and simulator work together to be able to reach the exact same result, which is an action simulation that achieves the goal,” Hao says.

    Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide range of problems. This existing knowledge enables the model to generate accurate PDDL files.

    A flexible approach

    VLMFP generates two separate PDDL files. The first is a domain file that defines the environment, valid actions, and domain rules. The second is a problem file that defines the initial state and the goal of the particular problem at hand.

    “One advantage of PDDL is the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains.
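
    As a rough illustration of that split, the sketch below pairs one domain definition with two different problem instances. The `grid-nav` domain, its predicates, and the `make_problem` helper are hypothetical examples written for this article, not files produced by VLMFP.

```python
# One shared domain: a single "move" action over grid cells.
DOMAIN = """(define (domain grid-nav)
  (:requirements :strips)
  (:predicates (at ?c) (adjacent ?a ?b) (free ?c))
  (:action move
    :parameters (?from ?to)
    :precondition (and (at ?from) (adjacent ?from ?to) (free ?to))
    :effect (and (at ?to) (not (at ?from)))))
"""


def make_problem(name, cells, start, goal, edges, walls=()):
    """Build a problem file for one instance of the grid-nav domain."""
    facts = [f"(at {start})"]
    facts += [f"(free {c})" for c in cells if c not in walls]
    for a, b in edges:
        # Adjacency is symmetric, so state it in both directions.
        facts += [f"(adjacent {a} {b})", f"(adjacent {b} {a})"]
    return (
        f"(define (problem {name})\n"
        f"  (:domain grid-nav)\n"
        f"  (:objects {' '.join(cells)})\n"
        f"  (:init {' '.join(facts)})\n"
        f"  (:goal (at {goal})))\n"
    )


# Two instances with different layouts reuse the same DOMAIN unchanged.
p1 = make_problem("corridor", ["c1", "c2", "c3"], "c1", "c3",
                  [("c1", "c2"), ("c2", "c3")])
p2 = make_problem("detour", ["c1", "c2", "c3", "c4"], "c1", "c3",
                  [("c1", "c2"), ("c2", "c3"), ("c1", "c4"), ("c4", "c3")],
                  walls=["c2"])

print(DOMAIN)
print(p1)
print(p2)
```

    Only the problem text changes between the two instances. A new layout in the same environment needs a new problem file but no change to the rules.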

    To enable the system to generalize effectively, the researchers needed to carefully design just enough training data for SimVLM so the model learned to understand the problem and goal without memorizing patterns in the scenario. When tested, SimVLM successfully described the scenario, simulated actions, and detected if the goal was reached in about 85 percent of experiments.

    Overall, the VLMFP framework achieved a success rate of about 60 percent on six 2D planning tasks and greater than 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly. It also generated valid plans for more than 50 percent of scenarios it hadn’t seen before, far outpacing the baseline methods.

    “Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many types of visual-based planning problems,” Fan adds.

    In the future, the researchers want to enable VLMFP to handle more complex scenarios and explore methods to identify and mitigate hallucinations by the VLMs.

    “In the long term, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing visual-based planning into the picture, this work is an important piece of the puzzle,” Fan says.

    This work was funded, in part, by the MIT-IBM Watson AI Lab.
