By Asher Hancock

    The promise of vision-language-action models is straightforward: Take a foundation model that understands the world and teach it to act in it. The challenge is that learning to act often erases that understanding.

    Vision-language models (VLMs) are trained on massive image–text datasets and develop broad multimodal knowledge. They can recognize objects, interpret scenes, and follow instructions in multiple languages. Naturally, they’ve become a promising foundation for robot learning.

    But when researchers fine-tune a VLM to control a robot, turning it into a vision-language-action model (VLA), performance on visual reasoning, multilingual tasks, and open-world queries often degrades. This phenomenon, known as catastrophic forgetting, has become a central obstacle to adapting foundation models for embodied use.

    In Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting (Hancock, Wu, Zha, Russakovsky, Majumdar; 2025), which is being presented at ICLR this week, researchers at Princeton University propose a simple shift in perspective: Instead of changing the model to accommodate robot actions, change how actions are represented.


    The Action Representation Mismatch

    Distribution of action probabilities under Gemma-3-12B-IT before fine-tuning on robot teleoperation data. The model assigns significantly higher log-probabilities to actions represented as natural language than to actions encoded via tokenizer modifications (e.g., reassigning the least likely tokens).

    VLMs are pretrained to reason and answer in natural language. Robot policies, however, must output continuous motor commands — numerical vectors describing movement and gripper state. Most existing VLAs bridge this gap by assigning special tokens to represent actions or by adding separate action-generation modules. Both approaches introduce a distribution shift between what the original VLM learned during internet-scale pre-training and what it sees during robotics fine-tuning.
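    The mismatch can be made concrete with a toy sketch (illustrative only, not the paper's code): the same end-effector action encoded as reserved tokens versus as natural language. Only the latter lies in the VLM's pretraining distribution. The binning scheme and token names below are assumptions for illustration.

```python
# Illustrative sketch of the representation mismatch (not the paper's code).
action = [0.05, -0.02, -0.10, 0.0, 0.0, 0.0, 1.0]  # dx, dy, dz, rotation deltas, gripper

def to_special_tokens(a, bins=256, low=-0.25, high=0.25):
    """Common VLA scheme: uniformly bin each action dimension, then map each
    bin to a reserved token the VLM never saw during pretraining."""
    ids = [min(bins - 1, max(0, int((x - low) / (high - low) * bins))) for x in a]
    return " ".join(f"<act_{i}>" for i in ids)

as_tokens = to_special_tokens(action)
as_language = ("move forward and slightly left, then move significantly "
               "downward before closing the gripper")

print(as_tokens)    # out-of-distribution reserved tokens
print(as_language)  # in-distribution natural language
```

    Fine-tuning on the first representation forces the model to reassign meaning to tokens it never used in pretraining; the second stays within the distribution the model already speaks.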


    Actions as Language

    We present VLM2VLA, a data pipeline and training methodology for fine-tuning VLMs into VLAs while preserving their foundational perceptual and reasoning capabilities. Our policy retains its pretraining knowledge, enabling strong VQA performance and superior generalization in real robotic manipulation tasks.

    Our work, VLM2VLA, takes a different route: Express robot actions directly in natural language. Instead of predicting a high-dimensional motor vector, the model generates text such as:

    “To complete the task, the robot must move forward and slightly left, then move significantly downward before closing the gripper to grasp the object,”

    before producing the corresponding low-level commands to physically control the robot.

    Because this representation lives in language space, which the model is already familiar with, fine-tuning can be done using LoRA, a parameter-efficient method that updates small low-rank weight matrices instead of modifying the full network. The model learns to act without overwriting what it already knows.
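    A minimal NumPy sketch of the LoRA idea (dimensions and scaling are illustrative assumptions, not the paper's configuration): the pretrained weight W stays frozen, and only two small low-rank factors train. Because the up-projection B is initialized to zero, the adapted model starts out exactly equal to the base model, which is why pretrained behavior is preserved at the outset.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 32, 4, 8  # illustrative sizes; r << d_in, d_out

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base output plus the scaled low-rank update (standard alpha / r scaling).
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B at zero, the LoRA model reproduces the base model exactly.
assert np.allclose(lora_forward(x), W @ x)
```

    Training then updates only A and B (2 * r * d parameters per layer rather than d_out * d_in), which is what keeps the fine-tune from overwriting the frozen pretrained weights.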

    Control is structured hierarchically: The model predicts a subtask, describes a spatial plan in language, and then generates the low-level action — all as text. The team automatically relabels existing robot trajectories into this format, converting demonstrations into language-aligned training data.
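    The relabeling step can be sketched as follows. This is a hypothetical reconstruction: the field names, units, and phrasing are assumptions for illustration, not the paper's exact schema. It turns a numeric end-effector delta into the subtask / spatial plan / low-level action hierarchy, all rendered as text.

```python
# Hypothetical sketch of relabeling a trajectory step into the hierarchical,
# all-text format described above (illustrative schema, not the paper's code).
def relabel_step(instruction, delta):
    dx, dy, dz, grip = delta  # end-effector deltas in meters, gripper in {0, 1}
    plan = (f"move {'forward' if dx >= 0 else 'backward'} {abs(dx)*100:.0f} cm, "
            f"{'left' if dy < 0 else 'right'} {abs(dy)*100:.0f} cm, "
            f"{'down' if dz < 0 else 'up'} {abs(dz)*100:.0f} cm, "
            f"then {'close' if grip > 0.5 else 'open'} the gripper")
    return (f"Subtask: {instruction}\n"
            f"Plan: {plan}\n"
            f"Action: {dx:.2f} {dy:.2f} {dz:.2f} {grip:.0f}")

print(relabel_step("grasp the carrot", (0.05, -0.02, -0.10, 1.0)))
```

    Applied across a demonstration dataset, a relabeler like this converts raw teleoperation trajectories into supervised text targets, so the fine-tune looks like ordinary language modeling rather than a new output head.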


    What Happens in Practice?

    Across twelve multimodal understanding benchmarks, VLM2VLA retains over 85% of the base model’s performance. In contrast, conventional VLAs show substantial drops after fine-tuning.

    On more than 800 real-world trials with a 6-DoF robotic arm, VLM2VLA matches baseline performance on standard manipulation tasks like picking up and placing objects.

    The real payoff appears in out-of-distribution settings — tasks the robot was never trained on.

    Multilingual instruction following. The robot was asked to pick up a carrot using commands in Spanish (recoger la zanahoria), Mandarin (拿起胡萝卜), and Hindi (गाजर उठाओ). VLM2VLA significantly outperformed all baselines, correctly translating the instruction and identifying the target object among distractors.

    A qualitative demonstration of VLM2VLA’s zero-shot multilingual capabilities. Given the language instruction in Hindi (“pick up the carrot”), our model identifies the correct object amidst distractors (eggplant and banana), demonstrating a genuine understanding of the task.

    Open-world semantic reasoning. When instructed to “pick up the item above Ash Ketchum,” the system had to recognize Ash Ketchum, the well-known Pokémon character, reason about spatial relationships, and manipulate the correct object. VLM2VLA achieved a 60% success rate; baselines performed near zero.

    Preserving pretrained knowledge directly translates to stronger embodied generalization.

    Comparative evaluation of VLA performance on in-distribution (ID) and out-of-distribution (OOD) robotic manipulation tasks. VLM2VLA maintains high success rates on OOD tasks, highlighting its superior generalization capabilities. Each bar corresponds to an average over thirty trials, except for the ‘Pick Up -T’ task, where each bar corresponds to an average over ninety trials.

    Does the Representation Itself Matter?

    To isolate the role of language-based action representation, the team trained an otherwise identical model that encodes actions using low-likelihood reserved tokens instead of natural language. Both models use LoRA and train on the same data.

    On simple tasks, performance is similar. But as reasoning demands increase, the language-based model pulls ahead. On the Ash Ketchum task, it achieves roughly twice the success rate of its token-based counterpart — suggesting that representation choice itself plays a key role in connecting world knowledge to physical action.


    Language as a Unified Representation

    Rather than introducing new architectures or sophisticated training pipelines, VLM2VLA shows that a representational shift in the data can be enough. By describing robot data in natural language, it becomes possible to add control capability without sacrificing multimodal understanding. More broadly, representing actions as language opens the possibility of seamlessly mixing robot interaction data with standard VLM corpora — enabling models that reason, communicate, and act within a unified representation space.


    Curious to learn more? You can read the full paper on arXiv and visit the project website for videos.
