On Augmenting Existing LLMs with New Modalities

If you, like me, frequently use various AI models for different tasks, you are familiar with the problem of a model lacking the required modality.
For example, Claude is excellent at coding, but cannot process audio directly.

Or you may want to load a video into a model’s “thinking mode,” only to find it supports text alone. This can be extremely frustrating and disrupt the workflow, forcing you to search for alternatives or even convert the data manually.

This exact issue is addressed by researchers from the University of California and the University of Wisconsin-Madison, together with Adobe Research, who present the X-Fusion framework in their paper published on April 29.

Taking Meta’s Llama models as an example (the researchers used Llama-3.1-8B), a model that is not vision-capable out of the box (i.e., cannot perform image-to-text or text-to-image tasks) gains entirely new functionality without the complex process of sourcing new data, segmenting and preparing datasets, or monitoring training.

The X-Fusion architecture allows adding multiple modalities at once by freezing the language layers and attaching vision modules in a dual-tower structure, and it can be applied to any open-weights model (e.g., from Hugging Face). Over 20,000 training steps, the researchers added capabilities for image editing, object segmentation, and object replacement, all within a single fine-tuning cycle.
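To make the dual-tower idea concrete, here is a minimal PyTorch sketch, not the authors’ implementation: the class names (DualTowerBlock, XFusionSketch), the dimensions, and the simplified joint attention (both towers attending over the concatenated text and image tokens) are illustrative assumptions. The point it demonstrates is that the text tower’s weights stay frozen while only the vision-side parameters receive gradients.

```python
import torch
import torch.nn as nn


class DualTowerBlock(nn.Module):
    """One layer of the dual-tower stack: a frozen text block paired with a
    trainable vision block. Both towers attend over the concatenated
    text + image token sequence (a simplified stand-in for joint attention)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Frozen tower: stands in for one pretrained LLM transformer layer.
        self.text_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        for p in self.text_block.parameters():
            p.requires_grad = False  # language weights stay untouched
        # Trainable tower: the newly attached vision layer.
        self.vision_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, text_h: torch.Tensor, vision_h: torch.Tensor):
        t_len = text_h.shape[1]
        joint = torch.cat([text_h, vision_h], dim=1)       # shared token sequence
        text_out = self.text_block(joint)[:, :t_len]       # frozen text pass
        vision_out = self.vision_block(joint)[:, t_len:]   # trainable vision pass
        return text_out, vision_out


class XFusionSketch(nn.Module):
    """Toy dual-tower stack: only the vision-side parameters are trainable."""

    def __init__(self, n_layers=4, d_model=256, n_heads=4, vocab=32000, patch_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.tok_emb.weight.requires_grad = False          # frozen, like the LLM layers
        self.patch_proj = nn.Linear(patch_dim, d_model)    # trainable image-patch projection
        self.layers = nn.ModuleList(
            [DualTowerBlock(d_model, n_heads) for _ in range(n_layers)]
        )

    def forward(self, token_ids: torch.Tensor, patches: torch.Tensor):
        text_h, vision_h = self.tok_emb(token_ids), self.patch_proj(patches)
        for layer in self.layers:
            text_h, vision_h = layer(text_h, vision_h)
        return text_h, vision_h


model = XFusionSketch()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable (vision) params: {trainable:,} of {total:,} total")
```

Because the text tower is frozen, gradients flow only through the vision tower, so the LLM’s original text behavior is preserved while the new modality is learned on top of it.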

This approach aims to streamline and reduce the cost of developing new multimodal models by enabling the retroactive integration of modalities (vision, audio processing, video handling, or other functionality) into pre-existing models. Previously that was not possible: a model was either multimodal or it was not, with no flexibility. Now you could start from a powerful text-only base model and, once it produces the results you want, add new modalities later without investing resources in dedicated training for images, video, or audio.

[Image: visual demonstration of the added vision capabilities]