Microsoft Researchers Present InstructDiffusion: A Unifying and Generic AI Framework for Aligning Computer Vision Tasks with Human Instructions

    2023-09-12

    In a groundbreaking stride towards adaptable, generalist vision models, researchers from Microsoft Research Asia have unveiled InstructDiffusion.

    In a groundbreaking stride towards adaptable, generalist vision models, researchers from Microsoft Research Asia have unveiled InstructDiffusion. This innovative framework revolutionizes the landscape of computer vision by providing a unified interface for a multitude of vision tasks. The paper “InstructDiffusion: A Generalist Modeling Interface for Vision Tasks” introduces a model capable of seamlessly handling various vision applications simultaneously.

    At the heart of InstructDiffusion lies a novel approach: formulating vision tasks as human-intuitive image manipulation processes. Unlike conventional methods that rely on predefined output spaces, such as categories or coordinates, InstructDiffusion operates in a flexible pixel space, aligning more closely with human perception.

    The model is designed to alter input images based on textual instructions provided by the user. For instance, a directive like “encircle the man’s right eye in red” empowers the model for tasks like keypoint detection. At the same time, instructions like “apply a blue mask to the rightmost dog” serve segmentation purposes.

    The first and only AI project manager (Sponsored)

    Underpinning this framework are denoising diffusion probabilistic models (DDPM), which generate pixel outputs. Training data comprises triplets, each consisting of an instruction, source image, and target output image. The model is primed to tackle three main output types: RGB images, binary masks, and keypoints. This covers a wide array of vision tasks, including segmentation, keypoint detection, image editing, and enhancement.

    Keypoint Detection

    Keypoint Detection
    Keypoint Detection

    (a) Create a yellow circle around the right eye of the whale. (b) Mark the car logo with a blue circle.

    Segmentation

    Segmentation
    Segmentation

    (a) Mark the pixels of the cat in the mirror to blue and leave the rest unchanged. (b) Paint the pixels of shadow in blue and maintain the current appearance of the other pixels.

    Image Editing

    Image Editing
    Image Editing

    Image results generated by the model

    Low level tasks

    Low level tasks
    Low level tasks

    InstructDiffusion is also applicable to low-level vision tasks, including image deblurring, denoising, and watermark removal.

    Experiments demonstrate InstructDiffusion’s prowess, outperforming specialized models in individual tasks. However, the true marvel lies in its capacity for generalization. It exhibits the hallmark trait often associated with Artificial General Intelligence (AGI), adeptly adapting to tasks not encountered during training. This marks a significant stride towards a unified, flexible framework for computer vision, poised to advance the entire field.

    A key revelation was that concurrently training the model on diverse tasks notably amplified its ability to generalize to novel scenarios. InstructDiffusion exhibited remarkable proficiency on the HumanArt and AP-10K animal datasets for keypoint detection despite distinct data distributions compared to the training data.

    The research team underscored the critical importance of highly detailed instructions in enhancing the model’s generalization capabilities. Mere task names like “semantic segmentation” proved insufficient, yielding subpar performance, particularly on novel data types. This underscores InstructDiffusion’s ability to grasp specific meanings and intentions behind detailed instructions rather than relying on memorization.

    By emphasizing comprehension over memorization, InstructDiffusion learns robust visual concepts and semantic meanings. This distinction is pivotal in understanding its remarkable generalization capabilities. For example, an instruction like “encircle the cat’s left ear in red” enables the model to discern specific elements, such as “cat,” “left ear,” and “red circle,” showcasing its granular comprehension.

    This groundbreaking development catapults computer vision models towards becoming versatile generalists, mirroring human perception. InstructDiffusion’s interface introduces flexibility and interactivity absent in most current vision systems, bridging the gap between human and machine understanding in computer vision. The implications of this research are profound, as it paves the way for the development of capable multi-purpose vision agents, demonstrating its potential to propel general visual intelligence to new heights.