[CVPR 2024] Language Models as Black-Box Optimizers for Vision-Language

Abstract

Vision-language models (VLMs) pre-trained on webscale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic “hill-climbing” procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit “gradient” direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we demonstrate our framework on a state-of-the-art black-box VLM (DALL-E 3) for text-to-image optimization.

Methods Overview

Image illustrating ChatGPT interaction with VLMs

Similar to how human prompt engineers iteratively test and refine prompts, we employ ChatGPT to continuously optimize prompts for vision-language models (VLMs). Our iterative approach assesses the performance of ChatGPT-generated prompts on a few-shot dataset (highlighted in blue) and provides feedback (marked in violet) to ChatGPT through simple conversations, as depicted in the illustrative figure. This straightforward method delivers state-of-the-art results for one-shot image classification across 11 datasets using CLIP, operated in a black-box manner without accessing model weights, feature embeddings, or output logits. Remarkably, our approach outperforms both white-box methods such as gradient-based continuous prompting (CoOp) and human-engineered prompts in this extremely low-shot scenario. This figure only shows a typical conversation using ChatGPT's web user interface. Our code implementation follows this pattern using the ChatGPT API.

Optimizing Text-to-Image (T2I) Generation

Using chat-based multimodal LLMs, we apply our framework to optimize prompts for the state-of-the-art black-box generative VLM, DALL-E 3, using the multimodal GPT4-V. For complicated user queries that DALL-E 3 may initially fail to generate, we send the generated image (highlighted in violet) along with the current prompt to GPT4-V to ask for feedback on improvements (highlighted in red) and then generate a new prompt (highlighted in blue). We show that such a simple framework is surprisingly effective at correcting DALL-E 3 mistakes on some challenging Winoground text queries that involve action, logical, and spatial reasoning. We open-source our code at this link to facilitate future research on AI-driven content generation.

The unmasked wrestler hits the masked wrestler.

Initial Image

Optimized Image

The person with earrings pays the person without earrings.

Initial Image

Optimized Image

A bird eats a snake.

Initial Image

Optimized Image

A shorter person is covering the eyes of a taller person.

Initial Image

Optimized Image

There is less milk than orange juice.

Initial Image

Optimized Image

Getting a horse wet.

Initial Image

Optimized Image

Some are parking in a train.

Initial Image

Optimized Image

The white wall will soon be painted blue.

Initial Image

Optimized Image

Prompt Inversion using Chat-based Multimodal LLMs

Given a user-specified reference (query) image, our framework reverse-engineers the prompt to have DALL-E generate the same object or scene in the query image. This enables users to easily make customizations, such as having the character in a reference image perform various actions or change scenes.

Visualization of Prompt Inversion Process

We apply our framework to reverse engineer the text prompt to generate the same user-queried image. We send the generated image (highlighted in violet) along with the original image to GPT4-V to ask for feedback on improvements (highlighted in red) and then generate a new prompt (highlighted in blue).

User Query

Initial Image

Final Image

Create a digital artwork of a stylized, geometric rhinoceros head with a dynamic array of sharp, crystalline facets in a monochromatic palette of black, white, and gray. The design should feature intricate shadows and highlights to produce a three-dimensional illusion, with a focus on accurately representing the creature's contours and muscle structure. Adjust the composition to show the rhinoceros head from a frontal perspective, ensuring that both the horn and the ears are symmetrically aligned in the center. Emphasize the geometric nature of the facets by making them more pronounced and varied in shape, creating a complex mosaic that captures the interplay of light and shadow. Add a slight glow to the edges of the facets to enhance the three-dimensional effect and the metallic quality of the artwork. Display the rhinoceros head against a pitch-black background, with a light source positioned to cast dramatic, high-contrast illumination that emphasizes its multifaceted texture. Incorporate a subtle reflective sheen on the surface to suggest a sleek, metallic finish, and ensure the rhinoceros's eye is detailed and expressive, contributing to the overall lifelike appearance of the artwork.

User Query

Initial Image

Final Image

A hyper-realistic full slice of an orange with intricate details, including the textured pulp and clearly defined rind, positioned off-center on a reflective gradient surface transitioning from white to dark. The orange's juicy texture is accentuated by a dynamic splash of juice, with droplets captured mid-air, creating an energetic and lively scene. The lighting is dramatic and contrasting, with a spotlight effect casting a pronounced shadow to one side to enhance the three-dimensional effect and emphasize the vibrant orange color. Include a clear reflection on the surface and a small stem attached to the orange slice to underscore the realism and freshness. Enhance the composition by ensuring the orange slice is angled slightly, with the splash of juice originating from the lower right side, to add a sense of motion and vitality.

User Query

Initial Image

Final Image

A medieval knight in full armor stands with a shield, the dark background highlighting his silhouette against a subtle warm glow. His helmet features a visor with a single vertical slit, and his armor includes a chainmail coif beneath a segmented plate gorget and articulated plate gauntlets, with layered plate armor and flared ridged pauldrons. The knight's shield is centered and bears a detailed, embossed golden fleur-de-lis on a field of weathered steel, surrounded by rivets. The vibrant orange cloak drapes over both shoulders and behind his back, adding a touch of regal color to the composition. His stance is grounded and balanced, with his left arm extended, presenting the shield, and his right hand resting on the pommel of his sword, exuding a calm and noble demeanor.

User Query

Initial Image

Final Image

Create a stylized illustration of a dove in flight, with feathers that transition smoothly through a spectrum of colors including red, orange, yellow, green, blue, indigo, and violet. The dove's plumage should resemble a dynamic, three-dimensional arrangement of vibrant, overlapping feathers, giving a sense of movement and freedom. The style should be a fusion of semi-realistic and digital art, with a focus on vivid colors and a clean, light background that emphasizes the artwork's lively and spirited nature. Adjust the feather arrangement to be more structured and flame-like, with the feathers at the tips being more elongated and pointed to enhance the sense of elegance and flow.

User Query

Initial Image

Final Image

Create an illustration of a stylized, geometric dinosaur with a textured body in two shades of green: a lighter green for the main body and a darker green for the spiky plates along its back. The dinosaur should have a friendly demeanor, with a long, curved tail and a smooth, rounded head featuring two small, circular white eyes with black pupils. It should stand on two legs with small, rounded feet, each with three visible toes. The background should be a flat, light beige color, with a simple, elongated shadow extending to the right of the dinosaur, indicating a soft light source to the left.

User Query

Initial Image

Final Image

Generate an image of a cartoon-style polar bear with gleefully closed eyes and a wide, toothy grin, revealing just a hint of its tongue. The bear should look exuberant, standing on its hind legs with arms open wide as if ready for a hug. The bear's fur should appear extremely soft and fluffy, with a pronounced blush of rosy pink on both cheeks and belly, enhancing its charm. Adorn the bear with a cozy, chunky-knit scarf, vibrant red with prominent, horizontal white stripes, stylishly wrapped around its neck and draping with a dense tassel fringe at the ends. Situate the bear against a gentle pastel pink backdrop, scattered with delicate, small snowflakes, conveying the splendor and coziness of festive winter cheer.

User Query

Initial Image

Final Image

An anthropomorphic duck standing confidently with hands on hips, styled as a classic film noir detective. The duck has a calm and cool expression, wearing a tan detective's fedora hat and a matching double-breasted trench coat, buttoned up, with a broad collar, epaulets, and a belted waist. The character has a white shirt and a patterned tie with a diagonal stripe design underneath. The character has orange webbed feet and a large, prominent beak. The lighting is dramatic, with a strong contrast between light and shadow, creating a focused shadow on the background that mimics the character's silhouette. The overall color palette is warm with a gentle light source coming from the side, casting the background in a gradient from warm beige to shadows, giving the image a mysterious and dramatic appearance.

Image customization based on inverted images

User Query	Inverted Image	Example 1	Example 2	Example 3	Example 4	Example 5
		Give the dog a cat friend.	Make the dog be in the middle of a jump.	Make the dog do a handstand.	Make the dog lie down on its side.	Make the dog swim in water.
		Make the owl fight a hawk.	Make the owl flap its wings.	Make the owl fully green.	Make the owl stand in front of the moon.	Make the owl walk in the city.

Comparison of our method with other baselines on one-shot classification tasks.

We report the average accuracy of each method across three folds, optimized using 1-shot training sets. We bold the best black-box result for each dataset, and underline the second best result. First, we note that our approach can effectively improve upon the initial prompts selected from LAIONCOCO-1M from 56% to 61%. Our approach is also competitive against the best Human-Engineered prompts released by OpenAI searched using test set performance. Additionally, we show that using both positive and negative prompts improves the overall accuracy by 1%. For reference, we report oracle white-box approaches in gray. Remarkably, we also surpass white-box solutions such as WiSE-FT and CoOp by 1.5%. These methods require either gradient-based fine-tuning (CoOp/WiSE-FT/Cross-Modal) or prompt ensembling using output logits (DCLIP). While our approach is less effective than the SOTA white-box method (Cross-Modal Adaptation), we stress that our black-box setup is significantly more challenging, because we restrict the optimization space to natural language and do not access the pre-trained weights, model architectures, feature embeddings, and output logits of VLMs.

Example resulting templates on each dataset.

Dataset	Example of Top Templates
Caltech	An image of a {} with a blurred background that emphasizes the subject
DTD	The essential elements of {} are amplified with visual simplicity
EuroSAT	A top-down view of {} arranged in a pattern {}
Aircraft	A clear, high-quality image of a single {} with a white background
Food	A {} featuring diverse cuisine and ingredients
ImageNet	An image of a {} with bright and natural lighting
Flowers	A clear and vivid photograph of the {} in its natural setting
Pets	A {} with distinct and recognizable characteristics
Cars	A {} featuring a wide range of color options for easy selection
SUN	A high-resolution photo of a {} with clear background and natural lighting
UCF	A black and white photo of a {} in motion

Although we do not provide ChatGPT with any information regarding the targeted dataset, we observe that the resulting templates are remarkably similar to human-engineered templates, with many domain-specific details such as “motion” and “cuisine”, and stylistic elements such as “bright and natural lighting”.

BibTeX

@article{liu2023language,
  title={Language models as black-box optimizers for vision-language models},
  author={Liu, Shihong and Lin, Zhiqiu and Yu, Samuel and Lee, Ryan and Ling, Tiffany and Pathak, Deepak and Ramanan, Deva},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

[CVPR 2024]

Language Models as Black-Box Optimizers for Vision-Language Models

Abstract

Methods Overview

Optimizing Text-to-Image (T2I) Generation

Prompt Inversion using Chat-based Multimodal LLMs

Image customization based on inverted images

Comparison of our method with other baselines on one-shot classification tasks.

Example resulting templates on each dataset.

BibTeX