[CVPR 2024]

Language Models as Black-Box Optimizers for Vision-Language Models

Carnegie Mellon University

*Indicates Equal Contribution


Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic “hill-climbing” procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit “gradient” direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we demonstrate our framework on a state-of-the-art black-box VLM (DALL-E 3) for text-to-image optimization.

Methods Overview

Image illustrating ChatGPT interaction with VLMs

Similar to how human prompt engineers iteratively test and refine prompts, we employ ChatGPT to continuously optimize prompts for vision-language models (VLMs). Our iterative approach assesses the performance of ChatGPT-generated prompts on a few-shot dataset (highlighted in blue) and provides feedback (marked in violet) to ChatGPT through simple conversations, as depicted in the illustrative figure. This straightforward method delivers state-of-the-art results for one-shot image classification across 11 datasets using CLIP, operated in a black-box manner without accessing model weights, feature embeddings, or output logits. Remarkably, our approach outperforms both white-box methods such as gradient-based continuous prompting (CoOp) and human-engineered prompts in this extremely low-shot scenario. The figure shows only a typical conversation in ChatGPT's web user interface; our code implementation follows the same pattern using the ChatGPT API.
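The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: `score_fn` stands in for the few-shot accuracy of the black-box VLM, `ask_llm` stands in for a chat-API call, and the feedback wording, the helper names, and the toy stand-ins are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def build_feedback(scores: Dict[str, float], k: int = 2) -> str:
    """Format the k best and k worst prompts so far as one chat message."""
    k = max(1, k)
    ranked = sorted(scores, key=lambda p: scores[p], reverse=True)
    good = ranked[:k]
    bad = [p for p in ranked[-k:] if p not in good]
    lines = ["Good prompts:"] + [f"- {p}" for p in good]
    lines += ["Bad prompts:"] + [f"- {p}" for p in bad]
    lines.append("Propose new prompts that are more like the good ones.")
    return "\n".join(lines)


def hill_climb(
    initial_prompts: List[str],
    score_fn: Callable[[str], float],
    ask_llm: Callable[[str], List[str]],
    n_iters: int = 5,
    k: int = 2,
) -> Tuple[str, float]:
    """Each round sends textual feedback to the LLM and scores its replies."""
    scores = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_iters):
        for candidate in ask_llm(build_feedback(scores, k)):
            if candidate not in scores:  # avoid re-querying the black box
                scores[candidate] = score_fn(candidate)
    best = max(scores, key=lambda p: scores[p])
    return best, scores[best]


# Toy stand-ins (assumptions) so the sketch runs without any API access.
def toy_score(prompt: str) -> float:
    """Pretend accuracy: fraction of 'useful' words present in the prompt."""
    useful = {"a", "photo", "of", "{}"}
    return len(useful & set(prompt.split())) / len(useful)


def make_toy_llm(replies: List[str]) -> Callable[[str], List[str]]:
    """Return a fake LLM that ignores the feedback and emits canned replies."""
    queue = list(replies)

    def ask(message: str) -> List[str]:
        return [queue.pop(0)] if queue else []

    return ask
```

In a real setup, `score_fn` would query the VLM on the few-shot training set and `ask_llm` would append the feedback message to an ongoing chat conversation; only the prompt text and a scalar score cross the black-box boundary.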

Optimizing Text-to-Image (T2I) Generation

Image showing DALL-E 3 output

We apply our framework to optimize prompts for the state-of-the-art black-box generative VLM, DALL-E 3, using the chat-based multimodal LLM GPT4-V. For complicated user queries that DALL-E 3 initially fails to render correctly, we send the generated image (highlighted in violet) along with the current prompt to GPT4-V to ask for feedback on improvements (highlighted in red) and then generate a new prompt (highlighted in blue). We show that such a simple framework is surprisingly effective at correcting DALL-E 3 mistakes on some challenging Winoground text queries that involve action, logical, and spatial reasoning. We open-source our code to facilitate future research on AI-driven content generation.
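The generate, critique, and rewrite cycle can be sketched as follows. This is an illustrative sketch only: `generate_image` and `critique_and_rewrite` are hypothetical callables standing in for the text-to-image model and the multimodal LLM, images are represented as raw `bytes`, and the stopping rule (stop when the critic returns the prompt unchanged) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    prompt: str    # prompt sent to the image generator in this round
    feedback: str  # textual feedback returned by the critic


def refine_t2i_prompt(
    user_query: str,
    generate_image: Callable[[str], bytes],
    critique_and_rewrite: Callable[[str, str, bytes], Tuple[str, str]],
    max_rounds: int = 3,
) -> Tuple[str, List[Step]]:
    """Iteratively rewrite a prompt using feedback on the generated image."""
    prompt = user_query
    history: List[Step] = []
    for _ in range(max_rounds):
        image = generate_image(prompt)
        feedback, new_prompt = critique_and_rewrite(user_query, prompt, image)
        history.append(Step(prompt, feedback))
        if new_prompt == prompt:  # critic requested no change
            break
        prompt = new_prompt
    return prompt, history


# Toy stand-ins (assumptions): the "image" is just the encoded prompt.
def toy_generate(prompt: str) -> bytes:
    return prompt.encode()


def toy_critic(query: str, prompt: str, image: bytes) -> Tuple[str, str]:
    if "left" not in image.decode():
        return "missing spatial relation", prompt + ", with the cat on the left"
    return "looks correct", prompt
```

Calling `refine_t2i_prompt("a cat and a dog", toy_generate, toy_critic)` runs two rounds with these stand-ins: one that adds the spatial phrase and one in which the critic accepts the result.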

Prompt Inversion using Chat-based Multimodal LLMs

Given a user-specified reference (query) image, our framework reverse-engineers the prompt to have DALL-E generate the same object or scene in the query image. This enables users to easily make customizations, such as having the character in a reference image perform various actions or change scenes.

Visualization of Prompt Inversion Process

We apply our framework to reverse-engineer a text prompt that regenerates the user-queried image. We send the generated image (highlighted in violet) along with the original image to GPT4-V to ask for feedback on improvements (highlighted in red) and then generate a new prompt (highlighted in blue).

Image customization based on inverted images

Each row shows the user query image, the inverted image, and five customized generations (Examples 1 to 5) produced with the prompts below.

Dog
Example 1: Give the dog a cat friend.
Example 2: Make the dog be in the middle of a jump.
Example 3: Make the dog do a handstand.
Example 4: Make the dog lie down on its side.
Example 5: Make the dog swim in water.

Owl
Example 1: Make the owl fight a hawk.
Example 2: Make the owl flap its wings.
Example 3: Make the owl fully green.
Example 4: Make the owl stand in front of the moon.
Example 5: Make the owl walk in the city.

Comparison of our method with other baselines on one-shot classification tasks.


We report the average accuracy of each method across three folds, optimized using 1-shot training sets. We bold the best black-box result for each dataset and underline the second-best result. First, we note that our approach effectively improves upon the initial prompts selected from LAIONCOCO-1M, raising accuracy from 56% to 61%. Our approach is also competitive with the best Human-Engineered prompts released by OpenAI, which were selected using test-set performance. Additionally, we show that using both positive and negative prompts improves the overall accuracy by 1%. For reference, we report oracle white-box approaches in gray. Remarkably, we also surpass white-box solutions such as WiSE-FT and CoOp by 1.5%. These methods require either gradient-based fine-tuning (CoOp/WiSE-FT/Cross-Modal) or prompt ensembling using output logits (DCLIP). While our approach is less effective than the SOTA white-box method (Cross-Modal Adaptation), we stress that our black-box setup is significantly more challenging, because we restrict the optimization space to natural language and do not access the pre-trained weights, model architectures, feature embeddings, or output logits of VLMs.

Example resulting templates on each dataset.

Dataset: Example of Top Templates
Caltech: An image of a {} with a blurred background that emphasizes the subject
DTD: The essential elements of {} are amplified with visual simplicity
EuroSAT: A top-down view of {} arranged in a pattern {}
Aircraft: A clear, high-quality image of a single {} with a white background
Food: A {} featuring diverse cuisine and ingredients
ImageNet: An image of a {} with bright and natural lighting
Flowers: A clear and vivid photograph of the {} in its natural setting
Pets: A {} with distinct and recognizable characteristics
Cars: A {} featuring a wide range of color options for easy selection
SUN: A high-resolution photo of a {} with clear background and natural lighting
UCF: A black and white photo of a {} in motion

Although we do not provide ChatGPT with any information regarding the targeted dataset, we observe that the resulting templates are remarkably similar to human-engineered templates, with many domain-specific details such as “motion” and “cuisine”, and stylistic elements such as “bright and natural lighting”.
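To make concrete how a template with a `{}` slot can be scored in a black-box setting, here is a small sketch. It assumes a hypothetical `predict` callable that receives an image and a list of candidate texts and returns only the index of the chosen text (no logits or embeddings); the function names and the choice to fill every `{}` with the class name are illustrative assumptions, not the authors' stated procedure.

```python
from typing import Callable, List, Sequence, Tuple


def fill_template(template: str, class_name: str) -> str:
    """Insert the class name into every '{}' slot of the template."""
    # str.replace handles templates with more than one placeholder,
    # where str.format with a single argument would raise an error.
    return template.replace("{}", class_name)


def black_box_accuracy(
    template: str,
    class_names: List[str],
    examples: Sequence[Tuple[object, int]],
    predict: Callable[[object, List[str]], int],
) -> float:
    """Accuracy of one template, using only the predicted label per image."""
    if not examples:
        raise ValueError("need at least one labeled example")
    texts = [fill_template(template, name) for name in class_names]
    correct = sum(predict(image, texts) == label for image, label in examples)
    return correct / len(examples)


# Toy stand-in (assumption): an "image" is a string naming its content,
# and the fake model picks the first text that mentions it.
def toy_predict(image: object, texts: List[str]) -> int:
    return next(i for i, text in enumerate(texts) if str(image) in text)
```

With a 1-shot training set, `examples` would hold one labeled image per class, and the returned accuracy is the scalar score that a prompt search would try to increase.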


        title={Language Models as Black-Box Optimizers for Vision-Language Models}, 
        author={Shihong Liu and Zhiqiu Lin and Samuel Yu and Ryan Lee and Tiffany Ling and Deepak Pathak and Deva Ramanan},