Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic “hill-climbing” procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit “gradient” direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we demonstrate our framework on a state-of-the-art black-box VLM (DALL-E 3) for text-to-image optimization.
Similar to how human prompt engineers iteratively test and refine prompts, we employ ChatGPT to continuously optimize prompts for vision-language models (VLMs). As depicted in the figure, our iterative approach assesses the performance of ChatGPT-generated prompts on a few-shot dataset (highlighted in blue) and provides feedback (marked in violet) to ChatGPT through simple conversations. This straightforward method delivers state-of-the-art results for one-shot image classification across 11 datasets using CLIP, operating in a black-box manner without access to model weights, feature embeddings, or output logits. Remarkably, in this extremely low-shot scenario, our approach outperforms both white-box methods such as gradient-based continuous prompting (CoOp) and human-engineered prompts. The figure shows a typical conversation through ChatGPT's web user interface for illustration; our code implementation follows the same pattern using the ChatGPT API.
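To make the loop concrete, below is a minimal sketch of the conversational hill-climbing search, assuming the OpenAI Python client and the open-source CLIP package; the prompt wording, model choices, and hyperparameters are illustrative assumptions rather than our released implementation.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from openai import OpenAI        # pip install openai

llm = OpenAI()                   # assumes OPENAI_API_KEY is set in the environment
device = "cuda" if torch.cuda.is_available() else "cpu"
vlm, preprocess = clip.load("ViT-B/16", device=device)

def evaluate_template(template, images, labels, classnames):
    """1-shot accuracy when `template` wraps every class name. Only this scalar score
    is fed back to ChatGPT: no weights, embeddings, or logits ever leave CLIP."""
    with torch.no_grad():
        text = clip.tokenize([template.format(c) for c in classnames]).to(device)
        t = vlm.encode_text(text)
        t = t / t.norm(dim=-1, keepdim=True)
        v = vlm.encode_image(images.to(device))      # `images`: batch preprocessed with `preprocess`
        v = v / v.norm(dim=-1, keepdim=True)
        pred = (v @ t.T).argmax(dim=-1).cpu()
    return (pred == labels).float().mean().item()

def optimize(initial_templates, images, labels, classnames, iters=30, k=5):
    scored = [(evaluate_template(t, images, labels, classnames), t) for t in initial_templates]
    for _ in range(iters):
        scored.sort(reverse=True)
        top, bottom = scored[:k], scored[-k:]        # positive and negative prompts
        feedback = (
            "I am writing prompt templates for an image classifier; each template "
            "must contain '{}' as a placeholder for the class name.\n"
            "Templates that scored well:\n"
            + "\n".join(f"{acc:.1%}: {t}" for acc, t in top)
            + "\nTemplates that scored poorly:\n"
            + "\n".join(f"{acc:.1%}: {t}" for acc, t in bottom)
            + "\nPropose one new template likely to score even higher. "
            "Reply with the template only."
        )
        reply = llm.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": feedback}],
        )
        new_template = reply.choices[0].message.content.strip()
        scored.append((evaluate_template(new_template, images, labels, classnames), new_template))
    return max(scored)[1]                            # best template found
```

Feeding back both the best- and worst-scoring templates in every round is the conversational feedback described above: it gives the LLM an implicit “gradient” direction for the next proposal.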
We also apply our framework to optimize prompts for the state-of-the-art black-box generative VLM, DALL-E 3, using the chat-based multimodal LLM GPT-4V. For complicated user queries that DALL-E 3 may initially fail to render, we send the generated image (highlighted in violet) along with the current prompt to GPT-4V, ask for feedback on how to improve it (highlighted in red), and then generate a new prompt (highlighted in blue). We show that this simple framework is surprisingly effective at correcting DALL-E 3's mistakes on challenging Winoground text queries that involve action, logical, and spatial reasoning. We open-source our code at this link to facilitate future research on AI-driven content generation.
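The same conversational pattern extends to text-to-image generation. Below is a minimal sketch of the loop, assuming the OpenAI API for both DALL-E 3 and a GPT-4V-class model; the model names, prompt wording, and fixed number of rounds are illustrative assumptions rather than our exact implementation.

```python
from openai import OpenAI

client = OpenAI()

def refine_prompt(user_query: str, rounds: int = 3) -> str:
    """Iteratively rewrite a DALL-E 3 prompt using GPT-4V feedback on the generated image."""
    prompt = user_query
    for _ in range(rounds):
        # Black-box call to DALL-E 3; only the resulting image URL is used.
        image_url = client.images.generate(
            model="dall-e-3", prompt=prompt, n=1, size="1024x1024"
        ).data[0].url
        critique = client.chat.completions.create(
            model="gpt-4-vision-preview",   # any GPT-4V-class multimodal chat model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text":
                        f"The target caption is: '{user_query}'.\n"
                        f"The current prompt is: '{prompt}'.\n"
                        "Look at the generated image, explain what does not match the "
                        "caption, then write an improved prompt on the last line."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        # Take the last non-empty line of the feedback as the revised prompt.
        lines = [l for l in critique.choices[0].message.content.splitlines() if l.strip()]
        prompt = lines[-1].strip()
    return prompt
```

For example, `refine_prompt("A bird eats a snake.")` runs the same generate-critique-rewrite rounds illustrated by the galleries below.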
Example Winoground text queries, each shown in the gallery with DALL-E 3's initial image and the optimized image produced by our feedback loop:

- The unmasked wrestler hits the masked wrestler.
- The person with earrings pays the person without earrings.
- A bird eats a snake.
- A shorter person is covering the eyes of a taller person.
- There is less milk than orange juice.
- Getting a horse wet.
- Some are parking in a train.
- The white wall will soon be painted blue.
Given a user-specified reference (query) image, our framework reverse-engineers a prompt that makes DALL-E generate the same object or scene as in the query image. This enables users to easily make customizations, such as having the character in a reference image perform various actions or appear in different scenes.

We apply our framework to reverse-engineer the text prompt that reproduces a user-queried image. We send the generated image (highlighted in violet) along with the original query image to GPT-4V, ask for feedback on improvements (highlighted in red), and then generate a new prompt (highlighted in blue).
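A minimal sketch of this inversion loop is given below, again assuming the OpenAI API; the caption-based initialization, model names, and prompt wording are illustrative assumptions rather than our exact implementation.

```python
from openai import OpenAI

client = OpenAI()

def gpt4v(content: list) -> str:
    """Single-turn call to a GPT-4V-class model with mixed text/image content."""
    reply = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
    )
    return reply.choices[0].message.content

def invert_prompt(reference_url: str, rounds: int = 3) -> str:
    # Start from GPT-4V's own caption of the reference image.
    prompt = gpt4v([
        {"type": "text", "text": "Write a detailed DALL-E prompt that would reproduce this image."},
        {"type": "image_url", "image_url": {"url": reference_url}},
    ]).strip()
    for _ in range(rounds):
        generated_url = client.images.generate(
            model="dall-e-3", prompt=prompt, n=1, size="1024x1024"
        ).data[0].url
        feedback = gpt4v([
            {"type": "text", "text":
                f"The first image is the user's reference; the second was generated from the "
                f"prompt '{prompt}'. Describe the differences, then write an improved prompt "
                "on the last line so the generation matches the reference."},
            {"type": "image_url", "image_url": {"url": reference_url}},
            {"type": "image_url", "image_url": {"url": generated_url}},
        ])
        lines = [l for l in feedback.splitlines() if l.strip()]
        prompt = lines[-1].strip()
    return prompt
```

Once a faithful prompt is recovered, customizations (e.g., new actions or scenes for the same character) can be made by editing that prompt before regenerating.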
[Image galleries: each example shows the user query image, DALL-E's initial image, and the final image after prompt optimization. A second gallery shows, for each user query, the inverted image along with five customized variations (Examples 1–5).]
We report the average accuracy of each method across three folds, optimized using 1-shot training sets. We bold the best black-box result for each dataset and underline the second-best result. First, we note that our approach effectively improves upon the initial prompts selected from LAIONCOCO-1M, from 56% to 61%. Our approach is also competitive with the best human-engineered prompts released by OpenAI, which were selected using test-set performance. Additionally, we show that using both positive and negative prompts improves the overall accuracy by 1%. For reference, we report oracle white-box approaches in gray. Remarkably, we also surpass white-box solutions such as WiSE-FT and CoOp by 1.5%; these methods require either gradient-based fine-tuning (CoOp/WiSE-FT/Cross-Modal) or prompt ensembling using output logits (DCLIP). While our approach is less effective than the SOTA white-box method (Cross-Modal Adaptation), we stress that our black-box setup is significantly more challenging, because we restrict the optimization space to natural language and do not access the pre-trained weights, model architectures, feature embeddings, or output logits of VLMs.
| Dataset | Example of Top Templates |
|---|---|
| Caltech | An image of a {} with a blurred background that emphasizes the subject |
| DTD | The essential elements of {} are amplified with visual simplicity |
| EuroSAT | A top-down view of {} arranged in a pattern {} |
| Aircraft | A clear, high-quality image of a single {} with a white background |
| Food | A {} featuring diverse cuisine and ingredients |
| ImageNet | An image of a {} with bright and natural lighting |
| Flowers | A clear and vivid photograph of the {} in its natural setting |
| Pets | A {} with distinct and recognizable characteristics |
| Cars | A {} featuring a wide range of color options for easy selection |
| SUN | A high-resolution photo of a {} with clear background and natural lighting |
| UCF | A black and white photo of a {} in motion |
Although we do not provide ChatGPT with any information about the target dataset, we observe that the resulting templates are remarkably similar to human-engineered templates, containing domain-specific details such as “motion” and “cuisine” as well as stylistic elements such as “bright and natural lighting”.
@inproceedings{liu2023language,
  title={Language models as black-box optimizers for vision-language models},
  author={Liu, Shihong and Lin, Zhiqiu and Yu, Samuel and Lee, Ryan and Ling, Tiffany and Pathak, Deepak and Ramanan, Deva},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}