PickScore v1
yuvalkirstain/PickScore_v1
published Apr 2023 · updated May 2023
PickScore v1 is a zero-shot image scoring model that rates images generated from text according to predicted human preference.
specs
| Task | Zero-shot image scoring for human preference prediction |
| Architecture | CLIP-H (ViT-H-14) fine-tuned on Pick-a-Pic dataset |
| Training Data | Pick-a-Pic v1 dataset |
| Paper | Pick-a-Pic (arXiv:2305.01569) |
about this model
PickScore v1 is a zero-shot image scoring model that evaluates the alignment between a text prompt and a generated image, outputting a score that reflects how well the image matches the prompt. It was fine-tuned from CLIP-H (ViT-H-14) on the Pick-a-Pic dataset, a large open dataset of real user preferences for text-to-image generation. The model acts as a general scoring function for tasks such as human preference prediction, model evaluation, and image ranking.
Performance
PickScore exhibits superhuman performance on the task of predicting human preferences for generated images. According to the Pick-a-Pic paper, it correlates better with human rankings than other automatic evaluation metrics, making it a reliable tool for assessing text-to-image generation models without requiring human raters. The model is recommended for evaluating future text-to-image models and can be used to enhance existing models via ranking.
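One simple way to check how well a scoring function tracks human choices is pairwise agreement: for each pair of images shown to a rater, see whether the higher-scored image is the one the rater picked. The sketch below is illustrative only. The function name `pairwise_agreement` and the toy scores and labels are hypothetical, and this is not necessarily the exact metric used in the Pick-a-Pic paper (for example, it ignores ties).

```python
def pairwise_agreement(score_pairs, human_choices):
    """Fraction of pairs where the higher-scored image matches the human pick.

    score_pairs:   list of (score_a, score_b) tuples from any scoring function.
    human_choices: list of 0 (rater preferred image A) or 1 (rater preferred B).
    """
    if len(score_pairs) != len(human_choices):
        raise ValueError("score_pairs and human_choices must have equal length")
    if not score_pairs:
        raise ValueError("at least one pair is required")

    correct = 0
    for (score_a, score_b), choice in zip(score_pairs, human_choices):
        predicted = 0 if score_a >= score_b else 1  # pick the higher score
        if predicted == choice:
            correct += 1
    return correct / len(score_pairs)


# Toy, made-up values purely for illustration (not real PickScore outputs).
score_pairs = [(21.3, 19.8), (18.0, 22.5), (20.1, 20.4), (23.0, 17.2)]
human_choices = [0, 1, 0, 0]
accuracy = pairwise_agreement(score_pairs, human_choices)
print(f"pairwise agreement: {accuracy:.2f}")
```

Running the same check with different scoring functions on the same labeled pairs gives a rough, like-for-like comparison of how closely each one follows human preference.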
How It Works
The model takes a text prompt and one or more images as input. It embeds the prompt with the fine-tuned CLIP-H text encoder and each image with the CLIP-H image encoder, L2-normalizes the embeddings, and computes each image's score as the dot product of its embedding with the text embedding, multiplied by the learned logit scale. When multiple images are supplied, a softmax over their scores yields relative preference probabilities.
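The scoring arithmetic described above can be sketched in plain Python. This is a minimal illustration of the math only: it does not load the model, the embeddings are tiny hand-written vectors rather than real CLIP-H outputs, and the `logit_scale` value of 100.0 is an assumed placeholder for the learned multiplier. The helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pick_scores(text_emb, image_embs, logit_scale):
    """Scaled dot product between one text embedding and each image embedding."""
    t = l2_normalize(text_emb)
    scores = []
    for emb in image_embs:
        v = l2_normalize(emb)
        scores.append(logit_scale * sum(a * b for a, b in zip(t, v)))
    return scores


def softmax(scores):
    """Turn scores into relative preference probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional embeddings (assumption: real embeddings are much larger).
text_emb = [1.0, 0.0, 0.0]
image_embs = [
    [2.0, 0.0, 0.0],  # same direction as the text embedding
    [0.0, 3.0, 0.0],  # orthogonal to the text embedding
    [1.0, 1.0, 0.0],  # 45 degrees from the text embedding
]
logit_scale = 100.0  # assumed placeholder for the learned scale

scores = pick_scores(text_emb, image_embs, logit_scale)
probs = softmax(scores)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, probs, ranking)
```

Sorting by score, as in `ranking`, is all that is needed to rerank several candidate images for one prompt; the softmax step matters only when relative probabilities are wanted.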
Training Data
PickScore was trained on the Pick-a-Pic dataset v1, which contains prompts and real user preferences over generated images collected through a dedicated web application. The dataset is publicly available.
best for
- Predicting human preferences for generated images
- Ranking multiple generated images from a text prompt
- Evaluating text-to-image generation models
- Enhancing text-to-image models via reranking
FAQ
PickScore v1 is a scoring function for images generated from text, used for human preference prediction, image ranking, and model evaluation.
According to the Pick-a-Pic paper, PickScore correlates better with human rankings than other automatic metrics.
Input: a text prompt and one or more images. Output: scores (logits) and probabilities for each image.
To call PickScore v1, use the gigarouter OpenAI-compatible endpoint with your API key. Refer to the gigarouter documentation for endpoint details.
The model card does not specify a license; check the repository or paper for details.
We're benchmarking and onboarding PickScore v1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.