Affiliation: Okayama University of Science
Abstract: As stable diffusion models capable of high-quality image generation attract growing attention, evaluating their outputs has become important. Metrics for synthetic images fall into two categories: the first focuses on image quality, while the second focuses on text-image alignment. For text-image alignment, a metric called the Text-Image Alignment Metric (TIAM) has been proposed; based on a prompt template, it checks whether the content specified in a prompt actually appears in the corresponding generated image. TIAM enables a comprehensive evaluation of the alignment between text prompts and images with respect to the type, number, and color of the specified objects. To identify the specified objects, TIAM relies on YOLOv8, a pre-trained object detection and segmentation model. However, pre-trained models cannot handle classes or image styles absent from their training data without fine-tuning, and for non-expert users, fine-tuning a model or preparing a training dataset is difficult. In this paper, we extend TIAM to support a wider range of classes and styles by utilizing the attention maps acquired during the image generation process together with a vision-language model (e.g., BLIP2). The experimental results indicate that the proposed method can evaluate diverse images without requiring additional steps such as fine-tuning.
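The following is a minimal Python sketch, not the authors' implementation, of the two ingredients the abstract describes: recording cross-attention maps while a Stable Diffusion pipeline generates an image, and asking a vision-language model (BLIP2) whether the prompted object is present. The model checkpoints ("runwayml/stable-diffusion-v1-5", "Salesforce/blip2-opt-2.7b"), the example prompt, the 16x16 attention-map cutoff, and the yes/no question format are all illustrative assumptions.

import torch
from diffusers import StableDiffusionPipeline
from transformers import Blip2Processor, Blip2ForConditionalGeneration

class CrossAttnStore:
    """Records cross-attention maps (text tokens -> image patches) during sampling."""
    def __init__(self):
        self.maps = []

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        probs = attn.get_attention_scores(query, key, attention_mask)
        # Keep only 16x16-resolution cross-attention maps (256 image patches),
        # a common choice for localizing prompt tokens; this cutoff is an assumption.
        if is_cross and probs.shape[1] == 256:
            self.maps.append(probs.detach().cpu())
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        return attn.to_out[1](attn.to_out[0](out))  # linear projection + dropout

device = "cuda"  # the sketch assumes a GPU is available

# 1) Generate an image while collecting cross-attention maps.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to(device)
store = CrossAttnStore()
pipe.unet.set_attn_processor(store)  # apply the recording processor to all layers
prompt = "a photo of a red cat"      # one instance of an assumed prompt template
image = pipe(prompt, num_inference_steps=30).images[0]
print(f"collected {len(store.maps)} cross-attention maps")

# 2) Ask BLIP2 whether the specified object (with its attribute) appears.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to(device)
question = "Question: Is there a red cat in the image? Answer:"
inputs = processor(images=image, text=question, return_tensors="pt").to(device, torch.float16)
answer = processor.decode(blip2.generate(**inputs)[0], skip_special_tokens=True)
print("aligned" if "yes" in answer.lower() else "not aligned")

A TIAM-style evaluation would repeat this check over many prompt-template instances and random seeds and report the success rate; the attention maps collected in step 1 can be used to localize each prompted object before the vision-language check.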
Publications related to this research:
(International conference paper)
- Haruno Fusa, Chonho Lee, Sakuei Onishi, Hiromitsu Shiina, "Metric for Evaluating Stable Diffusion Models Using Attention Maps", Proc. of the International Conference on Foundation and Large Language Models (FLLM2024), pp. 535-541, 2024, November
(Domestic conference/workshop)
- Haruno Fusa, Chonho Lee, Sakuei Onishi, Hiromitsu Shiina, "Text-to-Image モデルにおける多属性に対応したテンプレートベース評価手法", 人工知能学会JSAI2025大会
Posted: March 31, 2025