AttriStory

Abstract

Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods keep a character consistent across scenes, they provide no systematic way to ensure that fine-grained attributes, such as the colors and textures of clothing and accessories, are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark for attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using a Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during the early denoising steps, when the model establishes structural and semantic content. We achieve this through the AttriLoss objective, which maximizes alignment between the cross-attention maps of desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach is orthogonal to existing consistency mechanisms and integrates with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements when AttriLoss is incorporated into each baseline. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation.

The AttriStory Benchmark

Existing visual storytelling benchmarks (e.g., ConsiStory) capture narratives through minimal descriptions such as "A photo of a young boy riding a bike through golden fields." This is insufficient for professional creative workflows where artists specify rich details: "A watercolor illustration of Oliver, a lively 8-year-old boy with sandy blond hair and brown eyes, riding a green bike wearing a red hoodie past golden fields."

We define this as the attribute realization problem: ensuring that explicitly specified, fine-grained visual attributes (e.g., "red shirt", "blue sneakers") are actually realized in generated images, orthogonal to cross-scene character consistency.

**Figure 2.** Comparison of story narratives in prior benchmarks vs. ours. ConsiStory (top) provides minimal visual specifications capturing only basic character identity and actions. AttriStory (bottom) enriches narratives with explicit positive (\(P^+\)) and negative (\(P^-\)) attribute-object pairs for each scene, enabling systematic evaluation of fine-grained attribute realization.

We use ChatGPT to generate stories with the following structured components per scene:

Character Description: A detailed multi-attribute specification (e.g., "Maya, a 25-year-old woman with curly brown hair")
Scene Narratives: Scene-specific text descriptions enriched with fine-grained visual attributes
Positive Pairs \(P^+\): Attribute-object pairs that should co-occur (e.g., ["pink", "dress"], ["white", "lilies"])
Negative Pairs \(P^-\): Pairs that should not co-occur (e.g., ["pink", "lilies"], ["white", "dress"]), preventing spurious associations

The benchmark comprises 200 multi-scene stories (5 scenes each) across 10 distinct artistic styles: photo, cartoon style, 3D animation, watercolor illustration, oil painting, crayon drawing, neon punk style, Pixar-style, hyperrealistic digital painting, and pastel color painting. Each scene is annotated with 2 to 5 attribute-object pairs.
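One way to picture a benchmark entry is as a small record per story. The sketch below is a hypothetical schema, not the released file format: the class names `Scene` and `Story`, the field names, the placement of the character description at story level, and the reading of "2 to 5 attribute-object pairs" as a bound on the positive pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pair = Tuple[str, str]  # (attribute, object), e.g. ("pink", "dress")

# The 10 artistic styles listed in the text.
STYLES = {
    "photo", "cartoon style", "3D animation", "watercolor illustration",
    "oil painting", "crayon drawing", "neon punk style", "Pixar-style",
    "hyperrealistic digital painting", "pastel color painting",
}


@dataclass
class Scene:
    narrative: str
    positive_pairs: List[Pair]  # P+: attribute and object should co-occur
    negative_pairs: List[Pair]  # P-: attribute and object should not co-occur


@dataclass
class Story:
    style: str
    character: str  # assumed to be stored once per story
    scenes: List[Scene]


def validate(story: Story) -> List[str]:
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    if story.style not in STYLES:
        problems.append(f"unknown style: {story.style!r}")
    if len(story.scenes) != 5:
        problems.append(f"expected 5 scenes, got {len(story.scenes)}")
    for idx, scene in enumerate(story.scenes, start=1):
        # Assumption: the 2-to-5 bound applies to the positive pairs.
        if not 2 <= len(scene.positive_pairs) <= 5:
            problems.append(f"scene {idx}: expected 2 to 5 positive pairs")
        clash = set(scene.positive_pairs) & set(scene.negative_pairs)
        if clash:
            problems.append(f"scene {idx}: pairs in both P+ and P-: {sorted(clash)}")
    return problems


# Illustrative entry assembled from the examples quoted in the text.
scene = Scene(
    narrative="Maya wearing a pink dress holding a bouquet of white lilies",
    positive_pairs=[("pink", "dress"), ("white", "lilies")],
    negative_pairs=[("pink", "lilies"), ("white", "dress")],
)
story = Story(
    style="watercolor illustration",
    character="Maya, a 25-year-old woman with curly brown hair",
    scenes=[scene] * 5,  # placeholder: a real story has 5 different scenes
)
print(validate(story))
```

A check like this is mainly useful for LLM-generated data, where a pair can end up in both \(P^+\) and \(P^-\) or a story can come back with the wrong number of scenes.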

**Figure 3.** LLM-driven benchmark generation. The pipeline inputs artistic styles and structured instructions emphasizing explicit, fine-grained attribute specifications. For each story, the LLM chooses an artistic style and generates character descriptions, scene narratives, and positive (\(P^+\)) and negative (\(P^-\)) attribute-object pairs.

AttriLoss: Targeted IoU Loss for Attribute-Object Grounding

**Figure 4.** AttriLoss: Targeted IoU loss on cross-attention maps. Our method optimizes spatial overlap between attention maps of attribute-object token pairs during early denoising steps. By maximizing IoU for positive pairs (e.g., *pink* and *dress* should co-occur) and minimizing IoU for negative pairs (e.g., *pink* and *lilies* should not overlap), we guide the model to correctly localize fine-grained attributes.

A key challenge in visual storytelling is imperfect alignment between prompt-specified attributes and generated visual objects. For a prompt like "wearing a pink dress holding a bouquet of white lilies," attention maps for tokens "pink" and "lilies" may overlap spatially even though "pink" should only attend to "dress". This ambiguous overlap causes the model to generate, e.g., pink roses instead of white lilies.

Our key observation is that cross-attention maps provide a direct visualization of attribute-object associations. We explicitly manipulate the spatial overlap between maps to encourage correct and discourage spurious associations.

IoU between attention maps

For each text token \(k\), let \(\mathbf{A}_k \in \mathbb{R}^{H \times W}\) be its spatial cross-attention map aggregated across U-Net layers. The Intersection-over-Union between two maps is:

\begin{align} \text{IoU}(\mathbf{A}_i, \mathbf{A}_j) = \frac{\sum_{x,y} (\mathbf{A}_i \odot \mathbf{A}_j)_{x,y}}{\max\left(\sum_{x,y} (\mathbf{A}_i + \mathbf{A}_j - \mathbf{A}_i \odot \mathbf{A}_j)_{x,y},\; \epsilon\right)} \end{align}

where \(\odot\) denotes element-wise multiplication, the sums run over the \(H \times W\) spatial positions so that the IoU is a scalar, and \(\epsilon\) is a small constant for numerical stability.
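The IoU above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the numerator and denominator are each summed over all spatial positions so that the result is a scalar, and the function name `soft_iou` and the toy 2x2 maps are hypothetical.

```python
from typing import List

AttentionMap = List[List[float]]  # H x W grid of non-negative attention weights


def soft_iou(a: AttentionMap, b: AttentionMap, eps: float = 1e-8) -> float:
    """Soft IoU of two equally sized maps, reduced to a scalar.

    Assumption: intersection and union are summed over all spatial
    positions before dividing.
    """
    inter = 0.0
    total = 0.0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            inter += x * y  # element-wise product (soft intersection)
            total += x + y
    union = total - inter  # A_i + A_j - A_i * A_j, summed over positions
    return inter / max(union, eps)


# Toy 2x2 maps with values in [0, 1]; not taken from a real model.
pink = [[1.0, 1.0], [0.0, 0.0]]
dress = [[1.0, 0.0], [0.0, 0.0]]
lilies = [[0.0, 0.0], [1.0, 1.0]]

print(soft_iou(pink, pink))    # identical maps
print(soft_iou(pink, dress))   # partial overlap
print(soft_iou(pink, lilies))  # disjoint maps
```

Because the product and sum are both smooth in the map values, this quantity stays differentiable, which is what makes it usable as a loss on continuous attention maps rather than on thresholded masks.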

AttriLoss objective

Let \(P^+\) be positive attribute-object pairs that should co-occur spatially, and \(P^-\) be negative pairs that should not. The loss is:

\begin{align} \mathcal{L}_{\text{Attri}} = -\sum_{(i,j) \in P^+} \text{IoU}(\mathbf{A}_i, \mathbf{A}_j) + \sum_{(i,j) \in P^-} \text{IoU}(\mathbf{A}_i, \mathbf{A}_j) \end{align}
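To make the sign convention concrete, the following sketch evaluates the loss on toy attention maps. It is an illustration under stated assumptions: maps are keyed by token string rather than token index, the IoU is reduced to a scalar by summing over positions, and the names `soft_iou`, `attri_loss`, and the 1x4 maps are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

AttentionMap = List[List[float]]
Pair = Tuple[str, str]  # (attribute token, object token)


def soft_iou(a: AttentionMap, b: AttentionMap, eps: float = 1e-8) -> float:
    """Scalar soft IoU (assumed reduction: sum over spatial positions)."""
    inter = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total = sum(x + y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / max(total - inter, eps)


def attri_loss(
    maps: Dict[str, AttentionMap],
    positive_pairs: Sequence[Pair],
    negative_pairs: Sequence[Pair],
) -> float:
    """Negative IoU over P+ plus IoU over P-; lower is better."""
    pos = sum(soft_iou(maps[a], maps[o]) for a, o in positive_pairs)
    neg = sum(soft_iou(maps[a], maps[o]) for a, o in negative_pairs)
    return -pos + neg


P_POS = [("pink", "dress"), ("white", "lilies")]
P_NEG = [("pink", "lilies"), ("white", "dress")]

# Toy 1x4 maps: the dress occupies the left half, the lilies the right half.
well_bound = {
    "pink": [[1.0, 1.0, 0.0, 0.0]], "dress": [[1.0, 1.0, 0.0, 0.0]],
    "white": [[0.0, 0.0, 1.0, 1.0]], "lilies": [[0.0, 0.0, 1.0, 1.0]],
}
# Same layout, but "pink" leaks one cell into the lilies region.
leaky = dict(well_bound, pink=[[1.0, 1.0, 1.0, 0.0]])

print(attri_loss(well_bound, P_POS, P_NEG))
print(attri_loss(leaky, P_POS, P_NEG))
```

The leaky configuration scores a higher loss for two reasons at once: the positive term for (*pink*, *dress*) shrinks and the negative term for (*pink*, *lilies*) becomes non-zero, so the gradient pushes in a consistent direction.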

The latent code \(\mathbf{z}_t\) is updated via gradient descent:

\begin{align} \mathbf{z}'_t \leftarrow \mathbf{z}_t - \nabla_{\mathbf{z}_t} \mathcal{L}_{\text{Attri}} \end{align}

This update is applied during the early denoising timesteps (steps 1–25 out of 50), which are critical for establishing structure and semantic content including colors and textures. AttriLoss integrates as a plug-and-play module with Vanilla SDXL, ConsiStory, and StoryDiffusion: no architectural modifications or retraining required.
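The schedule described above can be sketched as a sampling loop that applies the latent update only in the first 25 of 50 steps. Everything model-specific here is a stand-in: `grad_fn` and `denoise_fn` are hypothetical callables replacing the gradient of \(\mathcal{L}_{\text{Attri}}\) and the diffusion model's denoising step, the `step_size` parameter is an assumption (the update as written corresponds to a value of 1), and applying the update before the denoising step is also an assumption.

```python
from typing import Callable, List

Latent = List[float]  # toy stand-in for the latent tensor z_t


def sample_with_early_guidance(
    z: Latent,
    grad_fn: Callable[[Latent, int], Latent],
    denoise_fn: Callable[[Latent, int], Latent],
    num_steps: int = 50,
    guided_steps: int = 25,
    step_size: float = 1.0,
) -> Latent:
    """Run num_steps denoising steps, updating z by gradient descent
    on the guidance loss during the first guided_steps steps only."""
    for step in range(1, num_steps + 1):
        if step <= guided_steps:
            grad = grad_fn(z, step)
            # z' <- z - step_size * grad
            z = [zi - step_size * gi for zi, gi in zip(z, grad)]
        z = denoise_fn(z, step)
    return z


# Toy stand-ins so the loop can run without a diffusion model.
target = [0.5, -1.0]
guided: List[int] = []  # records which steps received a latent update


def toy_grad(z: Latent, step: int) -> Latent:
    """Gradient of the toy loss 0.5 * ||z - target||^2."""
    guided.append(step)
    return [zi - ti for zi, ti in zip(z, target)]


def toy_denoise(z: Latent, step: int) -> Latent:
    """Identity placeholder for the model's denoising step."""
    return list(z)


z_final = sample_with_early_guidance(
    [1.5, 0.0], toy_grad, toy_denoise, step_size=0.5
)
print(guided[0], guided[-1], len(guided))
print(z_final)
```

In a real pipeline the gradient would come from automatic differentiation through the cross-attention layers, and the wrapper leaves the denoiser itself untouched, which is consistent with the plug-and-play claim.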

**Figure 5.** Attention maps of ConsiStory without and with AttriLoss. The baseline shows ambiguous spatial overlaps where the tokens *pink* and *lilies* attend to the same regions, resulting in pink roses. With AttriLoss, the attention maps for attribute-object pairs sharpen into distinct regions (*pink* and *lilies* no longer overlap), achieving correct spatial localization.

Qualitative Results

**Figure 6.** Qualitative results of ConsiStory with and without AttriLoss. With ConsiStory alone, character consistency is maintained but fine-grained attributes are incorrectly bound: the *white lilies* are rendered as *pink roses and white lilies* (scene 1), and the *red umbrella* appears partially blue (scene 2). With AttriLoss, attribute-object associations are properly grounded while character consistency is preserved.

**Figure 7.** Qualitative results of StoryDiffusion with and without AttriLoss. Without AttriLoss, attributes such as *grey coat* (scene 2), *yellow coat* (scene 3), and *beige jacket* (scene 4) are not correctly rendered. With AttriLoss, compositional attributes are faithfully realized while character consistency is preserved across scenes.

**Figure 8.** Attribute realization across diverse stories using ConsiStory alone (top) and ConsiStory with AttriLoss (bottom), across varied artistic styles: Pixar, cartoon, oil painting, photo, watercolor. AttriLoss corrects attribute-object binding failures: the peacock's *red velvet capelet* (1), Dr. Barkley's *glasses* (2), the *yellow flag* on the raft (3), Luke's *green hoodie* (4), and Oliver's *green bike* (5).

Quantitative Results

We evaluate with four metrics: VQA-Score (attribute realization via VQA), CLIP-T (image-text alignment), CLIP-I (cross-scene character consistency), and DreamSim (perceptual quality). Adding AttriLoss improves VQA-Score, CLIP-I, and DreamSim for all three baselines it is applied to; CLIP-T improves for Vanilla SDXL and ConsiStory and drops slightly for StoryDiffusion (0.3912 to 0.3874). Notably, CLIP-I does not degrade after adding AttriLoss, indicating that attribute grounding does not compromise cross-scene consistency.

Method	VQA-Score ↑	CLIP-T ↑	CLIP-I ↑	DreamSim ↑
1Prompt1Story	0.8117	0.3816	0.8410	0.6929
Vanilla SDXL	0.7957	0.3696	0.8188	0.6760
+ AttriLoss	0.8225	0.3775	0.8517	0.7170
StoryDiffusion	0.8363	0.3912	0.8301	0.6925
+ AttriLoss	0.8636	0.3874	0.8553	0.7215
ConsiStory	0.8136	0.3871	0.8494	0.7326
+ AttriLoss	0.8490	0.3909	0.8667	0.7555

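Table 1. Quantitative comparison on integrating AttriLoss with prior visual storytelling methods. Each + AttriLoss row reports the baseline in the row above it with AttriLoss added. ConsiStory + AttriLoss achieves the best CLIP-I and DreamSim, StoryDiffusion + AttriLoss the best VQA-Score, and StoryDiffusion the best CLIP-T.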
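The per-metric change from adding AttriLoss can be checked with simple arithmetic on the Table 1 values. The snippet below only copies the reported numbers and subtracts them; the names `METRICS`, `TABLE`, and `deltas` are introduced here for illustration.

```python
METRICS = ("VQA-Score", "CLIP-T", "CLIP-I", "DreamSim")

# (baseline scores, + AttriLoss scores), copied from Table 1.
TABLE = {
    "Vanilla SDXL": (
        (0.7957, 0.3696, 0.8188, 0.6760),
        (0.8225, 0.3775, 0.8517, 0.7170),
    ),
    "StoryDiffusion": (
        (0.8363, 0.3912, 0.8301, 0.6925),
        (0.8636, 0.3874, 0.8553, 0.7215),
    ),
    "ConsiStory": (
        (0.8136, 0.3871, 0.8494, 0.7326),
        (0.8490, 0.3909, 0.8667, 0.7555),
    ),
}


def deltas(table):
    """Return {method: {metric: score with AttriLoss minus baseline score}}."""
    return {
        method: {
            metric: round(after - before, 4)
            for metric, before, after in zip(METRICS, base, plus)
        }
        for method, (base, plus) in table.items()
    }


for method, row in deltas(TABLE).items():
    print(method, row)
```

Every difference is positive except CLIP-T for StoryDiffusion, and the VQA-Score gains fall between roughly 0.027 and 0.035 across the three baselines.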

BibTeX

@InProceedings{sreenivas2026attristory,
  author    = {Sreenivas, Manogna and Kumar, Rohit and Biswas, Soma},
  title     = {AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models},
  booktitle = {CVPR Workshops},
  year      = {2026}
}