AttentionViz: A Global View of Transformer Attention

Catherine Yeh1, Yida Chen1, Aoyu Wu1, Cynthia Chen1, Fernanda Viégas1,2, Martin Wattenberg1,2

1Harvard University, 2Google Research

Paper · Demo · GitHub Repo

What is transformer attention?

As the models behind popular systems such as ChatGPT and Bing AI, transformers are taking the world by storm. The transformer architecture is now used across NLP and computer vision, achieving significant performance improvements on a wide range of tasks.

Key to transformers' success is their characteristic self-attention mechanism, which allows these models to learn rich, contextual relationships between elements of a sequence. For example, in the sentence "the sky is blue," we might expect high attention between the words "sky" and "blue," and lower attention between "the" and "blue."

To compute attention, we first transform each input element (e.g., a word in a sentence or patch of an image) into a corresponding query and key vector. At a high level, the attention between two words or image patches can be viewed as a function of the dot product between the corresponding query and key. As such, attention is often visualized with a bipartite graph representation (in the language case), where the opacity of each edge indicates the attention strength between query-key pairs.
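Concretely, a single attention head computes a softmax over scaled query-key dot products and uses the result to mix value vectors. Below is a minimal NumPy sketch of this computation, included as a toy illustration rather than the implementation of any particular model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention; Q, K, V each have shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                               # query-key dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # row-wise softmax
    return weights @ V, weights                                 # outputs + attention matrix

# Toy usage: 4 tokens with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))  # each row sums to 1: one query's attention over all keys
```

The attn matrix here is exactly what the bipartite-graph visualizations draw: entry (i, j) is the attention weight from query token i to key token j.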

This approach works for visualizing attention in single input sequences, but what happens if we have hundreds or even thousands of inputs to examine?

Presenting a "global view" of attention...

To address this challenge of analyzing and synthesizing attention patterns at scale, we propose a global view of transformer attention. We create this global view by designing a new visualization technique and applying it to build an interactive tool for exploring attention in transformer models.

Technique

For each attention head in a transformer, we transform a set of input sequences into their corresponding query and key vectors, creating a joint embedding in a high-dimensional space. In this joint embedding space, distance turns out to be a reasonable proxy for attention weights: query-key pairs with higher attention weights will generally be closer together. Using methods such as t-SNE or UMAP, we visualize this embedding in two or three dimensions, providing a "global" view of attention patterns.

Please see our paper for more details about this technique, including input normalization.
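As a rough sketch of the idea (not the paper's exact pipeline), suppose we have already collected a head's query vectors Q and key vectors K over many input sequences. The normalization below is a simplified placeholder for the scheme described in the paper, and the projection uses the umap-learn package, though t-SNE works analogously:

```python
import numpy as np
import umap  # pip install umap-learn

def joint_query_key_embedding(Q, K, n_components=2):
    """Project one head's queries and keys into a shared low-dimensional space.

    Q, K: arrays of shape (n_tokens, d_head), collected over many input sequences.
    """
    # Placeholder normalization: center each set and match average vector norms,
    # so that neither queries nor keys dominate the joint layout.
    Qc = Q - Q.mean(axis=0)
    Kc = K - K.mean(axis=0)
    Kc *= np.linalg.norm(Qc, axis=1).mean() / np.linalg.norm(Kc, axis=1).mean()

    joint = np.vstack([Qc, Kc])                      # queries first, then keys
    coords = umap.UMAP(n_components=n_components).fit_transform(joint)
    labels = np.array(["query"] * len(Qc) + ["key"] * len(Kc))
    return coords, labels                            # 2-D/3-D points + query/key labels
```

Plotting coords colored by labels gives one scatter plot per attention head, where nearby query-key pairs are generally the pairs with higher attention.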

Tool

Using this joint embedding technique, we created AttentionViz, an interactive tool for visualizing self-attention patterns at scale. AttentionViz allows attention exploration at multiple levels for both language and vision transformers. We currently support BERT (language), GPT-2 (language), and ViT (vision). Some example inputs to AttentionViz are shown below (in reality, we use many sentences and images to form each joint query-key embedding!).

AttentionViz provides three main interactive views:

Matrix View

View all the attention heads (i.e., patterns) in a transformer at once

Single View

Explore a single attention head in closer detail

Image/Sentence View

Visualize attention patterns within a single sentence or image
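Under the hood, views like these require per-head query and key vectors from each model. As a hedged illustration of how such vectors can be collected for BERT, here is a sketch using forward hooks with the Hugging Face transformers library (module names follow its BertModel layout; this is not necessarily the exact pipeline behind AttentionViz):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

captured = {}
def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()        # shape: (batch, seq_len, hidden_size)
    return hook

layer, head = 3, 9                              # e.g., the "spiral" head discussed below
attn_module = model.encoder.layer[layer].attention.self
attn_module.query.register_forward_hook(save_output("q"))
attn_module.key.register_forward_hook(save_output("k"))

with torch.no_grad():
    model(**tokenizer("the sky is blue", return_tensors="pt"))

# Split the hidden dimension into heads and keep the one we care about.
n_heads, d_head = attn_module.num_attention_heads, attn_module.attention_head_size
Q = captured["q"].view(1, -1, n_heads, d_head)[0, :, head, :]   # (seq_len, d_head)
K = captured["k"].view(1, -1, n_heads, d_head)[0, :, head, :]
```

Repeating this over many sentences and stacking the resulting Q and K arrays yields the kind of input used by the joint embedding sketch above.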

Example findings from AttentionViz

With AttentionViz, we uncovered several interesting insights about self-attention in language and vision transformers. A few examples are shared below; for more details, please see our paper.

Hue/brightness specializations

For ViT, we were curious whether any visual attention heads specialize in color- or brightness-based patterns. To test this, we created a dataset of synthetic color/brightness gradient images and loaded the resulting query and key tokens into AttentionViz.

As a result, we discovered one head (layer 0 head 10) that aligns the black-and-white image tokens based on brightness, and another head (layer 1 head 11) that aligns colorful patches based on hue. Our dataset contains color and brightness gradient images in all orientations, and we see similar patches cluster together in the joint embedding space regardless of their position in the original images. The attention heatmap in Image View confirms these findings; tokens pay the most attention to other tokens with the same color or brightness.
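For readers who want to try something similar, a gradient image of the kind described above can be generated in a few lines. This is a hypothetical sketch; the paper's actual dataset may differ in resolution, orientation sampling, and color handling:

```python
import colorsys
import numpy as np
from PIL import Image

def gradient_image(size=224, mode="hue", angle_deg=0):
    """Linear gradient across the image, in brightness (grayscale) or hue,
    rotated to an arbitrary orientation."""
    yy, xx = np.mgrid[0:size, 0:size] / (size - 1)
    theta = np.deg2rad(angle_deg)
    ramp = np.cos(theta) * xx + np.sin(theta) * yy        # ramp along the chosen direction
    ramp = (ramp - ramp.min()) / (ramp.max() - ramp.min())
    if mode == "brightness":
        rgb = np.stack([ramp] * 3, axis=-1)               # black-to-white gradient
    else:                                                 # hue gradient, full saturation
        rgb = np.array([[colorsys.hsv_to_rgb(h, 1.0, 1.0) for h in row] for row in ramp])
    return Image.fromarray((rgb * 255).astype(np.uint8))

# e.g., gradient_image(mode="hue", angle_deg=45).save("hue_45.png")
```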

Global traces of attention

While exploring BERT, we observed some attention heads with unique, identifiable shapes. For example, in early model layers, we noticed some spiral-shaped plots (e.g., layer 3 head 9). Coloring by token position reveals a positional trend: token position increases as we move from the outside to the inside of the spiral. Sentence View confirms that this is a "next-token" attention pattern, in which each token attends primarily to the token that immediately follows it.

Similarly, we noticed that plots with "small clumps" also encode positional patterns (e.g., layer 2 head 0). This can be verified by coloring each token by its position mod 5, which produces a more discrete positional color scheme and makes it easier to see query-key relationships based on small offsets in sentence position. The main difference between "spirals" and "clumps" appears to be whether tokens attend selectively to tokens exactly one position away or to tokens at several different possible offsets.
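One way to quantify these positional patterns, independent of the visualization, is to measure how much attention mass a head places at each query-key offset. The sketch below is a diagnostic illustration, not part of AttentionViz; it assumes an attention matrix for a single head (e.g., obtained via output_attentions=True in Hugging Face transformers):

```python
import numpy as np

def offset_profile(attn, max_offset=5):
    """Average attention mass that query i places on key i + k, for each offset k.

    attn: (seq_len, seq_len) attention matrix for one head, rows = queries.
    A "next-token" (spiral) head concentrates mass at k = +1, while "clump"-style
    heads spread mass over several small offsets.
    """
    n = attn.shape[0]
    profile = {}
    for k in range(-max_offset, max_offset + 1):
        idx = np.arange(max(0, -k), min(n, n - k))   # queries i for which i + k is valid
        profile[k] = float(attn[idx, idx + k].mean()) if len(idx) else 0.0
    return profile
```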

Induction heads in BERT?

We were also curious whether AttentionViz could be used to explore induction heads. At a high level, induction heads perform prefix matching and copying on repeated sequences to help language transformers perform in-context learning. For a more comprehensive overview of induction heads and in-context learning, please check out this Anthropic article.

To our knowledge, induction heads have only been studied in unidirectional models like GPT-2, but with AttentionViz, we also discovered potential induction head behavior in BERT, which uses a bidirectional attention mechanism. One head, layer 8 head 2, appears to demonstrate standard copying behavior, where a token A (e.g., - ) pays attention to the token B (e.g., 8 or 10) that came before it in a previous occurrence. The example below also shows how each A can attend to multiple Bs. Since BERT is bidirectional, it can perform copying in both directions as well.

Another head, layer 9 head 9, seems to be a potential "reverse" induction head. In this case, a token A pays attention to the token B that came after it in another occurrence. More work is needed to validate these observations, but our findings support the possibility of induction heads and in-context learning in bidirectional transformers like BERT.
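To make these patterns measurable, one could score how much attention each token pays to the positions just before or just after other occurrences of the same token. The sketch below is a rough diagnostic under that assumption, not the analysis used in the paper; attn is one head's attention matrix (rows = queries) and token_ids comes from the tokenizer:

```python
import numpy as np

def copy_matching_scores(attn, token_ids):
    """For each query token, sum attention to the tokens immediately BEFORE and
    AFTER other occurrences of the same token, then average over query tokens.
    High "before" scores suggest induction-style prefix matching; high "after"
    scores suggest the "reverse" pattern described above."""
    n = len(token_ids)
    before = after = 0.0
    count = 0
    for i in range(n):
        matches = [j for j in range(n) if j != i and token_ids[j] == token_ids[i]]
        if not matches:
            continue
        count += 1
        before += sum(attn[i, j - 1] for j in matches if j - 1 >= 0)
        after += sum(attn[i, j + 1] for j in matches if j + 1 < n)
    return {"before": float(before) / max(count, 1),
            "after": float(after) / max(count, 1)}
```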

Acknowledgments: We would like to thank Naomi Saphra for suggesting the color-by-token-frequency option for language transformers. We are also grateful to all the participants in our user interviews for their time, feedback, and invaluable insights.