AttentionViz Docs

What is transformer attention?

As the models behind popular systems such as ChatGPT and BingAI, transformers are taking the world by storm. The transformer neural network architecture has been used in NLP and computer vision settings, achieving significant performance improvements across various tasks.

Key to transformers' success is their characteristic self-attention mechanism, which allows these models to learn rich, contextual relationships between elements of a sequence. For example, in the sentence "the sky is blue," we might expect high attention between the words "sky" and "blue," and lower attention between "the" and "blue."

To compute attention, we first transform each input element (e.g., a word in a sentence or patch of an image) into a corresponding query and key vector. At a high level, the attention between two words or image patches can be viewed as a function of the dot product between the corresponding query and key. As such, attention is often visualized with a bipartite graph representation (in the language case), where the opacity of each edge indicates the attention strength between query-key pairs.

This approach works for visualizing attention in single input sequences, but what happens if we have hundreds or even thousands of inputs to examine?

Presenting a "global view" of attention...

To address this challenge of analyzing and synthesizing attention patterns at scale, we propose a global view of transformer attention. We create this global view by designing a new visualization technique and applying it to build an interactive tool for exploring attention in transformer models.

Technique

For each attention head in a transformer, we transform a set of input sequences into their corresponding query and key vectors, creating a joint embedding in a high-dimensional space. In this joint embedding space, distance turns out to be a reasonable proxy for attention weights: query-key pairs with higher attention weights will generally be closer together. Using methods such as t-SNE or UMAP, we visualize this embedding in two or three dimensions, providing a "global" view of attention patterns.

Please see our paper for more details about this technique, including input normalization.

Tool

Using this joint embedding technique, we created AttentionViz, an interactive tool for visualizing self-attention patterns at scale. AttentionViz allows attention exploration at multiple levels for both language and vision transformers. We currently support BERT (language), GPT-2 (language), and ViT (vision). Some example inputs to AttentionViz are shown below (in reality, we use many sentences and images to form each joint query-key embedding!).

Example input sentences and images to AttentionViz

The three main interactive views provided by AttentionViz include:

Matrix View

View all the attention heads (i.e., patterns) in a transformer at once Read More

Single View

Explore a single attention head in closer detail Read More

Image/Sentence View

Visualize attention patterns within a single sentence or image Read More

The initial view in AttentionViz is Matrix View.

Each cell in the matrix corresponds to the query-key joint embedding for a single attention head. Rows correspond to model layers, moving from earlier layers at the top of the interface to later layers at the bottom.
Explore different transformer models and projection methods via the dropdown menus. Click the icon to the right of each menu for more information about the available options.
There is another dropdown menu to switch between different color encodings. The current color scheme is displayed via the legend bars below the dropdown. Scatterplots can also be viewed in 2D or 3D.
Search for tokens via the search bar. Results will be highlighted for each attention head in the matrix (see right). For language transformers, you can search for words (e.g., cat, april), and for vision transformers, you can search for objects (e.g., person, bg).
Zoom into a specific attention head with the dropdown menu or by clicking on any plot in the matrix. This will open Single View.

Example results from global search in Matrix View

As in Matrix View, users can switch between different models, projection methods, color encodings, etc. in Single View. Token searches can be conducted at the single head level as well.

Single View shows the query-key joint embedding for a particular attention head in more detail. Open Sentence/Image View by clicking on a point in the scatterplot.
Turn on/off token labels to see what each point represents; more labels appear as you zoom in (see right). When Sentence/Image View is active, attention lines can also be projected onto the scatterplots to visualize the top 2 strongest attention connections between query-key pairs.
For language transformers, query/key points can be scaled by their corresponding embedding norms.
In addition to the "Zoom to Layer" feature at the top of the screen, navigate to adjacent attention heads via the directional controls. The up/down arrows will move up or down one model layer, and the left/right arrows will move to the previous/next head in the current layer.
Return to Matrix View by clicking the "view all heads" button.

Example of a semantic relationship between query-key pairs as elucidated by token labels

When the user clicks on a point in Single View, Sentence/Image View will open in the right sidebar of the interface.

Sentence View (Language Transformers)

View the fine-grained attention patterns in a single sentence with a bipartite graph visualization. The clicked token is highlighted, and the opacity of the lines connecting query-key pairs signifies their corresponding attention strengths. These connections are mirrored by the yellow attention lines on the main scatterplot. Hovering on a token in Sentence View highlights token-specific attention lines.
Filter out noise from special tokens (i.e., [cls] and [sep] in BERT or the first token in GPT-2) by toggling the checkboxes. Clicking on query/key tokens in the bipartite visualization will also toggle the corresponding attention lines on/off. Reset changes with the "reset" button.
Explore aggregate attention patterns (averaged across all sentences) in the current head with the aggregate sentence visualization. This view can be optionally hidden.
Return to Single View by pressing the "clear selection" button. This will close Sentence View.

Image View (Vision Transformers)

View the fine-grained attention patterns in a single image. The "attention to selected token" visualization is an attention heatmap for the selected image patch, where opacity indicates attention strength.
Explore other image-level visualizations by using the dropdown menu (see right). Selecting "highest attention to each token" overlays arrows on top of the original image to indicate the strongest attention connection between a starting and destination patch. Selecting "all high attention flows" shows all the strong attention connections (> 0.1) below the original image, offering a more comprehensive view of attention; both opacity and line thickness are indicators of attention strength. Attention to self is indicated by the circle arrow, and attention to the [cls] token is indicated by a square icon.
Return to Single View by pressing the "clear selection" button. This will close Image View.

Different image-level visualizations of attention

Example findings from AttentionViz

With AttentionViz, we uncovered several interesting insights about self-attention in language and vision transformers. A few examples are shared below; for more details, please see our paper.

Hue/brightness specializations

For ViT, we were curious if any visual attention heads specialize in color or brightness based patterns. To test this, we created a dataset with synthetic color/brightness gradient images and loaded the resultant query and key tokens into AttentionViz.

As a result, we discovered one head (layer 0 head 10) that aligns the black-and-white image tokens based on brightness, and another head (layer 1 head 11) that aligns colorful patches based on hue. Our dataset contains color and brightness gradient images in all orientations, and we see similar patches cluster together in the joint embedding space regardless of their position in the original images. The attention heatmap in Image View confirms these findings; tokens pay the most attention to other tokens with the same color or brightness.

Global traces of attention

While exploring BERT, we observed some attention heads with unique, identifiable shapes. For example, in early model layers, we noticed some spiral-shaped plots (e.g., layer 3 head 9). Coloring by token position reveals a positional trend, where token position increases as we move from the outside to inside of the spiral. Sentence View confirms that there is a "next-token" attention pattern.

Similarly, we noticed that plots with "small clumps" also encode positional patterns (e.g., layer 2 head 0). This can be verified by coloring each token by its position mod 5, which forms a more discrete positional color scheme and can be helpful in visualizing relationships between query-key pairs based on small offsets in sentence position. The main difference between "spirals" and "clumps" appears to be whether tokens attend selectively to others one position away vs. at several different possible positions.

Induction heads in BERT?

We were also curious whether AttentionViz could be used to explore induction heads. At a high level, induction heads perform prefix matching and copying on repeated sequences to help language transformers perform in-context learning. For a more comprehensive overview of induction heads and in-context learning, please check out this Anthropic article.

To our knowledge, induction heads have only been studied in unidirectional models like GPT-2, but with AttentionViz, we also discovered potential induction head behavior in BERT, which uses a bidirectional attention mechanism. One head, layer 8 head 2, appears to demonstrate standard copying behavior, where a token A (e.g., - ) pays attention to the token B (e.g., 8 or 10) that came before it in a previous occurence. The example below also shows how each A can attend to multiple Bs. Since BERT is bidirectional, it can perform copying in both directions as well.

Another head, layer 9 head 9, seems to be a potential "reverse" induction head. In this case, a token A pays attention to the token B that came after it in another occurence. More work is needed to validate these observations, but our findings support the possibility of induction heads and in-context learning in bidirectional transformers like BERT.