As the models behind popular systems such as ChatGPT and Bing AI, transformers are taking the world by storm. The transformer neural network architecture has been used in NLP and computer vision settings, achieving significant performance improvements across various tasks.
Key to transformers' success is their characteristic
self-attention mechanism, which allows these models to learn
rich, contextual relationships between elements of a sequence. For
example, in the sentence "the sky is blue," we might expect high attention between the words "sky" and "blue," and lower attention
between "the" and "blue."
To compute attention, we first transform each input element (e.g., a
word in a sentence or patch of an image) into a corresponding
query and key vector. At a
high level, the attention between two words or image patches can be
viewed as a function of the dot product between the corresponding
query and key. As such, attention is often visualized with a bipartite
graph representation
(in the language case), where the opacity of each edge indicates the
attention strength between query-key pairs.
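To make this concrete, here is a minimal sketch of scaled dot-product attention for a toy four-token sentence, written in NumPy. The embeddings and projection matrices are random stand-ins rather than real model weights, but the query-key dot product and softmax mirror the computation described above.

```python
import numpy as np

# Toy illustration of scaled dot-product attention (random weights, not a real model).
# Four tokens ("the", "sky", "is", "blue"), each with a d_model-dimensional embedding.
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 8, 4
x = rng.normal(size=(n_tokens, d_model))       # token embeddings

W_q = rng.normal(size=(d_model, d_head))       # query projection (random here)
W_k = rng.normal(size=(d_model, d_head))       # key projection (random here)

queries = x @ W_q                              # (n_tokens, d_head)
keys = x @ W_k                                 # (n_tokens, d_head)

# Attention is a function of the query-key dot products, normalized with a softmax.
scores = queries @ keys.T / np.sqrt(d_head)    # (n_tokens, n_tokens)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1

print(np.round(weights, 2))  # row i = how much token i attends to every other token
```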
The bipartite representation works well for visualizing attention in a single input sequence, but what happens if we have hundreds or even thousands of inputs to examine?
Presenting a "global view" of attention...
To address this challenge of analyzing and synthesizing attention
patterns at scale, we propose a global view of transformer
attention. We create this global view by designing a new
visualization technique
and applying it to build an
interactive tool for exploring
attention in transformer models.
Technique
For each attention head
in a transformer, we transform a set of input sequences into their
corresponding query and key vectors, creating a joint embedding in
a high-dimensional space. In this joint embedding space, distance
turns out to be a reasonable proxy for attention weights:
query-key pairs with higher attention weights will generally be
closer together. Using methods such as t-SNE
or UMAP, we visualize this embedding in two or three dimensions,
providing a "global" view of attention patterns.
Please see
our paper
for more details about this technique, including input
normalization.
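For readers who want to experiment, below is a rough sketch of how such a joint embedding might be built; it is not the exact AttentionViz pipeline. It pulls the per-head query and key vectors out of Hugging Face's bert-base-uncased, stacks them into one point set, and projects them with UMAP. The layer/head choice is arbitrary, only two example sentences are used (the tool uses many), and the input normalization from the paper is omitted.

```python
# Sketch only: joint query-key embedding for one BERT head, projected with UMAP.
import torch
import umap
from transformers import AutoTokenizer, BertModel

layer, head = 3, 9  # illustrative choice of attention head
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["the sky is blue", "the cat sat on the mat"]  # use many more in practice
queries, keys = [], []
with torch.no_grad():
    for s in sentences:
        inputs = tok(s, return_tensors="pt")
        hidden = model(**inputs).hidden_states[layer][0]   # input to this layer, (seq, d_model)
        attn = model.encoder.layer[layer].attention.self   # has .query / .key projections
        n_heads, d_head = attn.num_attention_heads, attn.attention_head_size
        q = attn.query(hidden).view(-1, n_heads, d_head)[:, head]  # (seq, d_head)
        k = attn.key(hidden).view(-1, n_heads, d_head)[:, head]
        queries.append(q)
        keys.append(k)

points = torch.cat(queries + keys).numpy()        # queries and keys in one joint space
coords = umap.UMAP(n_components=2).fit_transform(points)
print(coords.shape)                               # one 2D point per query/key vector
```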
Tool
Using this joint embedding technique, we created
AttentionViz, an interactive tool for visualizing self-attention patterns at
scale. AttentionViz allows attention exploration at multiple
levels for both language and vision transformers. We currently
support BERT
(language), GPT-2
(language), and ViT
(vision). Some example inputs to AttentionViz are shown below (in
reality, we use many sentences and images to form each joint
query-key embedding!).
The three main interactive views provided by AttentionViz are:
Matrix View: view all the attention heads (i.e., patterns) in a transformer at once.
Single View: explore a single attention head in closer detail.
Image/Sentence View: visualize attention patterns within a single sentence or image.
The initial view in AttentionViz is Matrix View.
Each cell in the matrix corresponds to the query-key
joint embedding for a single attention head. Rows
correspond to model layers, moving from earlier layers
at the top of the interface to later layers at the
bottom.
Explore different transformer models and
projection methods via the dropdown menus. Click
the
icon to the right of each menu for more information
about the available options.
There is another dropdown menu to switch between
different
color encodings. The current color scheme is
displayed via the legend bars below the dropdown.
Scatterplots can also be viewed in 2D or
3D.
Search for tokens via the search bar. Results will be highlighted for each attention head in the matrix (see right). For language transformers, you can search for words (e.g., cat, april), and for vision transformers, you can search for objects (e.g., person, bg).
Zoom into a specific attention head with the dropdown
menu or by clicking on any plot in the matrix. This will
open
Single View.
As in Matrix View, users can switch between different models,
projection methods, color encodings, etc. in Single View.
Token searches can be conducted at the single head level as
well.
Single View shows the query-key joint embedding for a
particular attention head in more detail. Open
Sentence/Image View by clicking on a point in the
scatterplot.
Turn on/off token labels to see what each point
represents; more labels appear as you zoom in (see
right). When Sentence/Image View is active,
attention lines can also be projected onto the
scatterplots to visualize the top 2 strongest attention
connections between query-key pairs.
For language transformers, query/key points can
be scaled by their corresponding embedding norms.
In addition to the "Zoom to Layer" feature at the top of
the screen, navigate to adjacent attention heads via the
directional controls. The
up/down
arrows will move up or down one model layer, and the
left/right arrows will
move to the previous/next head in the current layer.
Return to Matrix View by clicking the "view all
heads" button.
When the user clicks on a point in Single View, Sentence/Image
View will open in the right sidebar of the interface.
Sentence View (Language Transformers)
View the fine-grained attention patterns in a single
sentence with a bipartite graph visualization. The clicked
token is highlighted, and the opacity of the lines
connecting query-key pairs signifies their corresponding
attention strengths. These connections are mirrored by the
yellow attention lines on the main scatterplot. Hovering on
a token in Sentence View highlights token-specific attention
lines.
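For illustration, here is a minimal matplotlib sketch of this kind of bipartite view, with queries on the left, keys on the right, and edge opacity proportional to attention. The attention matrix is random, standing in for a real head's weights; it is not the tool's implementation.

```python
# Minimal bipartite attention graph: edge opacity encodes attention strength.
import numpy as np
import matplotlib.pyplot as plt

tokens = ["the", "sky", "is", "blue"]
n = len(tokens)
rng = np.random.default_rng(1)
attn = rng.dirichlet(np.ones(n), size=n)      # rows sum to 1, like softmaxed attention

fig, ax = plt.subplots(figsize=(3, 4))
for i, t in enumerate(tokens):                # queries on the left, keys on the right
    ax.text(0.0, -i, t, ha="right", va="center")
    ax.text(1.0, -i, t, ha="left", va="center")
for i in range(n):
    for j in range(n):
        ax.plot([0.05, 0.95], [-i, -j], color="tab:blue", alpha=float(attn[i, j]))
ax.set_xlim(-0.25, 1.25)
ax.axis("off")
plt.show()
```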
Filter out noise from special tokens (i.e.,
[cls] and
[sep] in BERT or the first
token in GPT-2) by toggling the checkboxes. Clicking on
query/key tokens in the bipartite visualization will also
toggle the corresponding attention lines on/off. Reset
changes with the "reset" button.
Explore aggregate attention patterns (averaged across
all sentences) in the current head with the aggregate
sentence visualization. This view can be optionally hidden.
Return to Single View by pressing the "clear
selection" button. This will close Sentence View.
Image View (Vision Transformers)
View the fine-grained attention patterns in a single
image. The
"attention to selected token" visualization is an
attention heatmap for the selected image patch, where
opacity indicates attention strength.
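As a rough sketch of how such a heatmap can be computed (not the tool's actual code), the snippet below takes one attention matrix for a ViT-style sequence ([cls] plus a 14 x 14 patch grid), pulls out the row for a chosen patch, and reshapes it back onto the image grid. The attention weights and the patch index are made up for illustration; here we take the selected patch's query row (how much it attends to every other patch), while the corresponding column would give the reverse direction.

```python
# Turn one ViT attention row into a patch-level heatmap (illustrative data only).
import numpy as np
import matplotlib.pyplot as plt

grid = 14                                  # a 224x224 image with 16x16 patches -> 14x14 grid
seq_len = 1 + grid * grid                  # [cls] token + patches
rng = np.random.default_rng(2)
attn = rng.dirichlet(np.ones(seq_len), size=seq_len)   # (seq, seq), rows sum to 1

selected_patch = 42                        # hypothetical patch index (0-based, excluding [cls])
row = attn[1 + selected_patch, 1:]         # attention from that patch to every patch
heatmap = row.reshape(grid, grid)          # back onto the image's patch grid

plt.imshow(heatmap, cmap="gray")           # in the tool, this is drawn as opacity over the image
plt.title("attention to selected token (sketch)")
plt.colorbar()
plt.show()
```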
Explore other image-level visualizations by using the
dropdown menu (see right). Selecting
"highest attention to each token" overlays arrows
on top of the original image to indicate the strongest
attention connection between a starting and destination
patch. Selecting "all high attention flows" shows all the strong attention connections (attention weight > 0.1) below the original image, offering a more comprehensive view of attention; both opacity and line thickness indicate attention strength.
Attention to self is indicated by the circle arrow, and
attention to the
[cls] token is indicated
by a square icon.
Return to Single View by pressing the "clear
selection" button. This will close Image View.
Example findings from AttentionViz
With AttentionViz, we uncovered several interesting insights about
self-attention in language and vision transformers. A few examples are
shared below; for more details, please see
our paper.
Hue/brightness specializations
For ViT, we were curious whether any attention heads specialize in color- or brightness-based patterns. To test this, we created a dataset of synthetic color/brightness gradient images and loaded the resultant query and key tokens into AttentionViz.
We discovered one head (layer 0 head 10) that aligns black-and-white image tokens based on brightness, and another head (layer 1 head 11) that aligns colorful patches based on hue. Our dataset contains
color and brightness gradient images in all orientations, and we
see similar patches cluster together in the joint embedding space
regardless of their position in the original images. The attention
heatmap in Image View confirms these findings; tokens pay the most
attention to other tokens with the same color or brightness.
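For reference, the snippet below generates the general kind of synthetic probe image described above: a left-to-right brightness ramp and a hue sweep. The resolution, orientation, and exact gradients are illustrative choices, not the dataset used in the paper.

```python
# Illustrative brightness- and hue-gradient probe images (not the paper's exact dataset).
import numpy as np
from PIL import Image

size = 224
ramp = np.linspace(0, 1, size)

# Black-to-white brightness gradient (left to right).
gray = np.tile(ramp, (size, 1))
brightness_img = Image.fromarray((np.stack([gray] * 3, axis=-1) * 255).astype(np.uint8))

# Hue gradient (left to right) at full saturation and value, converted from HSV to RGB.
hsv = np.zeros((size, size, 3), dtype=np.uint8)
hsv[..., 0] = (np.tile(ramp, (size, 1)) * 255).astype(np.uint8)   # hue sweep
hsv[..., 1:] = 255                                                # full saturation and value
hue_img = Image.fromarray(hsv, mode="HSV").convert("RGB")

brightness_img.save("brightness_gradient.png")
hue_img.save("hue_gradient.png")
```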
Global traces of attention
While exploring BERT, we observed some attention heads with
unique, identifiable shapes. For example, in early model layers,
we noticed some spiral-shaped plots (e.g.,
layer 3 head 9). Coloring by token
position reveals a positional trend, where token position
increases as we move from the outside to the inside of the spiral.
Sentence View confirms that there is a "next-token" attention
pattern.
Similarly, we noticed that plots with "small clumps" also
encode positional patterns (e.g.,
layer 2 head 0). This can be
verified by coloring each token by its position mod 5,
which forms a more discrete positional color scheme and can be
helpful in visualizing relationships between query-key pairs based
on small offsets in sentence position. The main difference between
"spirals" and "clumps" appears to be whether tokens attend
selectively to others one position away vs. at several different
possible positions.
Induction heads in BERT?
We were also curious whether AttentionViz could be used to explore
induction heads. At a high level, induction heads perform
prefix matching and copying on repeated sequences to
help language transformers perform in-context learning. For a more
comprehensive overview of induction heads and in-context learning,
please check out
this Anthropic article.
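As a rough way to see what "prefix matching" means in terms of attention weights, the heuristic sketch below scores how strongly the later occurrence of a repeated token attends to the token that followed its earlier occurrence. This is a simplified illustration with made-up data, not the analysis used in our paper or the Anthropic work.

```python
# Heuristic "induction-likeness" score for one head's attention matrix (illustrative only).
import numpy as np

def induction_score(tokens, attn):
    """tokens: list of token strings/ids; attn: (seq, seq) attention weights for one head."""
    scores = []
    for i, tok in enumerate(tokens):
        for j in range(i):
            if tokens[j] == tok and j + 1 < len(tokens):
                scores.append(attn[i, j + 1])   # later copy -> token after the earlier copy
    return float(np.mean(scores)) if scores else 0.0

# Toy example with a repeated bigram "sky ... blue"; attention weights are random here.
tokens = ["the", "sky", "is", "blue", "and", "the", "sky", "looks", "blue"]
rng = np.random.default_rng(3)
attn = rng.dirichlet(np.ones(len(tokens)), size=len(tokens))
print(induction_score(tokens, attn))
```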
To our knowledge, induction heads have only been studied in
unidirectional models like GPT-2, but with AttentionViz, we also
discovered potential induction head behavior in
BERT, which uses a bidirectional attention mechanism. One
head, layer 8 head 2, appears to
demonstrate standard copying behavior, where a token
A (e.g.,
- ) pays attention to the token
B (e.g.,
8 or
10) that came before it in a
previous occurrence. The example below also shows how each
A can attend to multiple
Bs. Since BERT is bidirectional, it can
perform copying in both directions as well.
Another head, layer 9 head 9, seems
to be a potential "reverse" induction head. In this case, a token
A pays attention to the token
B that came after it in another
occurrence. More work is needed to validate these observations, but
our findings support the possibility of induction heads and
in-context learning in bidirectional transformers like BERT.
Acknowledgments:
We would like to thank Naomi Saphra for suggesting the color by
token frequency option for language transformers. We are also
grateful to all the participants in our user interviews for their
time, feedback, and invaluable insights.