https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
What about the 1600-dim vectors produced in the middle of the network, say the output of the 12th layer or the 33rd? If we convert them to vocab space, do the results make sense? The answer is yes.
Convert to vocab space:
logits = model_small.unembed(model_small.ln_final(o))
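A minimal numpy sketch of that unembed step, with toy dimensions; `ln_final` and `W_U` here are stand-ins for the model's actual final layer norm and unembedding matrix (the real versions have learned scale/shift parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 8, 16  # toy sizes standing in for e.g. 768 / 50257

def ln_final(o, eps=1e-5):
    # Normalize each residual-stream vector over the model dimension.
    # (The real layer norm also applies learned gamma/beta; omitted here.)
    mu = o.mean(axis=-1, keepdims=True)
    var = o.var(axis=-1, keepdims=True)
    return (o - mu) / np.sqrt(var + eps)

W_U = rng.normal(size=(d_model, d_vocab))  # stand-in unembedding matrix

o = rng.normal(size=(1, 6, d_model))       # residual stream at some layer
logits = ln_final(o) @ W_U                 # project into vocab space
print(logits.shape)                        # (1, 6, 16)
```

The same projection works on the residual stream after any layer, which is what makes the logit-lens plots possible.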
most_recent_S_name_movers_DRAFT.ipynb
z_0.shape = [1, 6, 1280]
Overall, z_0 has a shape of [batch_size, seq_length, embedding_size]. The second dimension (6) represents the seq_length.
logits: [batch_size, sequence_length, vocab_size]
logits[seq_idx, ioi_dataset.word_idx[word][seq_idx]]: ioi_dataset.word_idx[word][seq_idx] retrieves the index of the specific word in the current sequence using the word_idx attribute of ioi_dataset.
E.g., if the subject S is “Mary”, this finds the index of “Mary”, giving the logit of “Mary” for that prompt sequence.
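The indexing above can be sketched with toy tensors; here `word_idx` is a made-up mimic of the `ioi_dataset.word_idx` lookup, and the positions are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
batch, seq_len, d_vocab = 2, 6, 16
logits = rng.normal(size=(batch, seq_len, d_vocab))

# Mimic ioi_dataset.word_idx: for each word role, the token position
# of that word in each sequence (toy values).
word_idx = {"S": np.array([2, 3])}  # subject's position per sequence

seq_idx = 0
# Vocab-sized logit vector at the subject's position in sequence 0;
# indexing it further by the token id of "Mary" would give Mary's logit.
s_logits = logits[seq_idx, word_idx["S"][seq_idx]]
print(s_logits.shape)  # (16,)
```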
Figure out the dims for z in copy_scores to see how skipping the QK matrix “fully attends to the name tokens”
Specifically, we first obtained the state of the residual stream at the position of each name token after the first MLP layer. Then, we multiplied this by the OV matrix of a Name Mover Head (simulating what would happen if the head attended perfectly to that token),
This is z
“position of each name token”: model(ioi_dataset.toks.long())
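The copy-score step can be sketched like this, with toy weights standing in for the head's real W_V, W_O, and the unembedding (layer norms omitted; all values here are random placeholders, not the actual model's):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head, d_vocab = 8, 4, 16

# Stand-ins for one Name Mover Head's weights and the unembedding.
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
W_OV = W_V @ W_O                     # the head's combined OV circuit
W_U = rng.normal(size=(d_model, d_vocab))

# z: residual stream at a name token's position after the first MLP layer.
z = rng.normal(size=(d_model,))

# "Perfect attention" to that token: skip the QK circuit entirely and
# push z straight through the OV circuit, then unembed.
out = z @ W_OV
copy_logits = out @ W_U              # vocab-space scores, shape (16,)

# The copy score then checks whether the name's token id appears in the
# top-k of these logits.
top5 = np.argsort(-copy_logits)[:5]
print(top5)
```

Because the QK matrix only decides *where* the head attends, replacing it with perfect attention isolates what the OV circuit would copy from the name position.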