https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

What about the 1600-dim vectors produced in the middle of the network, say the output of the 12th layer or the 33rd? If we convert them to vocab space, do the results make sense? The answer is yes.

Convert to vocab space: logits = model_small.unembed(model_small.ln_final(o))
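A minimal sketch of this conversion. Hedged: `unembed` and `ln_final` follow TransformerLens-style naming as in the line above, but the weights here are random stand-ins, not a real model's:

```python
import numpy as np

d_model, d_vocab = 1280, 50257  # GPT-2 medium-sized dims, as in z_0 below

rng = np.random.default_rng(0)
W_U = rng.normal(size=(d_model, d_vocab))  # stand-in unembedding matrix

def ln_final(o, eps=1e-5):
    """Final LayerNorm (gamma=1, beta=0 for simplicity)."""
    mu = o.mean(axis=-1, keepdims=True)
    var = o.var(axis=-1, keepdims=True)
    return (o - mu) / np.sqrt(var + eps)

def unembed(x):
    """Project from residual-stream space to vocab space."""
    return x @ W_U

# o: residual-stream state mid-network, shape [batch, seq, d_model]
o = rng.normal(size=(1, 6, d_model))
logits = unembed(ln_final(o))
print(logits.shape)  # (1, 6, 50257)
```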

most_recent_S_name_movers_DRAFT.ipynb

z_0.shape = [1, 6, 1280]

Overall, z_0 has shape [batch_size, seq_length, embedding_size]; the second dimension (6) is the seq_length.

logits: [batch_size, sequence_length, vocab_size]

logits[seq_idx, ioi_dataset.word_idx[word][seq_idx]]:

ioi_dataset.word_idx[word][seq_idx] retrieves the indices for the specific word in the current sequence using the word_idx attribute of ioi_dataset.

E.g., if the subject S is "Mary", this finds the index of the "Mary" token, giving the logits at Mary's position for that prompt sequence.
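A toy illustration of this indexing. Hedged: `word_idx` here is a stand-in dict, not the real `ioi_dataset.word_idx` object, and all shapes are made up:

```python
import numpy as np

batch_size, seq_length, vocab_size = 2, 6, 50257
rng = np.random.default_rng(0)
logits = rng.normal(size=(batch_size, seq_length, vocab_size))

# Stand-in for ioi_dataset.word_idx: maps a word role ("S", "IO", ...)
# to the token position of that word in each sequence.
word_idx = {"S": np.array([2, 3])}  # e.g. "Mary" sits at position 2 in seq 0

seq_idx = 0
pos = word_idx["S"][seq_idx]        # position of the S token in this sequence
logits_at_S = logits[seq_idx, pos]  # logit vector at that position
print(logits_at_S.shape)  # (50257,)
```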


Figure out the dims for z in copy_scores to see how skipping the QK matrix "fully attends to the name tokens".

Specifically, we first obtained the state of the residual stream at the position of each name token after the first MLP layer. Then, we multiplied this by the OV matrix of a Name Mover Head (simulating what would happen if the head attended perfectly to that token),

This state (the residual stream at each name position after the first MLP) is the z used in copy_scores.

"position of each name token": taken from a forward pass, model(ioi_dataset.toks.long())
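The two steps above can be sketched as follows. Hedged: all weights and shapes are random stand-ins (the real computation uses activations cached from the forward pass and the head's actual W_V/W_O), but the dataflow matches the description:

```python
import numpy as np

d_model, d_head, d_vocab = 1280, 64, 50257
rng = np.random.default_rng(0)

# Stand-ins for one Name Mover Head's OV circuit and the unembedding.
W_V = rng.normal(size=(d_model, d_head)) * 0.02
W_O = rng.normal(size=(d_head, d_model)) * 0.02
W_U = rng.normal(size=(d_model, d_vocab)) * 0.02

# z: residual-stream state after the first MLP layer, at the name-token
# positions only -- shape [n_names, d_model].
z = rng.normal(size=(4, d_model))

# Multiply by the head's OV matrix: this simulates the head attending
# with weight 1.0 to each name token (skipping the QK circuit entirely).
ov_out = (z @ W_V) @ W_O       # [n_names, d_model]

# Unembed to vocab space; a copy score then checks whether the name
# token itself ranks highly among these logits.
copy_logits = ov_out @ W_U     # [n_names, d_vocab]
top5 = np.argsort(copy_logits, axis=-1)[:, -5:]
print(copy_logits.shape, top5.shape)  # (4, 50257) (4, 5)
```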