dotprod_size_tokens.ipynb
https://colab.research.google.com/drive/1rch6VaG9O1YFJT1wPjjbXyDgXizGT7WV
dotprod_size_tokens_GPTsmall.ipynb
https://colab.research.google.com/drive/18JcQcn7TKhN-1ULNjqQqvst9yJ6ZDhAA
Working on
- Use dot products on more than just the first layer's input embeddings (with neuron outgoing weights)
- Try at different points of the residual stream, though the stream there is mixed with other parts of the input
Get the average of the largest dot products and compare it to the average of random dot products
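A rough sketch of that comparison, using random stand-ins for W_E and a neuron's weight vector (shapes and names here are assumptions, not taken from the notebooks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real matrices: in the notebooks these would be the
# model's W_E (d_vocab x d_model, e.g. 50257 x 768 for GPT-2 small) and one
# neuron's incoming weight vector (e.g. a column of the layer-0 MLP's W_in).
d_vocab, d_model = 1000, 64
W_E = rng.standard_normal((d_vocab, d_model))
neuron_w = rng.standard_normal(d_model)

# Dot product of every token embedding with the neuron's weights.
dots = W_E @ neuron_w                                # shape (d_vocab,)

# Average of the k largest dot products vs. the average over k random tokens.
k = 20
top_k_avg = np.sort(dots)[-k:].mean()
rand_avg = dots[rng.choice(d_vocab, size=k, replace=False)].mean()

print(f"avg of top-{k} dot products:    {top_k_avg:.3f}")
print(f"avg of {k} random dot products: {rand_avg:.3f}")
```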
Future Work
- Figure out how not using QK and just using OV gets “full attention” (look at transformer matrix pipeline diagram and work it out on idroo)
- https://www.notion.so/Query-Key-Value-Matrices-fe92464f6ee24068b6aaa56bb85e903e
- If we skip the attention matrix, that's as if the attention pattern were the identity matrix, which would mean each position "fully attends to" itself. In terms of matrix multiplication summations, what does this look like?
- It would be $I \cdot v = z$, so $v = z$. The outputs are just the value matrix, copied through. The value matrix contains the "content"
- If we directly multiply the input embeddings by $W_V$, we get something that's not exactly the same as the input. HOWEVER, if these outputs (the matrix $v$) are unembedded into vocab space, they act like logit outputs, and each token in the vocab gets a logit value. If these logit values are strong on certain tokens, can we say the weight matrix $W_V$ (not $v$) is "giving content" to (attending to? need to think about this) those tokens? (See the sketch after this list.)
- Do weights in the value matrix have a semantic relation to the input vectors? Are they considered "content"?
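A minimal sketch of the identity-attention idea, assuming toy shapes and random stand-in matrices (x, W_V, W_U below are placeholders, not the model's actual weights; in a real model W_V is per-head and followed by W_O before unembedding):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_vocab = 5, 64, 1000        # toy sizes, not GPT-2's

x = rng.standard_normal((n_tokens, d_model))    # input embeddings
W_V = rng.standard_normal((d_model, d_model))   # value matrix (stand-in)
W_U = rng.standard_normal((d_model, d_vocab))   # unembedding (stand-in)

v = x @ W_V                     # value vectors: the "content"

# Skipping QK is the same as using an identity attention pattern:
A = np.eye(n_tokens)
z = A @ v
assert np.allclose(z, v)        # I @ v = z, so z = v: the values are copied through

# Unembedding the value vectors gives a logit over the vocab for each position;
# strong logits on particular tokens suggest what "content" W_V is writing.
logits = v @ W_U                # shape (n_tokens, d_vocab)
```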
Run Congruence tests
It’s still uncertain what the embeddings at each intermediate output represent. Thus, we will experiment with vector similarity comparisons (between inputs and neuron groups) at various output points to see whether they make semantic sense. We start by comparing the initial embeddings with various neurons to look for significant patterns, and try to explain whether these patterns indicate how the model represents semantic features. Some approaches might not make sense at first, but if they find patterns, they may be onto something that warrants further investigation.
Dot product tokens and feature neurons:
This is how $W_E(X)$ works: ‣
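A minimal sketch of how $W_E(X)$ works and of the dot-product congruence test, using toy shapes and random stand-ins (the real runs would use the model's actual embedding matrix, e.g. GPT-2 small's 50257 x 768 $W_E$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 1000, 64                     # toy sizes

W_E = rng.standard_normal((d_vocab, d_model))   # embedding matrix (stand-in)
X = np.array([5, 42, 7])                        # token ids for the input

# W_E(X) is just a row lookup: each token id selects its embedding vector.
embeds = W_E[X]                                 # shape (len(X), d_model)

# Equivalently, one_hot(X) @ W_E gives the same result.
one_hot = np.eye(d_vocab)[X]
assert np.allclose(one_hot @ W_E, embeds)

# Congruence test: dot each input embedding with a neuron's weight vector
# (hypothetical neuron weights; one score per input token).
neuron_w = rng.standard_normal(d_model)
congruence = embeds @ neuron_w                  # shape (len(X),)
```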