https://colab.research.google.com/drive/1lZQkZ5u2mQxsYyiGaUKbW-XjdHP0GJpA
Does a single neuron, or do several neurons, in the MLP(s) activate for the opposites tall and short? What about big and small; do they show activations similar to those for tall and short?
Finding 1: Actv patching on MLPs, then on their individual neurons
Actv patch: “John is short. Mary is” [ corrupted ]
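A minimal sketch of this patching setup, assuming TransformerLens and GPT-2 small; the clean prompt ("John is tall. Mary is"), the tall/short answer tokens, and the direction of the logit-diff metric are assumptions for illustration, not settled choices:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("John is tall. Mary is")     # assumed clean prompt
corrupt_tokens = model.to_tokens("John is short. Mary is")  # [ corrupted ]

_, clean_cache = model.run_with_cache(clean_tokens)
tall_id = model.to_single_token(" tall")
short_id = model.to_single_token(" short")

def logit_diff(logits):
    # How much the model prefers " short" over " tall" at the final position.
    return (logits[0, -1, short_id] - logits[0, -1, tall_id]).item()

def patch_mlp_out(mlp_out, hook):
    # Overwrite the corrupted run's MLP output with the clean run's.
    return clean_cache[hook.name]

# Patch each MLP layer's output into the corrupted run and record the effect.
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("mlp_out", layer)
    patched_logits = model.run_with_hooks(
        corrupt_tokens, fwd_hooks=[(hook_name, patch_mlp_out)]
    )
    print(layer, logit_diff(patched_logits))
```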
Generalize how to automate “Finding 1”
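One way this automation could look (a sketch that reuses `model`, `clean_cache`, `corrupt_tokens`, and `logit_diff` from the block above; the ranking direction and the `top_k` cutoff are assumptions):

```python
def patch_single_neuron(neuron_idx):
    def hook_fn(post, hook):
        # post has shape [batch, pos, d_mlp]; swap in only this neuron's clean value.
        post[:, :, neuron_idx] = clean_cache[hook.name][:, :, neuron_idx]
        return post
    return hook_fn

def rank_neurons(layer, top_k=10):
    # Patch each neuron of one MLP layer and rank by effect on the logit diff.
    hook_name = utils.get_act_name("post", layer)
    scores = []
    for n in range(model.cfg.d_mlp):
        logits = model.run_with_hooks(
            corrupt_tokens, fwd_hooks=[(hook_name, patch_single_neuron(n))]
        )
        scores.append((logit_diff(logits), n))
    return sorted(scores, reverse=True)[:top_k]

significant_neurons = rank_neurons(layer=0)  # layer chosen from the MLP sweep above
print(significant_neurons)
```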
Finding 2: Given the significant neurons, examine them further in Neuroscope
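To hand the ranked neurons to Neuroscope, the links can be generated programmatically; the URL pattern below is an assumption about how Neuroscope pages for gpt2-small are addressed:

```python
def neuroscope_url(model_name, layer, neuron):
    # Assumed URL pattern: https://neuroscope.io/<model>/<layer>/<neuron>.html
    return f"https://neuroscope.io/{model_name}/{layer}/{neuron}.html"

for score, neuron in significant_neurons:   # from rank_neurons() above
    print(neuroscope_url("gpt2-small", 0, neuron), f"(patch effect {score:.3f})")
```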
Make reusable functions from the prev notebook and put them into a new notebook
The attention-head circuit takes information from the MLPs, so find the specific MLP layers the heads interact with. We need to understand the dimensions of the MLP weights and the embedded tokens to see how they are multiplied with the attention outputs in the circuit (see the shape printout below).
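Continuing from the sketch above, printing the relevant shapes is a quick way to keep these dimensions straight (TransformerLens naming assumed):

```python
print("d_model:", model.cfg.d_model)                        # residual stream width
print("d_mlp:  ", model.cfg.d_mlp)                          # neurons per MLP layer
print("W_E:    ", tuple(model.W_E.shape))                   # [d_vocab, d_model] token embeddings
print("W_in:   ", tuple(model.blocks[0].mlp.W_in.shape))    # [d_model, d_mlp]
print("W_out:  ", tuple(model.blocks[0].mlp.W_out.shape))   # [d_mlp, d_model]
print("W_O:    ", tuple(model.blocks[0].attn.W_O.shape))    # [n_heads, d_head, d_model]
```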
NOTE: each neuron only has one activation value, but there are input values into the neuron and output values out of it. The input weights are multiplied by every value in the input vector and summed, which creates that single activation; the “output weights” are the per-neuron weights that carry the activation back out.
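Concretely, under TransformerLens conventions a neuron's input weights are one column of `W_in` and its output weights are one row of `W_out`; a minimal illustration, reusing `model` from the sketches above (the layer and neuron indices are arbitrary):

```python
import torch

layer, neuron = 0, 0                   # arbitrary indices for illustration
mlp = model.blocks[layer].mlp

w_in_neuron = mlp.W_in[:, neuron]      # [d_model]: multiplied with the neuron's input vector
w_out_neuron = mlp.W_out[neuron, :]    # [d_model]: written back to the residual stream

# The neuron's single pre-activation value is a dot product with its input.
resid = torch.randn(model.cfg.d_model)            # stand-in for a residual-stream vector
pre_act = resid @ w_in_neuron + mlp.b_in[neuron]
print(pre_act.item())
```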
Brainstorm a workflow from broad to specific to find MULTIPLE neurons for each token and see how they add up with each other and with the attention-head circuit: