Arguments to modify to alter the dataset:

  1. clean dataset
    1. prompt templates
    2. specific subjects (make into adjectives)
  2. corrupted dataset
    1. how to flip specific adjectives of clean
    2. It shouldn’t actually matter what names you use, so just one sentence is fine.
  3. circuit (DONE)
  4. layer_heads (DONE)
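
A minimal sketch of items 1 and 2 above: build clean prompts from templates, then build the corrupted prompts by flipping each adjective to its antonym. The templates, the adjective pairs, the name "Mary", and the helper make_datasets are all hypothetical placeholders, not taken from the notebook.

```python
# Hypothetical size-adjective pairs; extend as needed.
SIZE_PAIRS = [("tall", "short"), ("big", "small"), ("wide", "narrow")]

# Build a two-way antonym lookup so either member of a pair can be flipped.
ANTONYM = {}
for a, b in SIZE_PAIRS:
    ANTONYM[a] = b
    ANTONYM[b] = a

# Hypothetical prompt templates; {name} and {adj} are filled in below.
TEMPLATES = [
    "{name} is {adj}.",
    "Everyone agrees that {name} is really {adj}.",
]


def make_datasets(templates, adjectives, name="Mary"):
    """Return (clean, corrupted) prompt lists of equal length.

    Each corrupted prompt is identical to its clean counterpart except
    that the adjective is replaced by its antonym. The name is held fixed,
    since it should not matter which name is used.
    """
    clean, corrupted = [], []
    for template in templates:
        for adj in adjectives:
            clean.append(template.format(name=name, adj=adj))
            corrupted.append(template.format(name=name, adj=ANTONYM[adj]))
    return clean, corrupted


clean_prompts, corrupted_prompts = make_datasets(TEMPLATES, ["tall", "big", "wide"])
```

Each clean/corrupted pair differs in exactly one word, so the two prompts should have matching length as long as the adjective and its antonym tokenize to the same number of tokens; that is worth checking with the model's tokenizer.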

Try to make one for tall/short and other size comparisons:

tall_short_circuit.ipynb

https://colab.research.google.com/drive/1cFJc2Zc1fh_BXV42q3h4zfvRikINE_Mo#scrollTo=wsgXK6jvbE-s


Customize your own logit diff:

logit_diff() is used in plot_path_patching, but it measures the logit difference between the IO and S tokens. This task doesn't compare IO and S, so replace it with your own custom function.

https://github.com/redwoodresearch/Easy-Transformer/blob/main/easy_transformer/ioi_utils.py

Before, the corrupted dataset replaced the names in the IOI prompts with random names. Here, the corrupted dataset should switch the size adjective to its antonym.

May just end up using logit_diff from exploratory analysis
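
A custom logit diff for the size task could look roughly like this. It is a sketch on plain Python lists, not the Easy-Transformer logit_diff: the function name size_logit_diff and its arguments are assumptions, and it expects the final-position logits to have already been pulled out of the model output (one vector per prompt).

```python
def size_logit_diff(final_logits, correct_ids, incorrect_ids):
    """Mean over prompts of logit[correct token] - logit[incorrect token].

    final_logits:  list of per-prompt logit vectors at the final position
    correct_ids:   token id of the expected adjective for each prompt
    incorrect_ids: token id of its antonym for each prompt
    """
    diffs = [
        row[correct] - row[incorrect]
        for row, correct, incorrect in zip(final_logits, correct_ids, incorrect_ids)
    ]
    return sum(diffs) / len(diffs)


# Toy example: 2 prompts over a made-up 4-token vocabulary.
toy_logits = [
    [0.5, 2.0, -1.0, 0.0],
    [1.0, 0.0, 3.0, 0.5],
]
toy_diff = size_logit_diff(toy_logits, correct_ids=[1, 2], incorrect_ids=[0, 3])
```

With real model output the same indexing applies to the logits tensor; the only change from the IOI version is that the two token ids per prompt are the adjective and its antonym rather than IO and S.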

What does it mean to run circuit_logit_diff = logit_diff(model, dataset)?

The logit_diff function calculates the difference between the logits of the indirect object (IO) token and the subject (S) token in the model's output.

We run logit_diff on the full model before circuit extraction, then again after, and compare the two values. After patching, the logit diff should be lower, since the model is worse at separating the correct token from the incorrect one.
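
One way to make the before/after comparison concrete is to report both the absolute drop and the fraction of the original logit diff that survives. The helper compare_logit_diffs is hypothetical, and the numbers passed to it below are placeholders, not measured results.

```python
def compare_logit_diffs(full_diff, circuit_diff):
    """Return (drop, fraction_kept) for a before/after logit diff pair.

    full_diff:    logit diff measured before circuit extraction
    circuit_diff: logit diff measured after
    """
    drop = full_diff - circuit_diff
    # Guard against dividing by zero when the full model shows no preference.
    fraction_kept = circuit_diff / full_diff if full_diff != 0 else float("nan")
    return drop, fraction_kept


# Placeholder values only, to show the call.
drop, fraction_kept = compare_logit_diffs(full_diff=4.0, circuit_diff=3.0)
```

A positive drop matches the expectation above that the logit diff is lower after patching.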