Arguments to modify to alter the dataset:
Try to make one for tall/short and other size comparisons:
tall_short_circuit.ipynb
https://colab.research.google.com/drive/1cFJc2Zc1fh_BXV42q3h4zfvRikINE_Mo#scrollTo=wsgXK6jvbE-s
Customize your own logit diff:
logit_diff() is used in plot_path_patching, but it measures the logits of the S and IO tokens. This task doesn't use S and IO, so replace it with a custom function.
https://github.com/redwoodresearch/Easy-Transformer/blob/main/easy_transformer/ioi_utils.py
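A minimal sketch of what a custom logit diff for the size task might look like. It is not the ioi_utils implementation: the name size_logit_diff, its arguments, and the toy token ids are all assumptions made for illustration. It uses plain Python lists so it runs on its own; in the notebook the logits would presumably be tensors taken from the model's final position.

```python
def size_logit_diff(final_logits, correct_ids, incorrect_ids):
    """Mean of (logit of correct adjective - logit of its antonym) over prompts.

    final_logits: one list of vocabulary logits per prompt (last position only).
    correct_ids / incorrect_ids: per-prompt token ids of the expected
    adjective and of its antonym. All names here are hypothetical.
    """
    if not final_logits:
        raise ValueError("need at least one prompt")
    if not (len(final_logits) == len(correct_ids) == len(incorrect_ids)):
        raise ValueError("need one correct and one incorrect id per prompt")
    diffs = [
        logits[c] - logits[w]
        for logits, c, w in zip(final_logits, correct_ids, incorrect_ids)
    ]
    return sum(diffs) / len(diffs)


# Toy vocabulary of 4 tokens; ids 1 and 2 stand in for " tall" and " short".
toy_logits = [
    [0.0, 3.0, 1.0, 0.5],  # prompt whose correct answer is id 1
    [0.2, 0.5, 2.5, 0.1],  # prompt whose correct answer is id 2
]
print(size_logit_diff(toy_logits, correct_ids=[1, 2], incorrect_ids=[2, 1]))  # 2.0
```

A positive value means the model prefers the correct adjective on average, which mirrors how the IO minus S difference is read in the IOI setup.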
Before, the corrupted dataset replaced the names in the IOI prompts with random names. Here, the corrupted dataset should instead switch the size adjective to its antonym.
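The antonym swap could be sketched as below. The ANTONYMS table, the corrupt_prompt helper, and the example prompt are hypothetical and not taken from the notebook. One caveat to check separately: for patching, the clean and corrupted prompts probably need to tokenize to the same length, so each adjective and its antonym should be single tokens.

```python
import re

# Hypothetical antonym pairs; extend with the size adjectives the dataset uses.
ANTONYMS = {"tall": "short", "short": "tall", "big": "small", "small": "big"}

# Match whole words only, so "smaller" or "shortly" are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ANTONYMS)) + r")\b")


def corrupt_prompt(prompt):
    """Replace every size adjective in the prompt with its antonym.

    The substitution happens in a single pass, so a swapped word is never
    swapped back.
    """
    return _PATTERN.sub(lambda m: ANTONYMS[m.group(1)], prompt)


clean = "The giraffe is tall, so it is not"
print(corrupt_prompt(clean))  # The giraffe is short, so it is not
```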
May just end up using logit_diff from exploratory analysis
What does it mean to run circuit_logit_diff = logit_diff(model, dataset)?
The logit_diff function calculates the difference between the logits the model assigns to the indirect object (IO) token and to the subject (S) token in its output.
We run logit_diff on the full model before circuit extraction, then again after, and compare the two values. After patching, the logit diff should be lower, since the model is then worse at separating the correct token from the incorrect one.
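The comparison itself could be summarised as below. The helper compare_logit_diffs and the numbers passed to it are placeholders for illustration, not measured results.

```python
def compare_logit_diffs(full_model_diff, circuit_diff):
    """Summarise how much of the full-model logit diff survives in the circuit.

    Both arguments are mean logit diffs computed on the same dataset.
    """
    if full_model_diff == 0:
        raise ValueError("full-model logit diff is zero; ratio is undefined")
    return {
        "drop": full_model_diff - circuit_diff,
        "fraction_retained": circuit_diff / full_model_diff,
    }


# Placeholder values only: 4.0 for the full model, 3.0 after extraction.
result = compare_logit_diffs(4.0, 3.0)
print(result)  # {'drop': 1.0, 'fraction_retained': 0.75}
```

A fraction close to 1 would suggest the circuit keeps most of the model's ability to separate the correct token from the incorrect one.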