Prior work in neural network interpretability has identified circuit subgraphs that explain how a model's components process information to perform natural language tasks. In this study, we build on that work by uncovering a broader variety of circuits within GPT-2 models for comparison-oriented natural language tasks, ranging from subject comparison to size comparison. Our approach first applies causal interventions to identify important attention heads, then employs further techniques to decipher how these heads move information. We also investigate the contribution of MLPs to processing knowledge. Combining our findings on the functions of attention heads and MLPs, we explain how complete circuits accomplish these tasks.
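
To make the causal-intervention step concrete, the sketch below shows one common form of it, head-level activation patching, implemented with the open-source TransformerLens library. The prompt pair, size-comparison task, and logit-difference metric here are illustrative assumptions for this example, not necessarily the exact setup used in our experiments.

```python
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

# Illustrative size-comparison prompt pair (an assumption for this sketch).
# Clean and corrupted prompts must tokenize to the same length so that
# activations can be patched position-by-position.
model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("The elephant is bigger than the mouse. The bigger one is the")
corrupt = model.to_tokens("The mouse is bigger than the elephant. The bigger one is the")

elephant = model.to_single_token(" elephant")
mouse = model.to_single_token(" mouse")

# Cache all activations from the clean run.
_, clean_cache = model.run_with_cache(clean)

def patch_head(z, hook, head):
    # z: [batch, pos, head_index, d_head]; restore one head's clean output.
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

# Sweep every (layer, head) pair: patch its clean output into the corrupted
# run and measure how much of the clean behavior (logit difference between
# the correct and incorrect answer tokens) is restored.
results = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    hook_name = get_act_name("z", layer)  # "blocks.{layer}.attn.hook_z"
    for head in range(model.cfg.n_heads):
        logits = model.run_with_hooks(
            corrupt,
            fwd_hooks=[(hook_name, lambda z, hook, h=head: patch_head(z, hook, h))],
        )
        results[layer, head] = logits[0, -1, elephant] - logits[0, -1, mouse]
```

The same sweep extends to MLP layers by hooking `get_act_name("mlp_out", layer)` in place of the attention output, which is one way to probe the knowledge-processing role of MLPs described above.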

[summarize experimental results in 1-2 sentences here]

By broadening the catalogue of known circuits in GPT-2 beyond previously studied tasks, and by jointly characterizing the roles of attention heads and MLPs, this study contributes both new interpretability findings and a reusable methodology for future circuit-discovery work.