Reverse engineering the Acronym task in GPT-2 small (WIP)
AISF Project, 2024.
This project is based on one of the problems suggested in Neel Nanda’s 200 Concrete Open Problems in Mechanistic Interpretability, and was completed for BlueDot Impact’s AI Safety Fundamentals course on AI alignment.
We attempt to reverse engineer how GPT-2 small predicts the 3-letter acronym of a 3-word sequence. We isolate several attention heads and MLP layers of interest and propose hypotheses about their roles, leaving more rigorous exploration to future work.
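To make the task concrete, here is a minimal sketch of how acronym prompts and their per-letter prediction targets could be constructed. The word list, the prompt template ("The ... (") and the helper names `make_example` and `prediction_steps` are illustrative assumptions, not the dataset or code used in this project. The sketch also assumes each acronym letter is predicted one step at a time, which depends on how the tokenizer splits the acronym.

```python
import random

# Hypothetical word list, for illustration only.
WORDS = ["Chief", "Executive", "Officer", "Global", "Health",
         "Network", "Rapid", "Data", "Union"]


def make_example(rng: random.Random) -> dict:
    """Sample three capitalized words and derive their 3-letter acronym."""
    words = rng.sample(WORDS, 3)
    acronym = "".join(w[0] for w in words)
    # Assumed prompt template: the words followed by an opening parenthesis,
    # so that a natural continuation is the acronym itself.
    prompt = "The " + " ".join(words) + " ("
    return {"prompt": prompt, "words": words, "acronym": acronym}


def prediction_steps(example: dict) -> list:
    """Return one (context, target_letter) pair per acronym letter."""
    steps = []
    for i, letter in enumerate(example["acronym"]):
        # The context includes the acronym letters produced so far.
        context = example["prompt"] + example["acronym"][:i]
        steps.append((context, letter))
    return steps


if __name__ == "__main__":
    example = make_example(random.Random(0))
    for context, target in prediction_steps(example):
        print(repr(context), "->", target)
```

Each (context, target) pair corresponds to one prediction the model has to make, so the behaviour on the first, second and third letter can be studied separately.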