Developing scalable ways to understand deep learning. Especially excited about
using (mechanistic) interpretability to improve the safety and reliability of neural networks. Current
interests include automated interpretability, rigorous interpretability evals, concept bottleneck models (CBMs), and sparse autoencoders (SAEs).
PhD student at UC San Diego advised by Prof. Tsui-Wei (Lily) Weng.
Bachelor of Science in Computer Science and Engineering and in Philosophy from MIT.
Google Scholar /
GitHub /
email: toikarinen@ucsd.edu