Join us for an evening of fun at this month’s hack::soho, taking place on 26 February, 6pm – 9pm GMT. It is set up as a relaxed networking environment where cybersecurity professionals can chat, enjoy complimentary food and drink, and discuss rising global trends.
This month’s hack::soho features a talk from Stjepan Picek, professor at the University of Zagreb. The abstract of the talk, ‘Safety-Neuron-Based Attacks on LLMs,’ is below!
hack::soho is a monthly event hosted at our London, UK office for the cybersecurity and hacking community to discuss all things security over food and refreshments. We welcome you to invite others in your circle to extend our collective network.
Spots are limited, so please use real contact details to confirm your registration. We will not sell, distribute, or use your contact information outside of sending you details about upcoming hack::soho meetups.
ABSTRACT
Large language models (LLMs) achieve state-of-the-art performance across many tasks, but their widespread deployment raises urgent security, privacy, and misuse concerns. Building on recent progress in sparse mechanistic interpretability—particularly results from vision models—this talk explores the hypothesis that a small set of neurons or features is disproportionately responsible for safety-aligned behavior in LLMs. I present methods to identify such sparse, interpretable substructures and evaluate how manipulating them at inference time can degrade safety behavior in both white-box and black-box settings.
I then extend this perspective to Mixture-of-Experts (MoE) models, introducing a training-free, lightweight, and architecture-agnostic framework for probing and stress-testing the safety alignment of modern MoE LLMs during inference. Finally, I discuss broader implications and applications of “safety features,” including safety-relevant behavior in code-generation models and the resulting opportunities for more robust alignment and defense.
PRESENTER’S BIO
Stjepan Picek is a full professor at the University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia.
He also holds an associate professor position at Radboud University, Nijmegen, and an adjunct professor position at the University of Bergen, Norway.
Before that, he was an assistant professor at TU Delft and a postdoctoral researcher at MIT, USA, and KU Leuven, Belgium. Stjepan completed a PhD in computer science in 2015 at the University of Zagreb, Croatia, and Radboud University, the Netherlands. In 2024, he finished a PhD in mathematics at the University of Paris 8, France.
His research interests include security and cryptography, machine learning, and evolutionary computation.
To date, Stjepan has given more than 60 invited talks and published more than 200 refereed papers. He is a program committee member and reviewer for a number of conferences and journals and a member of several professional societies. His work has been featured in the mainstream media and on popular technology blogs. He is a member of ELLIS and a Fellow of the Young Academy of Europe.
