Lecturer:
Dr Gray Manicom (Stellenbosch University)
Course dates:
1 June – 28 August 2026
Abstract:
This postgraduate course introduces students to the new technology of Large Language Models (LLMs): how they are trained, how they work, and topics related to their safety, with a focus on mechanistic interpretability. The course is divided into three modules with the central theme of using mechanistic interpretability techniques to understand and engineer already-trained LLMs for the purposes of safety.
The first module introduces students to the supervised and unsupervised machine learning techniques that are used to train Generative Pre-trained Transformer (GPT) LLMs. This will cover topics such as training data, transformer architecture, multi-layer perceptrons, attention layers, the residual stream, reinforcement learning from human feedback, prompt engineering and compute/parallelisation. This high-level theoretical overview is paired with hands-on technical work in which students will train or fine-tune their own LLMs using existing (clean) datasets.
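To give a flavour of the hands-on component, the sketch below shows a single pre-norm transformer block, illustrating how the attention and MLP sub-layers each write their outputs back into the residual stream. It is a minimal sketch only, assuming PyTorch; the dimensions are illustrative and the causal attention mask is omitted for brevity.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm transformer block: attention and MLP both add to the residual stream."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # x is the residual stream: (batch, sequence, d_model)
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out               # attention output is added to the residual stream
        x = x + self.mlp(self.ln2(x))  # the MLP output is added to the residual stream
        return x

x = torch.randn(1, 10, 512)            # embeddings for one sequence of 10 tokens
print(TransformerBlock()(x).shape)     # torch.Size([1, 10, 512])

Stacking many such blocks, plus token embeddings and an unembedding layer, gives the GPT-style architecture whose next-token probabilities are studied in the course.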
In the second module we dive deeper into the mechanics of LLMs by developing an understanding of techniques from the field of mechanistic interpretability. Techniques from this emerging field can be used to open the "black box" of LLMs and identify the circuits and features that underlie computations, factual recall and other functions within LLMs. This will cover topics such as features, circuits, polysemanticity vs monosemanticity and superposition, sparse autoencoders (SAEs) and latents, feature detection and causation, universality and motifs, and interventions such as activation patching and feature clamping. Practical components will include using existing tools such as Gemma Scope, Automatic Circuit Discovery and pre-trained SAEs.
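To illustrate the kind of tooling involved, below is a minimal sketch of a sparse autoencoder of the sort used to decompose residual-stream activations into sparse latents. It assumes PyTorch, and the latent width and sparsity coefficient are illustrative choices, not those of any production SAE such as Gemma Scope.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode activations into a wide, sparse latent space and reconstruct them."""
    def __init__(self, d_model=512, d_latent=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations):
        latents = torch.relu(self.encoder(activations))  # non-negative latents, encouraged to be sparse
        reconstruction = self.decoder(latents)
        return reconstruction, latents

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                  # a batch of residual-stream activations
recon, latents = sae(acts)
l1_coeff = 1e-3                              # illustrative sparsity penalty
loss = ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().sum(dim=-1).mean()
loss.backward()                              # in practice trained over many stored activations

The reconstruction term keeps the latents faithful to the original activations, while the L1 penalty pushes most latents towards zero, which is what makes individual latents easier to interpret as features.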
The third module covers theoretical topics related to AI safety, including bias, alignment, sycophancy, agentic systems, over-reliance and overuse, and existential threats. These theoretical discussions are paired with practical tutorials or demonstrations in which students will be able to elicit unsafe responses from LLMs and see which tools and training procedures are used to mitigate those risks. These interventions will include techniques from mechanistic interpretability to produce causal explanations of where unsafe behaviours originate, and to engineer interventions that mitigate those behaviours.
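As a taste of such an intervention, the sketch below shows activation steering via a PyTorch forward hook that adds a chosen feature direction to a layer's output. The toy layer and the refusal_direction vector are hypothetical stand-ins for a real LLM block and an SAE-derived feature direction.

import torch
import torch.nn as nn

d_model = 512
toy_layer = nn.Linear(d_model, d_model)      # stand-in for one block of a real LLM
refusal_direction = torch.randn(d_model)     # hypothetical feature direction (e.g. found with an SAE)
refusal_direction /= refusal_direction.norm()

def steering_hook(module, inputs, output, scale=5.0):
    # Returning a new tensor replaces the layer's output:
    # a positive scale amplifies the feature, a negative scale suppresses it.
    return output + scale * refusal_direction

handle = toy_layer.register_forward_hook(steering_hook)
x = torch.randn(1, d_model)
steered = toy_layer(x)                       # output is now shifted along the feature direction
handle.remove()                              # detach the hook once the experiment is done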
This course will give students both a theoretical and a hands-on understanding of the revolutionary technology of the past few years, framed from the perspective of safety. Theoretical and technical skills from the study of interpretability, and mechanistic interpretability in particular, will help students understand how to use LLMs safely and responsibly, and how LLMs can be engineered post-training to ensure that the models align with human interests and values.
Outcomes:
Students are expected to:
- Develop a good understanding of how LLMs are trained and how they compute next-token probabilities.
- Develop a deep understanding of different aspects of AI safety.
- Be able to causally explain LLM computations and to change their behaviour with different mechanistic interventions.
- Develop the skills to engineer safe AI and to hone LLMs for specific use cases.
Lecture format:
Lectures will be delivered live online using Microsoft Teams or Zoom, each followed by a discussion. Lecture recordings will be shared after each lecture.
Each lecture will begin with a short recap of the previous lecture and will end with a summary of the main ideas of the lecture.
Synchronous contact:
Lectures will be live and recorded, allowing synchronous contact during lectures and in the discussions that follow them. There will be at least two in-person tutorial sessions per week, giving students two further opportunities for practical contact.
Student assessment:
Students will have to complete a variety of activities across the different modules. There will be online tutorial tests to assess theoretical knowledge, for which students can re-submit answers to improve their mark. To demonstrate practical knowledge, students will also need to submit Jupyter (IPython) notebooks or Google Colab notebooks containing the code for their work and experiments. These notebooks can be submitted individually or in groups of up to three members.
The final assignment will combine knowledge from throughout the programme in a written report (with supporting code) demonstrating exploratory research into an LLM to investigate and intervene in a safety issue of the students' choice. For this they can work alone or in groups of up to four members. They will need to explain the safety issue, covering the important theoretical aspects of how LLMs are trained and why that safety issue arises and matters, and then demonstrate how one might intervene to mitigate those risks. The assessment will not be results-driven; rather, it will test whether the correct methodologies were applied within an appropriate interpretability framework.
Should the number of groups be small (fewer than 20), groups will be required to present their final assignment online.
Lecturer's biography:
Gray Manicom is an early career researcher at the Policy Innovation Lab in the School of Data Science and Computational Thinking at Stellenbosch University. He holds a PhD in Mathematics from the University of Auckland and has expertise in dynamical systems, agent-based models on networks, and AI. His current work focuses on mechanistic interpretability, the use of AI in policymaking, and the development of policies for AI. He lectures modules in the Department of Mathematical Sciences master’s programme in AI and provides AI training to policymakers and other government officials in South Africa.
E: graym@sun.ac.za