Co-authored by Gemini :)

Executive Summary

Small Language Models (<1B parameters) exhibit a robust failure mode: they correctly answer factual questions when prompted to "Think step by step" (CoT), but frequently hallucinate when given the simpler instruction to "Give answer in one word" (GAOW). This hallucination is especially pronounced when the input prompt contains a sequence of factual QA pairs. This project investigates the mechanistic basis of this instruction-dependent failure. I hypothesize that the GAOW instruction activates a shallow "heuristic pathway" that bypasses the model's more robust factual recall circuits.

Using causal interventions on Qwen-family models (Qwen1.5-0.5B-Chat and Qwen3-0.6B), I present strong evidence for this hypothesis. A comparative Logit Lens analysis (Figure 1) shows that the computational pathways for the two instructions diverge in the mid-to-late layers (L5-L16), immediately following the processing of the instruction tokens. Direct Logit Attribution and Activation Patching (Figures 2 and 4) confirm that a sparse set of late-layer MLPs and attention heads is causally sufficient for recalling the correct fact in the CoT pathway.
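To make the Logit Lens comparison concrete, here is a minimal, self-contained sketch of the technique with toy weights (not the actual Qwen activations): each layer's residual stream at the final token position is normalized and projected through the unembedding matrix, and the per-layer top token shows where a pathway's "belief" settles. The vocabulary, dimensions, and drift toward one answer direction are all illustrative assumptions.

```python
import numpy as np

def logit_lens(resid_by_layer, W_U, vocab):
    """Project each layer's residual stream (final token position) through
    the unembedding matrix and return the top token at every layer."""
    tops = []
    for resid in resid_by_layer:
        # Simplified final norm: rescale to unit RMS before unembedding.
        normed = resid / np.sqrt(np.mean(resid ** 2) + 1e-5)
        logits = normed @ W_U                  # (d_model,) @ (d_model, d_vocab)
        tops.append(vocab[int(np.argmax(logits))])
    return tops

# Toy setup: 4 layers, d_model=64, 3-token vocabulary (all hypothetical).
rng = np.random.default_rng(0)
vocab = ["Paris", "London", "Washington"]
W_U = rng.normal(size=(64, 3))
# Fake residual streams that drift toward the "Washington" direction,
# mimicking a pathway that converges on the correct fact in later layers.
target = W_U[:, 2]
resid_by_layer = [rng.normal(size=64) + (l / 3.0) * 5 * target for l in range(4)]

print(logit_lens(resid_by_layer, W_U, vocab))
```

In the real experiments the same projection is applied at every layer for both the CoT and GAOW prompts; the layer at which the two top-token trajectories stop agreeing is the divergence point reported above.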

Crucially, attention pattern analysis (Figure 3) provides a direct mechanistic link: heads that promote the correct answer (e.g., L22.H12) attend to factual keywords like "capital" and "America", while heads that suppress it (e.g., L19.H4) attend directly to the GAOW instruction tokens like "one" and "word". This suggests the existence of a mechanistic "switch" that selects between reasoning modes, providing a promising target for interventions to improve model reliability.
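The attention-pattern analysis can be sketched in a few lines (toy embeddings and weights, not the real L22.H12 or L19.H4): a head's pattern is softmax of the scaled query-key scores with a causal mask, and inspecting which source token receives the most weight from the final position is what distinguishes a "fact-attending" head from an "instruction-attending" one. The tied Q/K matrices and the alignment nudge below are assumptions made so the toy head reliably attends to the factual keyword.

```python
import numpy as np

def attention_pattern(x, W_Q, W_K):
    """Return one head's attention pattern: causal softmax over source
    positions of (x W_Q)(x W_K)^T / sqrt(d_head)."""
    q, k = x @ W_Q, x @ W_K
    scores = q @ k.T / np.sqrt(W_Q.shape[1])
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # causal mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["What", "is", "the", "capital", "?", "Answer", "in", "one", "word"]
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
x = rng.normal(size=(len(tokens), d_model))
W_Q = rng.normal(size=(d_model, d_head)) * 0.3
W_K = W_Q.copy()  # tied for simplicity so the alignment nudge is guaranteed
# Hypothetical: give "capital" a component along the final token's direction,
# so this fake head attends to the factual keyword, like a fact-attending head.
x[tokens.index("capital")] += x[-1] * 3.0

pattern = attention_pattern(x, W_Q, W_K)
print("final position attends most to:", tokens[int(np.argmax(pattern[-1]))])
```

In the actual analysis the same readout is taken from the cached patterns of specific heads on the real prompts; a suppressing head would instead place its mass on the "one" and "word" instruction tokens.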

I ran the experiments on Qwen1.5-0.5B-Chat and Qwen3-0.6B because both models exhibit a similar hallucination. The plots for both models are similar, which suggests the hallucination is driven by a similar set of heads and layers. Models like Gemma3-270M and SmolLM-360M also exhibit a similar hallucination, but unfortunately I could not run the experiments on them because of the limited compatibility of the TransformerLens library with those models.

I also noticed that SLMs are not good at instruction following: a plain prompt (without CoT) and a CoT prompt show no difference in the logit lens, and even their attention patterns remain the same. Nevertheless, CoT does improve the model's performance.

High-Level Takeaways


1. Motivation and Background

The reliability of language models is critically dependent on their ability to follow instructions faithfully. While large models are increasingly capable, smaller, more accessible models often exhibit surprising failures. This project investigates one such failure: a consistent, instruction-dependent factual hallucination in SLMs (<1B parameters). This work is inspired by recent advances in identifying and manipulating high-level behaviors via activation engineering, such as the discovery of a "refusal direction" [Arditi et al.]. By contrast, this project explores a different behavioral axis: the model's choice between a deep, algorithmic reasoning process and a shallow, heuristic one. Understanding the mechanism behind this "switch" is a key step toward building models that can be reliably steered towards safer, more robust reasoning.


2. Experimental Setup