Co-authored by Gemini :)

Executive Summary

Small Language Models (<1B parameters) exhibit a robust failure mode: they correctly answer factual questions when prompted to "Think step by step" (CoT), but frequently hallucinate when given the simpler instruction to "Give answer in one word" (GAOW). This hallucination is especially pronounced when the input prompt contains a sequence of factual QA pairs. This project investigates the mechanistic basis of this instruction-dependent failure. I hypothesize that the GAOW instruction activates a shallow "heuristic pathway" that bypasses the model's more robust factual recall circuits.

Using causal interventions on Qwen-family models (Qwen1.5-0.5B-Chat and Qwen3-0.6B), I present strong evidence for this hypothesis. A comparative Logit Lens analysis (Figure 1) shows that the computational pathways for the two instructions diverge in the mid-to-late layers (L5-L16), immediately following the processing of the instruction tokens. Direct Logit Attribution and Activation Patching (Figures 2 and 4) confirm that a sparse set of late-layer MLPs and attention heads is causally sufficient for recalling the correct fact in the CoT pathway.
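To make the Logit Lens comparison concrete, here is a minimal, self-contained sketch of the technique with toy weights (not the actual Qwen activations): each layer's residual stream at the final token position is normalized and projected through the unembedding matrix, and the per-layer top token shows where a pathway's "belief" settles. The vocabulary, dimensions, and drift toward one answer direction are all illustrative assumptions.

```python
import numpy as np

def logit_lens(resid_by_layer, W_U, vocab):
    """Project each layer's residual stream (final token position) through
    the unembedding matrix and return the top token at every layer."""
    tops = []
    for resid in resid_by_layer:
        # Simplified final norm: rescale to unit RMS before unembedding.
        normed = resid / np.sqrt(np.mean(resid ** 2) + 1e-5)
        logits = normed @ W_U                  # (d_model,) @ (d_model, d_vocab)
        tops.append(vocab[int(np.argmax(logits))])
    return tops

# Toy setup: 4 layers, d_model=64, 3-token vocabulary (all hypothetical).
rng = np.random.default_rng(0)
vocab = ["Paris", "London", "Washington"]
W_U = rng.normal(size=(64, 3))
# Fake residual streams that drift toward the "Washington" direction,
# mimicking a pathway that converges on the correct fact in later layers.
target = W_U[:, 2]
resid_by_layer = [rng.normal(size=64) + (l / 3.0) * 5 * target for l in range(4)]

print(logit_lens(resid_by_layer, W_U, vocab))
```

In the real experiments the same projection is applied at every layer for both the CoT and GAOW prompts; the layer at which the two top-token trajectories stop agreeing is the divergence point reported above.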

Crucially, attention pattern analysis (Figure 3) provides a direct mechanistic link: heads that promote the correct answer (e.g., L22.H12) attend to factual keywords like "capital" and "America", while heads that suppress it (e.g., L19.H4) attend directly to the GAOW instruction tokens like "one" and "word". This suggests the existence of a mechanistic "switch" that selects between reasoning modes, providing a promising target for interventions to improve model reliability.
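The attention-pattern analysis can be sketched in a few lines (toy embeddings and weights, not the real L22.H12 or L19.H4): a head's pattern is softmax of the scaled query-key scores with a causal mask, and inspecting which source token receives the most weight from the final position is what distinguishes a "fact-attending" head from an "instruction-attending" one. The tied Q/K matrices and the alignment nudge below are assumptions made so the toy head reliably attends to the factual keyword.

```python
import numpy as np

def attention_pattern(x, W_Q, W_K):
    """Return one head's attention pattern: causal softmax over source
    positions of (x W_Q)(x W_K)^T / sqrt(d_head)."""
    q, k = x @ W_Q, x @ W_K
    scores = q @ k.T / np.sqrt(W_Q.shape[1])
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # causal mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["What", "is", "the", "capital", "?", "Answer", "in", "one", "word"]
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
x = rng.normal(size=(len(tokens), d_model))
W_Q = rng.normal(size=(d_model, d_head)) * 0.3
W_K = W_Q.copy()  # tied for simplicity so the alignment nudge is guaranteed
# Hypothetical: give "capital" a component along the final token's direction,
# so this fake head attends to the factual keyword, like a fact-attending head.
x[tokens.index("capital")] += x[-1] * 3.0

pattern = attention_pattern(x, W_Q, W_K)
print("final position attends most to:", tokens[int(np.argmax(pattern[-1]))])
```

In the actual analysis the same readout is taken from the cached patterns of specific heads on the real prompts; a suppressing head would instead place its mass on the "one" and "word" instruction tokens.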

I ran the experiments on Qwen1.5-0.5B-Chat and Qwen3-0.6B because both models exhibit a similar hallucination. The plots for both models are similar, which suggests the hallucination is driven by a similar set of heads and layers. Models like Gemma3-270M and SmolLM-360M also exhibit a similar hallucination, but unfortunately I could not run the experiments on them because of the limited compatibility of the TransformerLens library with those models.

I also noticed that SLMs are not good at instruction following: a plain prompt (without CoT) and a CoT prompt show no difference in the logit lens, and even their attention patterns remain the same. Nevertheless, CoT does improve the model's performance.

High-Level Takeaways


1. Motivation and Background

The reliability of language models is critically dependent on their ability to follow instructions faithfully. While large models are increasingly capable, smaller, more accessible models often exhibit surprising failures. This project investigates one such failure: a consistent, instruction-dependent factual hallucination in SLMs (<1B parameters). This work is inspired by recent advances in identifying and manipulating high-level behaviors via activation engineering, such as the discovery of a "refusal direction" [Arditi et al.]. By contrast, this project explores a different behavioral axis: the model's choice between a deep, algorithmic reasoning process and a shallow, heuristic one. Understanding the mechanism behind this "switch" is a key step toward building models that can be reliably steered towards safer, more robust reasoning.


2. Experimental Setup