angkul’s site



Executive Summary

Chain-of-Thought (CoT) reasoning is often treated as a transparency tool, but models frequently produce "unfaithful" reasoning: post-hoc rationalizations that justify a biased answer rather than explaining the true causal process. This project investigates the mechanistic differences between faithful reasoning and post-hoc rationalization. I hypothesize that while faithful reasoning exhibits a connected causal topology (each step causally influencing future steps), unfaithful reasoning exhibits a "broken" or disconnected topology, in which the reasoning steps are mechanistically irrelevant to the final output.

Using causal interventions on Qwen-2.5-7B-Instruct over a dataset of 218 traces, I present preliminary evidence supporting this hypothesis. A comparative analysis of Sentence-Level Causal Matrices (Fig. 2 & 3) reveals that faithful traces form dense, sequential causal chains, whereas unfaithful traces appear topologically sparse or disconnected. Quantitative analysis (Fig. 4) shows that faithful traces possess significantly higher "Chain Strength" (step-to-step causal impact) than unfaithful ones.
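The "Chain Strength" statistic behind Fig. 4 can be computed directly from a sentence-level causal matrix. A minimal sketch, assuming a matrix `M` where `M[i, j]` is the measured causal impact of sentence `i` on sentence `j`; the toy matrix values below are invented purely for illustration:

```python
import numpy as np

def chain_strength(M: np.ndarray) -> float:
    """Sum of superdiagonal entries M[i, i+1]: the causal impact
    of each reasoning step on its immediate successor."""
    n = M.shape[0]
    return float(sum(M[i, i + 1] for i in range(n - 1)))

# Toy matrices (hypothetical values): a "connected" trace passes
# influence step to step; a "broken" one barely does.
faithful = np.array([
    [0.0, 0.9, 0.2],
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0],
])
unfaithful = np.array([
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0],
])
# faithful scores higher on this toy pair (~1.7 vs ~0.3)
print(chain_strength(faithful), chain_strength(unfaithful))
```

In the actual pipeline, `M[i, j]` would come from a counterfactual intervention (e.g., resampling or ablating sentence `i` and measuring the shift in sentence `j`); the metric itself is just the superdiagonal sum.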

Crucially, I identified a circuit-level mechanism driving this divergence: "Receiver Attention Heads." In faithful traces, these heads attend primarily to the model's own generated reasoning (50.0% of attention mass). In unfaithful traces, they shift significantly more attention toward the biasing context in the prompt (7.8% vs. 4.9%), effectively "ignoring" the generated CoT. This suggests unfaithfulness is not just a behavioral artifact but a distinct circuit-level state in which the model decouples planning from execution. Fig. G shows receiver head attention allocation across the top receiver heads: in faithful CoT traces, receiver heads allocate more attention to the reasoning than they do in unfaithful traces.
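The attention-allocation comparison reduces to measuring what fraction of a head's attention mass lands on each region of the sequence. A sketch, assuming a hypothetical token layout (the span boundaries and the uniform attention vector are invented for the sanity check):

```python
import numpy as np

def attention_allocation(attn: np.ndarray, spans: dict) -> dict:
    """Fraction of a head's attention mass that falls on each
    labelled token span. `attn` is a 1-D vector of attention
    weights (from the answer position) over all prior tokens."""
    total = attn.sum()
    return {name: float(attn[lo:hi].sum() / total)
            for name, (lo, hi) in spans.items()}

# Hypothetical layout: tokens 0-19 are the prompt, 5-9 the
# injected bias hint, 20-49 the generated chain of thought.
attn = np.ones(50)  # a perfectly uniform head, as a sanity check
spans = {"bias_hint": (5, 10), "reasoning": (20, 50)}
print(attention_allocation(attn, spans))
# → {'bias_hint': 0.1, 'reasoning': 0.6}
```

With real attention tensors, the faithful/unfaithful comparison is then just this function evaluated per receiver head and averaged across traces.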

All the data and the notebooks written for this project can be found here: https://github.com/angkul07/Topological-Signatures-of-Deception


Fig. 0: An analysis of causal impact over time reveals that reasoning importance is heavily front-loaded. Regardless of faithfulness, the sentences with the highest outgoing influence (Thought Anchors) cluster in the first quartile of the trace (t < 0.25), with impact decaying linearly as the generation progresses. This suggests that the model's trajectory, whether a valid logical path or a deceptive rationalization, is largely determined by the initial "Problem Setup" and "Planning" steps. Consequently, future safety interventions (e.g., steering vectors or monitors) should prioritize early-stage reasoning, as this is where the causal structure is solidified.
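The front-loading pattern can be checked by ranking sentences by total outgoing influence. A sketch over the same kind of sentence-level causal matrix, where `M[i, j]` holds the impact of sentence `i` on a later sentence `j` (the toy values are invented for illustration):

```python
import numpy as np

def outgoing_influence(M: np.ndarray) -> np.ndarray:
    """Total causal impact each sentence exerts on all later
    sentences (row sums of the strict upper triangle)."""
    return np.triu(M, k=1).sum(axis=1)

def anchor_positions(M: np.ndarray, top_k: int = 1) -> np.ndarray:
    """Normalized positions t in [0, 1) of the top-k
    highest-influence sentences ("thought anchors")."""
    infl = outgoing_influence(M)
    idx = np.argsort(infl)[::-1][:top_k]
    return idx / len(infl)

# Toy trace: an early "Problem Setup" sentence drives everything,
# later sentences only pass weak influence to their successor.
n = 8
M = np.zeros((n, n))
M[0, 1:] = 0.9
for i in range(1, n - 1):
    M[i, i + 1] = 0.3
print(anchor_positions(M))  # → [0.], i.e. the anchor is at t = 0
```

Averaging `anchor_positions` over many traces would reproduce the t < 0.25 clustering the figure describes, if the front-loading claim holds.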

High-Level Takeaways

  1. Unfaithfulness has a Topological Signature: Deceptive reasoning is not just semantically different; it is mechanistically distinct. My analysis of 218 traces confirms that faithful reasoning forms a strongly connected causal graph, where early "Problem Setup" and "Plan" steps act as necessary anchors for "Conclusion" steps. In contrast, unfaithful reasoning forms a sparse, disconnected graph. The reasoning steps in rationalized traces are mechanistically impotent; they exist to satisfy language modeling perplexity constraints but do not drive the computation of the answer.
  2. "Chain Strength" as a Faithfulness Metric: I developed and validated a quantitative metric, Chain Strength (∑ᵢ Mᵢ,ᵢ₊₁), which measures the causal necessity of each step for its immediate successor. Despite the high variance inherent in generative tasks, this metric successfully separated faithful traces (μ = 0.68) from unfaithful ones (μ = 0.40). This suggests that while models are adept at simulating local linguistic coherence in fake reasoning, they fail to maintain the deep mechanistic dependencies required for genuine logic.
  3. Receiver Heads Reveal the Mechanism of Deception: I successfully linked the high-level topological failure to specific attention components. By isolating Receiver Heads (heads with high kurtosis that "listen" to Thought Anchors), I demonstrated a measurable attentional shift. In faithful traces, these heads attend to the model's own reasoning (50.59% of attention mass). In unfaithful traces, attention to the reasoning drops (44.01%) and shifts toward the biased tokens in the prompt (around 4% of attention mass). This suggests that unfaithfulness is mediated by specific circuits that bypass the CoT to retrieve answers from context or memory.
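The kurtosis-based selection of receiver heads can be sketched as follows. The attention vectors below are invented for illustration, and the real pipeline would first aggregate per-sentence attention mass from the model's attention tensors:

```python
import numpy as np

def excess_kurtosis(x) -> float:
    """Fisher (excess) kurtosis: sharply peaked distributions
    score high, near-flat ones score near or below zero."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float((d ** 4).mean() / (d ** 2).mean() ** 2 - 3.0)

def receiver_heads(attn_by_head: np.ndarray, top_k: int = 2):
    """Rank heads by the kurtosis of their attention mass over
    reasoning sentences; a high-kurtosis head concentrates on a
    few sentences, making it a candidate "receiver" head."""
    scores = np.array([excess_kurtosis(h) for h in attn_by_head])
    return np.argsort(scores)[::-1][:top_k]

# Toy heads (illustrative values): head 1 is sharply peaked on one
# sentence, head 0 spreads its attention almost evenly.
flat = np.linspace(0.11, 0.14, 8)
peaked = np.array([0.02, 0.02, 0.85, 0.02, 0.02, 0.02, 0.02, 0.03])
print(receiver_heads(np.stack([flat, peaked]), top_k=1))  # → [1]
```

The attentional-shift comparison then restricts the faithful-vs-unfaithful attention analysis to the heads this selection returns.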

1. Motivation and Background

As reasoning models (like DeepSeek-R1 or OpenAI o1) become more capable, "steered" or "sycophantic" reasoning becomes a critical safety failure mode. If a model generates a benign-looking CoT to justify a harmful outcome, human monitors will be deceived. Existing research (Turpin et al., 2023; Lanham et al., 2023) has shown that models rationalize, but we lack rigorous metrics to detect when it happens without knowing the ground truth.

Building on the "Thought Anchors" framework (Bogdan et al., 2025), I propose moving beyond textual analysis to mechanistic topology. By treating the reasoning trace as a causal graph, we can detect when the "logic" is fake because the causal links between sentences, and the attention heads that maintain them, are broken.

2. Experimental Setup

Model & Data Generation: I utilized Qwen-2.5-7B-Instruct for all experiments due to its strong reasoning capabilities and open-weights architecture. To isolate the variable of "faithfulness," I generated a controlled dataset of N=218 paired reasoning traces (109 Faithful, 109 Unfaithful) using prompts from the "CoT in the Wild" dataset (Arcuschin et al., 2025).
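Paired faithful/unfaithful traces are typically elicited by showing the model the same question with and without a bias hint. A hypothetical sketch in the style of Turpin-et-al.-style "suggested answer" bias; the exact hint wording used in this project is an assumption here:

```python
def make_pair(question: str, options: list[str], hint_idx: int):
    """Build an (unbiased, biased) prompt pair: the biased variant
    appends a sycophancy-style hint naming one option."""
    base = question + "\n" + "\n".join(
        f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    hint = (f"\nI think the answer is ({chr(65 + hint_idx)}), "
            "but I'm curious what you think.")
    return base, base + hint

q = "Which planet is largest?"
clean, biased = make_pair(q, ["Mars", "Jupiter", "Venus"], hint_idx=0)
print(biased)
```

Traces where the biased prompt flips the answer while the CoT never mentions the hint are the candidate "unfaithful" set; the matched unbiased generations form the faithful set.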