Chain-of-Thought (CoT) reasoning is often treated as a transparency tool, but models frequently produce "unfaithful" reasoning-post-hoc rationalizations that justify a biased answer rather than explaining the true causal process. This project investigates the mechanistic differences between faithful reasoning and post-hoc rationalization. I hypothesize that while faithful reasoning exhibits a connected causal topology (steps causally influencing future steps), unfaithful reasoning exhibits a "broken" or disconnected topology, where the reasoning steps are mechanistically irrelevant to the final output.
Using causal interventions on Qwen-2.5-7B-Instruct over a dataset of 218 traces, I present preliminary evidence supporting this hypothesis. A comparative analysis of Sentence-Level Causal Matrices (Fig. 2 & 3) reveals that faithful traces form dense, sequential causal chains, whereas unfaithful traces appear topologically sparse or disconnected. Quantitative analysis (Fig. 4) shows that faithful traces possess significantly higher "Chain Strength" (step-to-step causal impact) than unfaithful ones.
Crucially, I identified a circuit-level mechanism driving this divergence: "Receiver Attention Heads." In faithful traces, these heads attend primarily to the model's own generated reasoning (50.0%). In unfaithful traces, these heads shift their attention significantly toward the biasing context in the prompt (7.8% vs 4.9%), effectively "ignoring" the generated CoT. This suggests unfaithfulness is not just a behavioral artifact but a distinct circuit-level state where the model decouples planning from execution. Fig. G shows the “receiver head attention allocation” across top receiver heads. It proves that for faithful CoT traces receiver heads allocate more attention to reasoning compared to unfaithful CoT traces.
All the data and the notebooks written for this project can be found here: https://github.com/angkul07/Topological-Signatures-of-Deception

Fig. 0: An analysis of causal impact over time reveals that reasoning importance is heavily front-loaded. Regardless of faithfulness, the sentences with the highest outgoing influence (Thought Anchors) cluster in the first quartile of the trace (t < 0.25), with impact decaying linearly as the generation progresses. This suggests that the model’s trajectory-whether a valid logical path or a deceptive rationalization-is largely determined by initial "Problem Setup" and "Planning" steps. Consequently, future safety interventions (e.g., steering vectors or monitors) should prioritize the early-stage reasoning, as this is where the causal structure is solidified.
As reasoning models (like DeepSeek-R1 or OpenAI o1) become more capable, "steered" or "sycophantic" reasoning becomes a critical safety failure mode. If a model generates a benign-looking CoT to justify a harmful outcome, human monitors will be deceived. Existing research (Turpin et al., 2023; Lanham et al., 2023) has shown that models rationalize, but we lack rigorous metrics to detect when it happens without knowing the ground truth.
Building on the "Thought Anchors" framework (Bogdan et al., 2025), I propose moving beyond textual analysis to mechanistic topology. By treating the reasoning trace as a causal graph, we can detect when the "logic" is fake because the causal links between sentences-and the attention heads that maintain them-are broken.
Model & Data Generation: I utilized Qwen-2.5-7B-Instruct for all experiments due to its strong reasoning capabilities and open-weights architecture. To isolate the variable of "faithfulness," I generated a controlled dataset of N=218 paired reasoning traces (109 Faithful, 109 Unfaithful) using prompts from the "CoT in the Wild" dataset (Arcuschin et al., 2025).