Concept Probing in LLMs

Highlight. Six questions, one experiment. Across three scales (Phi-2 2.7B, Qwen-2.5-3B, Qwen-2.5-14B) × two problems (farmer max-area, ball-drop impact velocity), we ran four layer-wise probes on matched pairs of correct and incorrect completions from identical prompts. Relations are linearly decodable from layer ¼ onward at every scale (P3 macro-F1 = 0.95–1.00). Role invariance builds mid-depth and strengthens with scale (rename cosine peak on ball-drop: 0.82 at Phi-2 → 0.88 at Qwen-3B → 0.93 at Qwen-14B) but regresses at the output in every cell where the test applies: at the final layer token identity beats role, with cos(rename) trailing cos(role-swap) by up to 0.17 at 14B. Variables form no emergent cluster structure at any depth in any cell, neither within-variable nor across-category: silhouette is highest at or near the input and decays with depth, so concept information is carried by linear directions, not Euclidean separation. Most importantly, all three pre-registered representational failure hypotheses are rejected: the residual stream’s variable, role, and relation structure is essentially identical on correct and incorrect samples. Errors live downstream of the residual stream, and the behavioral name-bias Qwen-3B shows on the farmer problem is a decoding-stage artefact, not a representational one.

Setup — the matched-pair rig

Three frozen models (Phi-2 2.7B / 32 layers, Qwen-2.5-3B / 36 layers, Qwen-2.5-14B-Instruct / 48 layers) on two problems: farmer (maximize rectangle area given perimeter P) and ball-drop (impact velocity from height h, g = 9.8 m/s²). For each (model, problem) cell, 200 paraphrased prompts vary variable names, role identity, domain, and numeric values. We sample K=32 completions per prompt at T=0.7, top-p=0.95, max-new-tokens=512, and label each sample as correct (final numeric claim matches the analytical gold within tolerance: 1% farmer, 2% ball), incorrect (extractable number but wrong), or off-topic (no extractable final claim).
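The three-way labeling rule can be sketched in a few lines. This is a minimal sketch, not the rig's labeler: the function names, the choice of the last number in the completion as the final numeric claim, and the closed-form golds (a square of side P/4 for farmer, v = sqrt(2gh) for ball-drop) are assumptions made for illustration.

```python
import math
import re

# Relative tolerances from the setup: 1% for farmer, 2% for ball-drop.
TOLERANCE = {"farmer": 0.01, "ball": 0.02}

NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def gold_answer(problem, value, g=9.8):
    """Analytical gold under the assumed problem statements."""
    if problem == "farmer":
        # Rectangle with fixed perimeter P: area is maximal for a square of side P/4.
        return (value / 4.0) ** 2
    if problem == "ball":
        # Free fall from rest at height h: v = sqrt(2 * g * h).
        return math.sqrt(2.0 * g * value)
    raise ValueError(f"unknown problem: {problem}")


def label_sample(problem, value, completion):
    """Return 'correct', 'incorrect', or 'off_topic' for one completion."""
    numbers = NUMBER.findall(completion)
    if not numbers:
        return "off_topic"  # no extractable final claim
    claim = float(numbers[-1])  # assumption: the last number is the final claim
    gold = gold_answer(problem, value)
    if abs(claim - gold) <= TOLERANCE[problem] * abs(gold):
        return "correct"
    return "incorrect"
```

For instance, label_sample("farmer", 40, "The maximum area is 100") falls in the correct bucket, since the assumed gold for P = 40 is 100.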

A matched pair is one correct + one incorrect sample drawn from the same prompt’s 32 samples. For each pair we teacher-force (prompt + generated tokens) and capture all-layer residual streams. The correct side and the incorrect side of each pair come from the same prompt, same model, same weights — only the sampled trajectory differs, so any mechanistic difference localizes where errors arise. Spec thresholds: ≥ 30 pairs floor, ≥ 60 target.
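Pair construction might look like the sketch below. It assumes at most one pair per prompt (consistent with pair counts being reported out of 200) and a uniform random draw within each label; both the function names and the draw policy are illustrative, not taken from the spec.

```python
import random


def draw_matched_pairs(labels_by_prompt, seed=0):
    """Draw at most one (correct, incorrect) pair of sample indices per prompt.

    labels_by_prompt maps a prompt id to the list of labels of its K samples.
    Prompts lacking either a correct or an incorrect sample yield no pair.
    """
    rng = random.Random(seed)
    pairs = {}
    for prompt_id, labels in labels_by_prompt.items():
        correct = [i for i, lab in enumerate(labels) if lab == "correct"]
        incorrect = [i for i, lab in enumerate(labels) if lab == "incorrect"]
        if correct and incorrect:
            pairs[prompt_id] = (rng.choice(correct), rng.choice(incorrect))
    return pairs


def pair_status(n_pairs, floor=30, target=60):
    """Compare a cell's pair count with the spec thresholds."""
    if n_pairs >= target:
        return "target met"
    if n_pairs >= floor:
        return "above floor"
    return "below floor"
```

A prompt whose 32 samples are all correct, or all incorrect plus off-topic, contributes no pair, which is why a cell can fall short of 200.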

Cell	Matched pairs / 200	Status
phi2_farmer	191	≫ target
phi2_ball	200	≫ target
qwen3b_farmer	195	≫ target
qwen3b_ball	46	above floor (labeler artefact, see caveats)
qwen14b_farmer	199	≫ target
qwen14b_ball	167	≫ target

Four probes, each run twice per cell (once on correct residuals, once on incorrect):

P1 — within-variable silhouette. k-means on residuals at variable-mention positions, scored against gold role labels {length, width, area, param} (farmer) or {height, velocity, g} (ball). Spec expected emergence past 0.3 at a middle layer.
P1b — across-category silhouette. 4-class alternative: {variable, operator, number, other}. Backup hypothesis — maybe variables pool as a shared category rather than separating by role.
P2 — cosine invariance. Cosine between per-variable mean residuals across paraphrase pairs differing in one axis: rename (same role, different letter), role-swap (same letter, different role — farmer only), domain-swap (same role+letter, different scenario).
P3 — linear relation decoder. 5-fold GroupKFold logistic regression at operator positions, labels {perimeter-eq, area-eq, bound} (farmer) or {kinematic-eq, sqrt-form} (ball). Reports macro-F1.
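As an illustration of the P2 measurement, the sketch below computes the cosine between per-variable mean residuals for one paraphrase pair at one layer. It uses plain Python lists in place of real residual tensors, and the function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length residual vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def invariance_score(residuals_a, residuals_b):
    """Cosine between the per-variable mean residuals of two paraphrases.

    residuals_a / residuals_b: residual vectors captured at the mention
    positions of one variable in each paraphrase, at a single layer.
    """
    return cosine(mean_vector(residuals_a), mean_vector(residuals_b))
```

Repeating this per layer, and averaging over paraphrase pairs that differ along a single axis, would yield one curve each for rename, role-swap, and domain-swap.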

Q1. Do models represent relations between variables?

Yes — in every cell, on every side, at every scale.

P3 saturates at macro-F1 ∈ [0.95, 1.00] from layer ¼ onward, on both correct and incorrect samples. The residual carries a specific linear direction per relation type that the decoder reads off cleanly regardless of whether the final answer is right.

Cell / side	L0	L¼	L_last	n
phi2_farmer / correct	0.61	0.99	0.98	1086
phi2_farmer / incorrect	0.62	0.98	0.95	1161
phi2_ball / correct	0.36	1.00	1.00	699
phi2_ball / incorrect	0.35	1.00	1.00	763
qwen3b_farmer / correct	0.55	0.95	0.96	788
qwen3b_farmer / incorrect	0.53	0.98	0.98	867
qwen3b_ball / correct	0.43	1.00	0.98	170
qwen3b_ball / incorrect	0.40	1.00	0.99	83
qwen14b_farmer / correct	0.61	1.00	0.99	905
qwen14b_farmer / incorrect	0.62	1.00	1.00	987
qwen14b_ball / correct	0.45	1.00	1.00	940
qwen14b_ball / incorrect	0.44	1.00	1.00	760
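For reference, macro-F1 and group-wise fold assignment can be written out in plain Python. This is a simplified sketch: the round-robin fold assignment stands in for a library GroupKFold (which also balances fold sizes), and treating the prompt id as the group is an assumption, since the text does not say what the groups are.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def group_folds(groups, n_splits=5):
    """Assign each group to one fold so no group straddles train and test."""
    fold_of_group = {g: i % n_splits for i, g in enumerate(sorted(set(groups)))}
    return [fold_of_group[g] for g in groups]
```

Because macro-F1 weights every relation class equally, a decoder cannot score near 1.00 by predicting only the most frequent class.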

Q2. Are symbols represented as role-invariant concepts?

Partially. Role abstraction is built mid-depth and then partially undone at the output.

Concept-consistent prediction: cos(rename) > cos(role_swap) — the length-variable direction should be the same whether the letter is L or x. Token-consistent prediction: the opposite.

Cell	Rename peak (layer)	Rename final	Role-swap final	Domain-swap final
phi2_farmer	0.63 (L½)	0.64	0.75	0.74
phi2_ball	0.82 (L¼)	0.71	—	0.72
qwen3b_farmer	0.88 (L¾)	0.79	0.87	0.86
qwen3b_ball	0.88 (L½)	0.52	—	0.89
qwen14b_farmer	0.87 (L½)	0.77	0.94	0.94
qwen14b_ball	0.93 (L¾)	0.87	—	0.93

Two scale-invariant observations:

Mid-depth role invariance builds through the network. Rename cosine rises from near-zero at L0 to a peak of 0.63–0.93 at mid-to-late layers. Scale raises the peak: on ball, Phi-2 0.82 → Qwen-3B 0.88 → Qwen-14B 0.93; on farmer, the gain is from Phi-2 (0.63) to the Qwen models (0.88 at 3B, 0.87 at 14B).
Output layer: token identity reasserts. Wherever role-swap is measured, cos(rename) < cos(role_swap) at the final layer. The letter beats the role. And the gap widens at 14B on farmer (−0.17), so the regression is not a small-model artefact.
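The final-layer gaps quoted here follow directly from the table above; the short computation below makes the arithmetic explicit. The dictionary layout and function name are illustrative.

```python
# Final-layer cosines copied from the table above (farmer cells only,
# since role-swap is not measured on ball-drop).
FINAL_LAYER = {
    "phi2_farmer": {"rename": 0.64, "role_swap": 0.75},
    "qwen3b_farmer": {"rename": 0.79, "role_swap": 0.87},
    "qwen14b_farmer": {"rename": 0.77, "role_swap": 0.94},
}


def role_gap(cell):
    """cos(rename) - cos(role_swap); negative means token identity wins."""
    row = FINAL_LAYER[cell]
    return round(row["rename"] - row["role_swap"], 2)


gaps = {cell: role_gap(cell) for cell in FINAL_LAYER}
```

All three gaps are negative, and the 14B farmer gap is the largest in magnitude.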

Q3. Do variables form distinct clusters?

No — at any depth, in any cell. Neither within-variable nor across-category.

Within-variable silhouette (P1):

Cell	L0	L¼	L½	L¾	L_last
phi2_farmer	0.32	0.19	0.17	0.12	0.08
phi2_ball	0.64	0.48	0.43	0.36	0.25
qwen3b_farmer	0.44	0.37	0.35	0.28	0.17
qwen3b_ball	0.65	0.52	0.52	0.37	0.11
qwen14b_farmer	0.41	0.36	0.35	0.30	0.18
qwen14b_ball	0.55	0.62	0.60	0.53	0.36

Silhouette peaks at L0 (tokenizer artefact) and declines with depth in every cell except qwen14b_ball, which rises slightly to L¼ before declining. Variables are most distinguishable at input, not after computation.

Across-category silhouette (P1b):

Cell	L0	L¼	L½	L¾	L_last
phi2_farmer	+0.13	−0.09	−0.10	−0.09	+0.03
phi2_ball	+0.18	−0.02	−0.04	−0.03	+0.07
qwen3b_farmer	+0.28	+0.15	+0.15	+0.11	+0.12
qwen3b_ball	+0.20	+0.06	+0.07	+0.04	+0.03
qwen14b_farmer	+0.15	+0.07	+0.07	+0.04	+0.08
qwen14b_ball	+0.15	+0.10	+0.06	+0.05	+0.06

Same shape: peaks at L0, declines with depth. Phi-2 goes negative mid-network on both problems (farmer down to −0.10, ball to −0.04): at those layers a point sits, on average, closer to another of the variable / operator / number / other categories than to its own. No cluster structure emerges in either test.

Synthesis. P1 + P1b rule out distance-separable clusters. P2 + P3 show clean directional structure. Concept information is carried by linear directions in residual space, not by distance-separable clusters. Reading concepts requires cosine and linear-probe tools, not k-means and silhouette.
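A toy example shows how a linear direction can carry a label perfectly while silhouette stays near zero. The data here are synthetic and chosen purely for illustration; nothing in this sketch comes from the probed models.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (two or more classes)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        by_class = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if j != i:
                by_class.setdefault(other, []).append(math.dist(p, q))
        own = by_class.pop(lab, [])
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette is defined as 0
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_class.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: the class lives in a low-variance direction (y = -0.5 vs +0.5),
# while a label-free nuisance direction (x) has much larger variance.
xs = range(-10, 11, 2)
points = [(x, -0.5) for x in xs] + [(x, 0.5) for x in xs]
labels = [0] * len(xs) + [1] * len(xs)

# A linear readout along the direction (0, 1) separates the classes perfectly.
linear_accuracy = sum((y > 0) == bool(lab) for (_, y), lab in zip(points, labels)) / len(points)
toy_silhouette = silhouette(points, labels)
```

The linear readout is perfect, yet the silhouette is close to zero because distances are dominated by the nuisance direction. This is the regime the P1/P1b versus P2/P3 contrast points to.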

Q4. Where do errors come from — representation or downstream?

Downstream of the residual stream. Every probe was run twice per cell, once on correct residuals and once on incorrect. Pre-registered failure hypotheses from probes_design.md:

Hypothesis	Predicted signature	Observed
Binding failure	P1 silhouette drops on incorrect	largest gap ≤ 0.02 in two cells, ≤ 0.09 in all six; rejected
Concept regression	P2 rename cosine drops on incorrect	gap ≤ 0.04 in five of six cells (0.14 on qwen3b_ball); rejected
Relation collapse	P3 F1 drops on incorrect	F1 within 0.03 in every cell; rejected

Concrete gaps per cell:

Cell	Largest P1 gap	P2 rename gap	P3 F1 gap
phi2_farmer	0.01	0.00	0.03
phi2_ball	0.07	0.04	0.00
qwen3b_farmer	0.02	0.01	0.02
qwen3b_ball	0.07	0.14	0.01
qwen14b_farmer	0.05	0.03	0.01
qwen14b_ball	0.09	0.04	0.00

The residual stream’s variable, role, and relation representations are essentially identical on correct and incorrect runs. The two partial exceptions are both ball-drop cells: qwen14b_ball’s P1 gap of 0.09 is a first faint hint of a representational error signature, seen at 14B on one problem only, and qwen3b_ball’s P2 rename gap of 0.14 is notable but sits inside the 46-pair labeler-artefact cell (see caveats).
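The exceptions can be read off mechanically by ranking every gap in the table above. The data structure and helper names below are illustrative; the numbers are copied from the table.

```python
# Correct-vs-incorrect gaps copied from the table above.
GAPS = {
    "phi2_farmer": {"P1": 0.01, "P2": 0.00, "P3": 0.03},
    "phi2_ball": {"P1": 0.07, "P2": 0.04, "P3": 0.00},
    "qwen3b_farmer": {"P1": 0.02, "P2": 0.01, "P3": 0.02},
    "qwen3b_ball": {"P1": 0.07, "P2": 0.14, "P3": 0.01},
    "qwen14b_farmer": {"P1": 0.05, "P2": 0.03, "P3": 0.01},
    "qwen14b_ball": {"P1": 0.09, "P2": 0.04, "P3": 0.00},
}


def ranked_gaps(gaps):
    """All (gap, cell, probe) triples, largest gap first."""
    triples = [(gap, cell, probe) for cell, row in gaps.items() for probe, gap in row.items()]
    return sorted(triples, reverse=True)


def max_gap_per_probe(gaps):
    """Largest gap observed for each probe across all cells."""
    probes = {probe for row in gaps.values() for probe in row}
    return {probe: max(row[probe] for row in gaps.values()) for probe in probes}
```

The two largest entries are the qwen3b_ball P2 gap and the qwen14b_ball P1 gap; every other gap is 0.07 or smaller.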

Q5. Do the findings hold at scale?

Yes for Q1, Q2, Q4; one partial deviation for Q3 on one cell.

Claim	Phi-2 2.7B	Qwen-3B	Qwen-14B	Scale-invariant?
Relations linearly decodable (Q1)	F1 0.99	F1 0.96	F1 1.00	yes
Mid-depth rename cosine rises (Q2, ball peak)	peak 0.82	peak 0.88	peak 0.93	yes, strengthens
Final-layer token wins over role (Q2, farmer)	gap −0.11	gap −0.08	gap −0.17	yes, unchanged or wider
P1 peaks at L0 (Q3)	yes	yes	yes (farmer) / no (ball)	mostly yes
Correct ≈ incorrect representations (Q4)	Δ ≤ 0.07	Δ ≤ 0.07 (except P2 on ball: 0.14)	Δ ≤ 0.09 (max on ball)	mostly yes

Scale strengthens mid-depth concept (rename peak rises) but does not fix the output regression. qwen14b_ball is the one cell showing any 14B-specific deviation.

Q6. Does behavioral name-bias reflect representational name-bias?

No. Behavioral name-bias is a decoding-stage artefact. One cell suffices — qwen3b_farmer.

Measure	Value
Stage 1 behavioral correct-rate spread across name pairs	0.34 (x,y: 46% vs. length,width: 12%)
Stage 2a mechanistic rename cosine peak	0.88 at L¾
Stage 2a gap to role-swap at final layer	−0.08

qwen3b_farmer had the largest behavioral name-bias and the cleanest mechanistic rename invariance. If the representation were name-bound, P2 rename would be low. It is 0.88. The name-bias lives downstream of the residual stream — in how the model commits to surface tokens at decoding time, not in how it represents the problem internally. The same pattern replicates at 14B. This was a conditional the spec pre-registered; it triggered exactly as written.

Caveats

qwen3b_ball headline numbers are a labeler artefact. Visible: 1.1% correct, 46 / 200 matched, 70% off-topic. This is not a Qwen-3B failure on ball-drop. 82% of “off-topic” samples contain a correct \boxed{X} that the labeler regex misses; 65% hit the 512-token cap. Qwen emits answers in \[ \boxed{49.5} \] m/s form; the labeler required v = N or m/s suffix. Mechanistic probes still return clean signals on the 46 pairs retained (P3 F1 = 1.00). A labeler patch plus a longer max_new_tokens would lift visible matches past 150.
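One possible shape for the labeler patch is sketched below. The patterns are hypothetical (the original labeler's regex is not shown here), and preferring a boxed answer over the other two forms is an assumption.

```python
import re

# Hypothetical patterns; the original labeler's regex is not shown in the text.
BOXED = re.compile(r"\\boxed\{\s*([-+]?\d+(?:\.\d+)?)\s*\}")
V_EQUALS = re.compile(r"\bv\s*=\s*([-+]?\d+(?:\.\d+)?)")
WITH_UNIT = re.compile(r"([-+]?\d+(?:\.\d+)?)\s*m/s\b")


def extract_final_claim(completion):
    """Return the last numeric claim found, preferring \\boxed{...}; None if absent."""
    for pattern in (BOXED, V_EQUALS, WITH_UNIT):
        matches = pattern.findall(completion)
        if matches:
            return float(matches[-1])
    return None
```

With these patterns, a completion ending in \[ \boxed{49.5} \] m/s yields 49.5 instead of being dropped as off-topic. Samples truncated at the token cap before any final claim would still return None.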

Phi-2 off-topic is 100% truncation. Every Phi-2 off_topic sample hits the 512-token cap (farmer 2617/2617, ball 473/473). Phi-2 is verbose; the final numeric claim doesn’t fit. Correct / incorrect counts are floors, not ceilings.

What this gives me — and what it points to next

The concept-level picture is settled for this problem class at this scale range: relations are linear, variables are not clusters, roles abstract mid-depth but regress at output, and the residual stream’s representation of the problem is essentially identical on failing and succeeding runs. Whatever breaks must live downstream — in MLP arithmetic, value binding, or the final decoding step. Four next moves argued for by these findings:

Localize the downstream failure. Layer-wise logit-lens on the final 4 layers, restricted to incorrect runs, should identify where a correct residual turns into a wrong output token.
Causal patching correct → incorrect at the candidate error layer. If patching the residual at layer ℓ flips the emitted number, the corruption step sits at ℓ.
Labeler patch + re-label for qwen3b_ball and qwen14b_ball. Text-only, ~1 min compute; restores ≥ 150 matched pairs per cell for downstream stages.
SAE feature discovery targeted at the late-layer corruption point, not abstract “concept features.”
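The logit-lens step in the first item could be prototyped along the lines below. This is a toy sketch with plain lists standing in for hidden states and the unembedding matrix. It omits the final normalization that a real logit lens would apply before unembedding, and all names are hypothetical.

```python
def logits(hidden, unembed):
    """Project one hidden state through an unembedding matrix (vocab x d_model)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]


def top_token(hidden, unembed):
    """Index of the highest-logit token for one hidden state."""
    scores = logits(hidden, unembed)
    return max(range(len(scores)), key=scores.__getitem__)


def lens_trace(run, unembed, n_last=4):
    """(layer, top token) for the last n_last layers of one run.

    run: list of per-layer hidden states at the answer position.
    """
    start = max(0, len(run) - n_last)
    return [(layer, top_token(run[layer], unembed)) for layer in range(start, len(run))]


def first_layer_off_gold(run, unembed, gold_token, n_last=4):
    """First traced layer whose top-1 readout is not the gold token, else None."""
    for layer, token in lens_trace(run, unembed, n_last):
        if token != gold_token:
            return layer
    return None
```

On an incorrect run, the layer returned by first_layer_off_gold would be a natural candidate for the causal-patching step in the second item.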