PRISM Scores

The north star

A good prompt — and a good sub-session — reaches the goal using the minimum necessary tokens, turns, human think time, and response latency, given the current session context.

Everything Prism measures traces back to those four fundamental values. They drive four different improvement levers, so Prism never collapses them into a single “PRISM score”. Instead, each of the three personal scores answers a different question.

Score	Question	Unit
Speed	How much time do you spend in focused AI-assisted coding?	Hours/week (with a quality discount)
Skill	How well do you direct the AI and recover from mistakes?	0–100
Efficiency	How many tokens does each active hour cost?	Tokens per active hour (lower is better)

All three are shown on the dashboard hub at /prism, with drill-down pages for each.

Skill

Route: /prism/skill

Skill is a weighted blend of five per-session signals:

Skill = 100 × (0.45·SSE + 0.20·PES + 0.15·IE + 0.10·CRR + 0.10·FC)

Input	Weight	What it measures
SSE — Sub-Session Efficiency	45%	Outcome: did this sub-session reach its goal with minimum tokens/turns/time?
PES — Prompt Efficiency	20%	Per-prompt quality (4-dim rubric: Context Leverage, Information Density, Turn Economy, Ambiguity Cost)
IE — Iteration Efficiency	15%	Turns per active CLI hour — do prompts converge or circle back?
CRR — Context Reset Rate	10%	How often you `/compact` or `/clear` — not too rarely, not too often
FC — Flow Continuity	10%	Minutes of uninterrupted focused coding per session

SSE carries the heaviest weight because it’s the direct outcome. PES is the most coachable predictor — the fastest way to move your Skill score. IE, CRR, and FC are behavioral proxies whose effects also show up inside SSE, so they’re de-weighted to avoid double-counting.

All five inputs are normalized to 0–1 before the weighted sum, then scaled to 0–100. SSE, PES, and IE arrive on a 0–10 scale. CRR (a raw ratio) and FC (raw minutes) are mapped through scaling functions first.
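As a minimal sketch of the weighted blend described above: the function below divides the three 0–10 inputs by 10 and takes CRR and FC as values already mapped to 0–1. The names `skill_score`, `crr_norm`, and `fc_norm` are hypothetical, and the CRR and FC scaling functions themselves are not reproduced because this page does not define them.

```python
# Weights copied from the Skill formula above.
WEIGHTS = {"SSE": 0.45, "PES": 0.20, "IE": 0.15, "CRR": 0.10, "FC": 0.10}


def skill_score(sse: float, pes: float, ie: float,
                crr_norm: float, fc_norm: float) -> float:
    """Blend the five inputs into a 0-100 Skill score.

    sse, pes, ie: values on the 0-10 scale.
    crr_norm, fc_norm: assumed to be already mapped to 0-1 by the
    scaling functions mentioned in the text (not shown here).
    """
    normalized = {
        "SSE": sse / 10.0,
        "PES": pes / 10.0,
        "IE": ie / 10.0,
        "CRR": crr_norm,
        "FC": fc_norm,
    }
    for name, value in normalized.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is outside 0-1 after normalization: {value}")
    # Weighted sum on the 0-1 scale, then scaled to 0-100.
    return 100.0 * sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)


example = skill_score(sse=8.0, pes=7.0, ie=6.0, crr_norm=0.5, fc_norm=0.9)
print(round(example, 1))
```

With these illustrative inputs the weighted terms are 0.36, 0.14, 0.09, 0.05, and 0.09, which sum to 0.73, so the score is 73. SSE alone contributes almost half of that total.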

See Prompt Efficiency, Sub-Session Efficiency, and Iteration Efficiency for the full per-input rubrics.

Skill tiers

Range	Tier
90–100	Elite
70–90	Expert
50–70	Proficient
30–50	Practitioner
0–30	Novice
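The tier table can be read as a simple lookup. The sketch below assumes each tier's lower bound is inclusive (so exactly 90 reads as Elite and exactly 70 as Expert); the table does not state which tier owns a boundary value, and `skill_tier` is a hypothetical helper name.

```python
# (lower bound, tier), highest first.
# Assumption: lower bounds are inclusive; the table leaves this unspecified.
SKILL_TIERS = [
    (90, "Elite"),
    (70, "Expert"),
    (50, "Proficient"),
    (30, "Practitioner"),
    (0, "Novice"),
]


def skill_tier(score: float) -> str:
    """Return the tier name for a 0-100 Skill score."""
    if not 0 <= score <= 100:
        raise ValueError(f"Skill score must be within 0-100, got {score}")
    for lower_bound, tier in SKILL_TIERS:
        if score >= lower_bound:
            return tier
    raise AssertionError("unreachable: the last tier starts at 0")


print(skill_tier(73))
```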

Speed

Route: /prism/speed

Speed_raw   = active_cli_seconds / 3600                         (hours/week)
Speed_final = Speed_raw × max(0.5, 1 − max(0, 0.85 − QR))

Speed is the absolute number of hours per week of focused AI-assisted session time, discounted by one percentage point for every point that Quality Retention (QR) falls below 85%. The discount floors at 50% of raw hours, which is reached once QR drops to 35% or lower: “ship fast, churn faster” never reads as zero, but it also never reads as pure wall time.

Bands (weekly view):

Range	Band
30h+	Heavy
15–30h	Steady
5–15h	Light
< 5h	Low
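To make the discount and the bands concrete, here is a small sketch of both. It assumes QR is passed as a fraction between 0 and 1, that band lower bounds are inclusive, and that the band is read from the discounted value; none of these details are stated on this page, and the function names are hypothetical.

```python
def speed_final(active_cli_seconds: float, qr: float) -> float:
    """Quality-discounted hours for one week of activity.

    qr: Quality Retention as a fraction in [0, 1] (assumed representation).
    """
    if not 0.0 <= qr <= 1.0:
        raise ValueError(f"QR must be within 0-1, got {qr}")
    speed_raw = active_cli_seconds / 3600.0
    # No discount at QR >= 0.85; the multiplier never drops below 0.5.
    multiplier = max(0.5, 1.0 - max(0.0, 0.85 - qr))
    return speed_raw * multiplier


def speed_band(hours: float) -> str:
    """Map weekly hours to a band; lower bounds are assumed inclusive."""
    if hours >= 30:
        return "Heavy"
    if hours >= 15:
        return "Steady"
    if hours >= 5:
        return "Light"
    return "Low"


week_seconds = 20 * 3600  # 20 raw hours of active CLI time
for qr in (0.90, 0.75, 0.20):
    hours = speed_final(week_seconds, qr)
    print(qr, round(hours, 2), speed_band(hours))
```

For 20 raw hours, a QR of 0.90 leaves the value untouched, a QR of 0.75 applies a 0.90 multiplier (18 hours), and a QR of 0.20 hits the 0.5 floor (10 hours).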

Efficiency

Route: /prism/efficiency

Efficiency = Σ(tokens) / Σ(active_cli_hours)

Tokens spent per active hour. Lower is better — fewer tokens for the same time-on-task means clearer prompts, better context leverage, and less exploratory tool use. Efficiency moves automatically when you improve Context Leverage and Information Density at the prompt level.
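Because the formula divides a sum by a sum, long sessions weigh more than short ones. The sketch below illustrates that reading; the `(tokens, active_cli_hours)` tuple layout and the name `efficiency` are assumptions made for the example.

```python
from typing import Iterable, Tuple


def efficiency(sessions: Iterable[Tuple[int, float]]) -> float:
    """Tokens per active hour across (tokens, active_cli_hours) pairs.

    Computes sum(tokens) / sum(hours), not the mean of per-session ratios.
    """
    total_tokens = 0
    total_hours = 0.0
    for tokens, hours in sessions:
        total_tokens += tokens
        total_hours += hours
    if total_hours <= 0:
        raise ValueError("no active hours recorded")
    return total_tokens / total_hours


week = [(120_000, 2.0), (30_000, 1.0)]  # illustrative numbers only
print(efficiency(week))
```

Here the two sessions run at 60,000 and 30,000 tokens per hour. Their plain average would be 45,000, but the ratio of sums is 50,000 because the two-hour session counts twice as much as the one-hour session.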

Session vs. sub-session

Prism evaluates work at two different units:

A session is the whole Claude Code run — the span between /clear events or a process lifetime.
A sub-session is one coherent goal inside a session: from the first prompt on that goal to completion or abandonment.

One session typically contains many sub-sessions. A whole session is too coarse — it mixes debugging, authoring, and planning into one number. A single prompt is too fine — you can’t measure “did it reach the goal” per prompt. One goal = one sub-session = one efficiency score.

Sub-session boundaries are detected by the engine from explicit signals (/clear, /compact, session-id transitions), implicit signals (30-minute idle gap, intent change, file-context pivot), and completion signals (successful verification, explicit new-task prompt).
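A rough sketch of how two of these signals could split a prompt stream is shown below. It is not the engine's actual logic: it handles only the /clear and /compact commands and the 30-minute idle gap, and it leaves out session-id transitions, intent change, file-context pivots, and completion signals. The `Prompt` type and `split_sub_sessions` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

IDLE_GAP_SECONDS = 30 * 60            # implicit signal: 30-minute idle gap
RESET_COMMANDS = ("/clear", "/compact")  # explicit signals


@dataclass
class Prompt:
    timestamp: float  # seconds since the start of the log
    text: str


def split_sub_sessions(prompts: List[Prompt]) -> List[List[Prompt]]:
    """Group time-ordered prompts into sub-sessions using two signals only."""
    groups: List[List[Prompt]] = []
    current: List[Prompt] = []
    previous: Optional[Prompt] = None
    for prompt in prompts:
        words = prompt.text.split()
        is_reset = bool(words) and words[0] in RESET_COMMANDS
        is_idle = (previous is not None
                   and prompt.timestamp - previous.timestamp >= IDLE_GAP_SECONDS)
        if current and (is_reset or is_idle):
            groups.append(current)  # close the sub-session in progress
            current = []
        if not is_reset:
            current.append(prompt)  # reset commands are boundaries, not work
        previous = prompt
    if current:
        groups.append(current)
    return groups


log = [
    Prompt(0, "fix the failing login test"),
    Prompt(60, "run the test suite again"),
    Prompt(120, "/clear"),
    Prompt(180, "write docs for the auth module"),
    Prompt(2100, "plan the billing refactor"),
]
for group in split_sub_sessions(log):
    print([p.text for p in group])
```

In this example the /clear closes the first sub-session, and the 32-minute pause before the last prompt closes the second, giving three sub-sessions in total.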

Letter grade

All scores can also be read as a letter grade. The same 10-tier table is used everywhere in the dashboard — B is baseline (team average lands here), not a middling number.

Grade	Score range (0–10)	Meaning
A+	9.5 – 10.0	Exemplary — reference prompt
A	9.0 – 9.4	Excellent — minimum path in context
A−	8.5 – 8.9	Strong — one dimension could tighten
B+	8.0 – 8.4	Above baseline
B	7.0 – 7.9	Baseline — clear, actionable, production-ready
B−	6.5 – 6.9	Just at baseline
C+	6.0 – 6.4	Minor gaps — likely needs one iteration
C	5.0 – 5.9	Noticeable gaps — coaching signal
D	3.0 – 4.9	Below standard — rewrite expected
F	0.0 – 2.9	Inadequate — high risk of wasted output

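B spans a full 1.0 point because most real prompts land in that range; narrower bands around baseline would flicker between adjacent grades for the same underlying behavior. The grades from A+ through B+, along with B− and C+, each span 0.5. C also spans 1.0, while D and F cover 2.0 and 3.0 points respectively.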
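The grade table can be applied with a lookup on each band's lower bound. The sketch below assumes every band runs from its lower bound up to the next band's lower bound (so 9.45 reads as A); the table lists one-decimal ranges and does not say how values in between are handled. `letter_grade` is a hypothetical helper name.

```python
# (lower bound, grade), highest first, taken from the table above.
# ASCII "-" stands in for the minus sign used in the table.
GRADE_FLOORS = [
    (9.5, "A+"), (9.0, "A"), (8.5, "A-"), (8.0, "B+"), (7.0, "B"),
    (6.5, "B-"), (6.0, "C+"), (5.0, "C"), (3.0, "D"), (0.0, "F"),
]


def letter_grade(score: float) -> str:
    """Return the letter grade for a score on the 0-10 scale."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score must be within 0-10, got {score}")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    raise AssertionError("unreachable: the last band starts at 0.0")


for value in (9.6, 7.4, 6.9, 4.2):
    print(value, letter_grade(value))
```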

Why three scores, not one

A pilot’s cockpit doesn’t have a “flight score” — it has altitude, airspeed, heading, and fuel, four instruments for four independent questions. Prism is the same: you can be fast but sloppy, skilled but slow, or efficient per token while spending very few hours. A single number hides that trade-off; three scores make it explicit.