PRISM Scores
The north star
A good prompt, and a good sub-session, reaches the goal using the minimum necessary tokens, turns, human think time, and response latency, given the current session context.
Everything Prism measures traces back to those four fundamental values. They drive four different improvement levers, so Prism never collapses them into a single “PRISM score”. Instead, each personal score answers a different question.
| Score | Question | Unit |
|---|---|---|
| Speed | How much time do you spend in focused AI-assisted coding? | Hours/week (with a quality discount) |
| Skill | How well do you direct the AI and recover from mistakes? | 0–100 |
| Efficiency | How many tokens does each active hour cost? | Tokens per active hour (lower is better) |
All three are shown on the dashboard hub at /prism, with drill-down pages for each.
Skill
Route: /prism/skill
Skill is a weighted blend of five per-session signals:
Skill = 100 × (0.45·SSE + 0.20·PES + 0.15·IE + 0.10·CRR + 0.10·FC)
| Input | Weight | What it measures |
|---|---|---|
| SSE — Sub-Session Efficiency | 45% | Outcome: did this sub-session reach its goal with minimum tokens/turns/time? |
| PES — Prompt Efficiency | 20% | Per-prompt quality (4-dim rubric: Context Leverage, Information Density, Turn Economy, Ambiguity Cost) |
| IE — Iteration Efficiency | 15% | Turns per active CLI hour — do prompts converge or circle back? |
| CRR — Context Reset Rate | 10% | How often you /compact or /clear — not too rarely, not too often |
| FC — Flow Continuity | 10% | Minutes of uninterrupted focused coding per session |
SSE carries the heaviest weight because it’s the direct outcome. PES is the most coachable predictor — the fastest way to move your Skill score. IE, CRR, and FC are behavioral proxies whose effects also show up inside SSE, so they’re de-weighted to avoid double-counting.
All five inputs are normalized to 0–1 before the weighted sum, then scaled to 0–100. SSE, PES, and IE arrive on a 0–10 scale. CRR (a raw ratio) and FC (raw minutes) are mapped through scaling functions first.
See Prompt Efficiency, Sub-Session Efficiency, and Iteration Efficiency for the full per-input rubrics.
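As a rough illustration of the weighted blend above, the sketch below normalizes the three 0–10 inputs and applies the published weights. The function name `skill_score` and the parameters `crr_norm` and `fc_norm` are hypothetical; the scaling functions for CRR and FC are not specified here, so the sketch assumes both have already been mapped to 0–1 upstream.

```python
# Weights taken from the Skill formula in this section.
WEIGHTS = {"SSE": 0.45, "PES": 0.20, "IE": 0.15, "CRR": 0.10, "FC": 0.10}


def skill_score(sse: float, pes: float, ie: float,
                crr_norm: float, fc_norm: float) -> float:
    """Blend the five inputs into a 0-100 Skill score.

    sse, pes, ie are on the 0-10 scale.
    crr_norm, fc_norm are ASSUMED to be already scaled to 0-1
    (the real scaling functions are not documented here).
    """
    inputs = {
        "SSE": sse / 10.0,
        "PES": pes / 10.0,
        "IE": ie / 10.0,
        "CRR": crr_norm,
        "FC": fc_norm,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is outside the normalized 0-1 range: {value}")
    return 100.0 * sum(WEIGHTS[name] * value for name, value in inputs.items())


# Example: strong outcome, decent prompts, average reset behavior.
print(round(skill_score(sse=8, pes=7, ie=6, crr_norm=0.5, fc_norm=0.9), 1))
```

Because SSE carries 45% of the weight, a one-point change in SSE moves the result by 4.5 points, versus 2.0 points for the same change in PES.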
Skill tiers
| Range | Tier |
|---|---|
| 90–100 | Elite |
| 70–90 | Expert |
| 50–70 | Proficient |
| 30–50 | Practitioner |
| 0–30 | Novice |
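The tier table can be read as a simple threshold lookup. The sketch below is one possible reading, not the dashboard's implementation: the ranges in the table share their endpoints (for example, 90 appears in two rows), so it assumes each lower bound is inclusive. The name `skill_tier` is hypothetical.

```python
# (lower bound, tier) pairs from the table, highest first.
# ASSUMPTION: lower bounds are inclusive, so a score of exactly 90 is Elite.
SKILL_TIERS = [
    (90, "Elite"),
    (70, "Expert"),
    (50, "Proficient"),
    (30, "Practitioner"),
    (0, "Novice"),
]


def skill_tier(score: float) -> str:
    """Map a 0-100 Skill score to its tier name."""
    if not 0 <= score <= 100:
        raise ValueError(f"Skill score must be within 0-100, got {score}")
    for lower_bound, tier in SKILL_TIERS:
        if score >= lower_bound:
            return tier
    raise AssertionError("unreachable: every valid score is >= 0")


print(skill_tier(73))
```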
Speed
Route: /prism/speed
Speed_raw = active_cli_seconds / 3600 (hours/week)
Speed_final = Speed_raw × max(0.5, 1 − max(0, 0.85 − QR))
Absolute hours per week of focused AI-assisted session time, discounted when Quality Retention (QR) falls below 85%. The discount floors at 50% of raw hours, so “ship fast, churn faster” never reads as zero, but it also never reads as pure wall time.
Bands (weekly view):
| Range | Band |
|---|---|
| 30h+ | Heavy |
| 15–30h | Steady |
| 5–15h | Light |
| < 5h | Low |
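To make the quality discount concrete, here is a minimal sketch of the two Speed formulas and the weekly bands. It assumes QR is expressed as a fraction (0.85 for 85%), that `active_cli_seconds` is already summed over the week, and that the bands apply to the discounted value with inclusive lower bounds; none of these details are confirmed above. The function names are hypothetical.

```python
def speed_final(active_cli_seconds: float, qr: float) -> float:
    """Weekly focused hours, discounted when Quality Retention drops below 0.85."""
    speed_raw = active_cli_seconds / 3600           # seconds -> hours
    discount = max(0.5, 1 - max(0.0, 0.85 - qr))    # never below 50% of raw hours
    return speed_raw * discount


def speed_band(hours_per_week: float) -> str:
    """Weekly band lookup. ASSUMPTION: lower bounds are inclusive."""
    if hours_per_week >= 30:
        return "Heavy"
    if hours_per_week >= 15:
        return "Steady"
    if hours_per_week >= 5:
        return "Light"
    return "Low"


# 20 raw hours in a week at three quality levels.
for qr in (0.90, 0.75, 0.20):
    hours = speed_final(20 * 3600, qr)
    print(qr, round(hours, 2), speed_band(hours))
```

With QR at or above 0.85 the raw hours pass through unchanged; each point of QR below that removes one percent of the hours until the 50% floor is reached at a QR of 0.35.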
Efficiency
Route: /prism/efficiency
Efficiency = Σ(tokens) / Σ(active_cli_hours)
Tokens spent per active hour. Lower is better: fewer tokens for the same time-on-task means clearer prompts, better context leverage, and less exploratory tool use. Efficiency moves automatically when you improve Context Leverage and Information Density at the prompt level.
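The formula is a ratio of sums, not an average of per-session ratios, which matters when sessions differ in length. The sketch below illustrates that reading; the function name `efficiency` and the `(tokens, active_cli_hours)` input shape are assumptions made for the example.

```python
def efficiency(sessions: list[tuple[int, float]]) -> float:
    """Tokens per active hour across sessions.

    sessions: (tokens, active_cli_hours) pairs (hypothetical input shape).
    Computes sum(tokens) / sum(hours), so long sessions weigh more than short ones.
    """
    total_tokens = sum(tokens for tokens, _ in sessions)
    total_hours = sum(hours for _, hours in sessions)
    if total_hours <= 0:
        raise ValueError("at least some active CLI time is required")
    return total_tokens / total_hours


week = [(120_000, 2.0), (30_000, 1.0)]
print(efficiency(week))  # ratio of sums
print(sum(t / h for t, h in week) / len(week))  # mean of per-session ratios, for contrast
```

For this sample week the ratio of sums is 50,000 tokens per hour, while the mean of the per-session ratios would be 45,000, because the second calculation gives the short session the same weight as the long one.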
Session vs. sub-session
Prism evaluates work in two different units:
- A session is the whole Claude Code run: the span between /clear events or a process lifetime.
- A sub-session is one coherent goal inside a session: from the first prompt on that goal to completion or abandonment.
One session typically contains many sub-sessions. A whole session is too coarse — it mixes debugging, authoring, and planning into one number. A single prompt is too fine — you can’t measure “did it reach the goal” per prompt. One goal = one sub-session = one efficiency score.
Sub-session boundaries are detected by the engine from explicit signals (/clear, /compact, session-id transitions), implicit signals (30-minute idle gap, intent change, file-context pivot), and completion signals (successful verification, explicit new-task prompt).
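Two of the signals listed above are simple enough to sketch: the explicit /clear and /compact commands and the 30-minute idle gap. The code below is only an illustration under stated assumptions, not the engine's detector. The `(timestamp, text)` input shape and the name `split_sub_sessions` are hypothetical, and intent change, file-context pivot, session-id transitions, and completion signals are deliberately left out.

```python
from datetime import datetime, timedelta

IDLE_GAP = timedelta(minutes=30)            # implicit signal from the text
RESET_COMMANDS = ("/clear", "/compact")     # explicit signals from the text


def split_sub_sessions(prompts):
    """Group time-ordered (timestamp, text) prompts into sub-sessions.

    A new group starts after a reset command or after an idle gap of
    30 minutes or more. Reset commands themselves are not kept as prompts.
    """
    groups, current, last_ts = [], [], None
    for ts, text in prompts:
        is_reset = text.strip().startswith(RESET_COMMANDS)
        is_idle = last_ts is not None and ts - last_ts >= IDLE_GAP
        if (is_reset or is_idle) and current:
            groups.append(current)
            current = []
        if not is_reset:
            current.append((ts, text))
        last_ts = ts
    if current:
        groups.append(current)
    return groups


start = datetime(2024, 1, 1, 9, 0)
log = [
    (start, "fix the failing login test"),
    (start + timedelta(minutes=5), "run the test suite again"),
    (start + timedelta(minutes=50), "draft the changelog entry"),
    (start + timedelta(minutes=55), "/clear"),
    (start + timedelta(minutes=56), "plan the cache refactor"),
]
print([len(group) for group in split_sub_sessions(log)])
```

In this sample log the 45-minute pause and the /clear each close a sub-session, so the five entries fall into three groups.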
Letter grade
All scores can also be read as a letter grade. The same 10-tier table is used everywhere in the dashboard. B is baseline (the team average lands here), not a middling number.
| Grade | Score range (0–10) | Meaning |
|---|---|---|
| A+ | 9.5 – 10.0 | Exemplary — reference prompt |
| A | 9.0 – 9.4 | Excellent — minimum path in context |
| A− | 8.5 – 8.9 | Strong — one dimension could tighten |
| B+ | 8.0 – 8.4 | Above baseline |
| B | 7.0 – 7.9 | Baseline — clear, actionable, production-ready |
| B− | 6.5 – 6.9 | Just at baseline |
| C+ | 6.0 – 6.4 | Minor gaps — likely needs one iteration |
| C | 5.0 – 5.9 | Noticeable gaps — coaching signal |
| D | 3.0 – 4.9 | Below standard — rewrite expected |
| F | 0.0 – 2.9 | Inadequate — high risk of wasted output |
B spans a full 1.0 point because most real prompts land in that range; narrower bands around baseline would flicker between adjacent grades for the same underlying behavior. The neighboring bands (B+ and B−) and the A grades each span 0.5, while D spans 2.0 points.
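The grade table can be applied as a threshold lookup on the 0–10 scale. The sketch below is one possible reading: it assumes lower bounds are inclusive and that a score falling between two listed bands (for example, 9.45) takes the lower grade. The name `letter_grade` is hypothetical.

```python
# (lower bound, grade) pairs from the letter-grade table, highest first.
# ASSUMPTION: lower bounds are inclusive; values between bands take the lower grade.
GRADE_BOUNDS = [
    (9.5, "A+"), (9.0, "A"), (8.5, "A−"),
    (8.0, "B+"), (7.0, "B"), (6.5, "B−"),
    (6.0, "C+"), (5.0, "C"),
    (3.0, "D"), (0.0, "F"),
]


def letter_grade(score: float) -> str:
    """Map a 0-10 score to its letter grade."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be within 0-10, got {score}")
    for lower_bound, grade in GRADE_BOUNDS:
        if score >= lower_bound:
            return grade
    raise AssertionError("unreachable: every valid score is >= 0")


for sample in (9.6, 7.4, 5.2, 2.0):
    print(sample, letter_grade(sample))
```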
Why three scores, not one
A pilot’s cockpit doesn’t have a “flight score”. It has altitude, airspeed, heading, and fuel: four instruments for four independent questions. Prism is the same: you can be fast but sloppy, skilled but slow, or efficient per token while spending very few hours. A single number hides that trade-off; three scores make it explicit.