PRISM Scores

A good prompt — and a good sub-session — reaches the goal using the minimum necessary tokens, turns, human think time, and response latency, given the current session context.

Everything Prism measures traces back to those four fundamental values. They drive four different improvement levers, so Prism never collapses them into a single “PRISM score”. Instead, each personal score answers a different question.

| Score | Question | Unit |
|---|---|---|
| Speed | How much time do you spend in focused AI-assisted coding? | Hours/week (with a quality discount) |
| Skill | How well do you direct the AI and recover from mistakes? | 0–100 |
| Efficiency | How many tokens does each active hour cost? | Tokens per active hour (lower is better) |

All three are shown on the dashboard hub at /prism, with drill-down pages for each.

Route: /prism/skill

Skill is a weighted blend of five per-session signals:

Skill = 100 × (0.45·SSE + 0.20·PES + 0.15·IE + 0.10·CRR + 0.10·FC)

| Input | Weight | What it measures |
|---|---|---|
| SSE (Sub-Session Efficiency) | 45% | Outcome: did this sub-session reach its goal with minimum tokens/turns/time? |
| PES (Prompt Efficiency) | 20% | Per-prompt quality (4-dim rubric: Context Leverage, Information Density, Turn Economy, Ambiguity Cost) |
| IE (Iteration Efficiency) | 15% | Turns per active CLI hour: do prompts converge or circle back? |
| CRR (Context Reset Rate) | 10% | How often you /compact or /clear (not too rarely, not too often) |
| FC (Flow Continuity) | 10% | Minutes of uninterrupted focused coding per session |

SSE carries the heaviest weight because it’s the direct outcome. PES is the most coachable predictor — the fastest way to move your Skill score. IE, CRR, and FC are behavioral proxies whose effects also show up inside SSE, so they’re de-weighted to avoid double-counting.

All five inputs are normalized to 0–1 before the weighted sum, then scaled to 0–100. SSE, PES, and IE arrive on a 0–10 scale. CRR (a raw ratio) and FC (raw minutes) are mapped through scaling functions first.
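As a rough illustration of the normalize-then-weight step, the sketch below computes Skill from the five inputs. The function name `skill_score` is hypothetical, and because the scaling functions for CRR and FC are not specified here, the sketch assumes those two values have already been mapped to 0–1 by the caller.

```python
# Weights taken from the Skill formula above.
WEIGHTS = {"SSE": 0.45, "PES": 0.20, "IE": 0.15, "CRR": 0.10, "FC": 0.10}


def skill_score(sse, pes, ie, crr_norm, fc_norm):
    """Return Skill on a 0-100 scale.

    sse, pes, ie: scores on the 0-10 scale.
    crr_norm, fc_norm: ASSUMED to be already scaled to 0-1; the real
    scaling functions for CRR and FC are not shown in this document.
    """
    normalized = {
        "SSE": sse / 10.0,
        "PES": pes / 10.0,
        "IE": ie / 10.0,
        "CRR": crr_norm,
        "FC": fc_norm,
    }
    for name, value in normalized.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is outside the expected range")
    # Weighted sum of 0-1 inputs, scaled to 0-100.
    return 100.0 * sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)


# SSE 8.0, PES 7.0, IE 6.0, CRR 0.5, FC 0.7 gives roughly 71.
example = skill_score(8.0, 7.0, 6.0, 0.5, 0.7)
```

Because the weights sum to 1.0, a perfect score on every input yields exactly 100, and a one-point gain in SSE (on its 0–10 scale) moves Skill by 4.5 points.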

See Prompt Efficiency, Sub-Session Efficiency, and Iteration Efficiency for the full per-input rubrics.

| Range | Tier |
|---|---|
| 90–100 | Elite |
| 70–90 | Expert |
| 50–70 | Proficient |
| 30–50 | Practitioner |
| 0–30 | Novice |
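The tier lookup can be sketched as a simple threshold scan. The table's ranges share their endpoints (for example, 70 appears in two rows), so this sketch assumes a boundary score belongs to the higher tier; that choice and the name `skill_tier` are assumptions, not documented behavior.

```python
# (lower bound, tier) pairs from the table above, highest first.
TIERS = [
    (90, "Elite"),
    (70, "Expert"),
    (50, "Proficient"),
    (30, "Practitioner"),
    (0, "Novice"),
]


def skill_tier(score):
    """Map a 0-100 Skill score to its tier name.

    ASSUMPTION: a score exactly on a shared boundary (e.g. 70) is
    assigned to the higher tier.
    """
    if not 0 <= score <= 100:
        raise ValueError("Skill score must be between 0 and 100")
    for lower_bound, tier in TIERS:
        if score >= lower_bound:
            return tier
```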

Route: /prism/speed

Speed_raw = active_cli_seconds / 3600 (hours/week)
Speed_final = Speed_raw × max(0.5, 1 − max(0, 0.85 − QR))

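Speed is the absolute number of hours per week of focused AI-assisted session time, discounted when Quality Retention (QR) falls below 85%. Each percentage point of QR below 85% removes one percent of raw hours, and the discount floors at 50% of raw hours (reached once QR is 35% or lower). As a result, “ship fast, churn faster” never reads as zero, but it also never reads as pure wall time.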
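The two formulas above translate directly into code. In this minimal sketch the function name `speed_final` is hypothetical, and QR is assumed to be expressed as a fraction between 0 and 1, matching the 0.85 constant in the formula.

```python
def speed_final(active_cli_seconds, qr):
    """Quality-discounted hours per week.

    active_cli_seconds: focused session time for the week, in seconds.
    qr: Quality Retention as a fraction (ASSUMED 0-1, e.g. 0.85 for 85%).
    """
    speed_raw = active_cli_seconds / 3600.0
    # No discount at QR >= 0.85; the multiplier never drops below 0.5.
    multiplier = max(0.5, 1.0 - max(0.0, 0.85 - qr))
    return speed_raw * multiplier


week = 20 * 3600  # 20 raw hours

no_discount = speed_final(week, 0.90)   # QR above 85%: stays at 20 hours
mild = speed_final(week, 0.75)          # 10 points below: about 18 hours
floored = speed_final(week, 0.20)       # floor applies: 10 hours
```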

Bands (weekly view):

| Range | Band |
|---|---|
| 30h+ | Heavy |
| 15–30h | Steady |
| 5–15h | Light |
| < 5h | Low |

Route: /prism/efficiency

Efficiency = Σ(tokens) / Σ(active_cli_hours)

Tokens spent per active hour. Lower is better — fewer tokens for the same time-on-task means clearer prompts, better context leverage, and less exploratory tool use. Efficiency moves automatically when you improve Context Leverage and Information Density at the prompt level.
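Note that the formula divides a sum by a sum, which is not the same as averaging per-session ratios. The sketch below shows the difference; the function name `efficiency` and the `(tokens, active_cli_hours)` tuple layout are illustrative assumptions.

```python
def efficiency(sessions):
    """Tokens per active hour across all sessions (lower is better).

    sessions: iterable of (tokens, active_cli_hours) pairs; this layout
    is a hypothetical one chosen for the example.
    """
    sessions = list(sessions)
    total_tokens = sum(tokens for tokens, _ in sessions)
    total_hours = sum(hours for _, hours in sessions)
    if total_hours <= 0:
        raise ValueError("no active hours recorded")
    return total_tokens / total_hours


sessions = [(120_000, 2.0), (30_000, 1.0)]

ratio_of_sums = efficiency(sessions)  # 150,000 tokens / 3 hours = 50,000
# Averaging the per-session rates (60,000 and 30,000) would give 45,000,
# which under-weights the longer session.
mean_of_ratios = sum(t / h for t, h in sessions) / len(sessions)
```

Summing first means long sessions count in proportion to their hours, so one short, cheap session cannot mask a long, token-heavy one.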

Prism evaluates work at two different units:

  • A session is the whole Claude Code run — the span between /clear events or a process lifetime.
  • A sub-session is one coherent goal inside a session: from the first prompt on that goal to completion or abandonment.

One session typically contains many sub-sessions. A whole session is too coarse — it mixes debugging, authoring, and planning into one number. A single prompt is too fine — you can’t measure “did it reach the goal” per prompt. One goal = one sub-session = one efficiency score.

Sub-session boundaries are detected by the engine from explicit signals (/clear, /compact, session-id transitions), implicit signals (30-minute idle gap, intent change, file-context pivot), and completion signals (successful verification, explicit new-task prompt).
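To make the boundary rules concrete, here is a deliberately simplified sketch that handles only two of the signals listed above: the explicit /clear and /compact commands and the 30-minute idle gap. It is not the engine's implementation. The input shape, the function name `split_sub_sessions`, and the choice to treat a gap of exactly 30 minutes as a boundary are all assumptions, and intent changes, file-context pivots, session-id transitions, and completion signals are left out.

```python
IDLE_GAP_SECONDS = 30 * 60
RESET_COMMANDS = {"/clear", "/compact"}


def split_sub_sessions(prompts):
    """Group prompts into sub-sessions.

    prompts: list of (timestamp_seconds, text) tuples sorted by time
    (a hypothetical input shape for this example).
    """
    groups = []
    current = []
    last_ts = None
    for ts, text in prompts:
        is_reset = text.strip() in RESET_COMMANDS
        is_idle = last_ts is not None and ts - last_ts >= IDLE_GAP_SECONDS
        # Either signal closes the sub-session in progress, if any.
        if (is_reset or is_idle) and current:
            groups.append(current)
            current = []
        # Reset commands mark a boundary but are not prompts themselves.
        if not is_reset:
            current.append((ts, text))
        last_ts = ts
    if current:
        groups.append(current)
    return groups


prompts = [
    (0, "fix the failing login test"),
    (60, "now run the test suite"),
    (120, "/clear"),
    (180, "draft the README section"),
    (1980, "add a changelog entry"),  # 30 minutes after the previous prompt
]
groups = split_sub_sessions(prompts)  # three sub-sessions
```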

All scores can also be read as a letter grade. The same 10-tier table is used everywhere in the dashboard — B is baseline (team average lands here), not a middling number.

| Grade | Score range (0–10) | Meaning |
|---|---|---|
| A+ | 9.5 – 10.0 | Exemplary: reference prompt |
| A | 9.0 – 9.4 | Excellent: minimum path in context |
| A− | 8.5 – 8.9 | Strong: one dimension could tighten |
| B+ | 8.0 – 8.4 | Above baseline |
| B | 7.0 – 7.9 | Baseline: clear, actionable, production-ready |
| B− | 6.5 – 6.9 | Just at baseline |
| C+ | 6.0 – 6.4 | Minor gaps: likely needs one iteration |
| C | 5.0 – 5.9 | Noticeable gaps: coaching signal |
| D | 3.0 – 4.9 | Below standard: rewrite expected |
| F | 0.0 – 2.9 | Inadequate: high risk of wasted output |

B spans a full 1.0 point because most real prompts land in that range; narrower bands around baseline would flicker between adjacent grades for the same underlying behavior. The bands next to it are narrower: A, A−, B+, B−, and C+ each span 0.5.
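A grade lookup over the table can be sketched as follows. The table lists ranges to one decimal place, so this sketch assumes each grade starts at its lower bound and runs up to the next grade's lower bound (a score of 9.45 therefore reads as A). The function name `letter_grade` is hypothetical, and the code uses an ASCII hyphen in "A-" and "B-" for convenience.

```python
# (lower bound, grade) pairs from the table above, highest first.
GRADES = [
    (9.5, "A+"),
    (9.0, "A"),
    (8.5, "A-"),
    (8.0, "B+"),
    (7.0, "B"),
    (6.5, "B-"),
    (6.0, "C+"),
    (5.0, "C"),
    (3.0, "D"),
    (0.0, "F"),
]


def letter_grade(score):
    """Map a 0-10 score to a letter grade.

    ASSUMPTION: bands are half-open, so a score between two listed
    ranges (e.g. 9.45) falls into the lower grade.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be between 0 and 10")
    for lower_bound, grade in GRADES:
        if score >= lower_bound:
            return grade
```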

A pilot’s cockpit doesn’t have a “flight score” — it has altitude, airspeed, heading, and fuel, four instruments for four independent questions. Prism is the same: you can be fast but sloppy, skilled but slow, or efficient per token while spending very few hours. A single number hides that trade-off; three scores make it explicit.