Engineering Blueprint

LLM-as-a-judge to benchmark software platform experience

Alexis Z. · CPTO · G2 · May 2026

Benchmark software platform UX across multiple apps using dual LLM-as-judge and synthetic-user scoring passes that validate each other, generating quantified performance comparisons and interactive da

1 File Included

  • Untitled.txt

    12 KB

What problem does this solve?

Evaluate software platform performance and benchmark UI/UX experiences across multiple apps and websites in a defensible, quantified way. The framework uses two complementary scoring passes — a calibrated LLM-as-judge pass grounded in a behavioral rubric, and synthetic users built from real-world sentiment data — to pressure-test each other and surface gaps neither approach would catch alone.

How does it work?

The workflow runs two complementary scoring passes that pressure-test each other, then compares them and packages the results:

(1) LLM-as-Judge pass: Define 5 personas and 4 behavioral rubric dimensions, build an evidence corpus from web fetches and third-party commentary, then run Claude to score 100 cells (5 platforms × 5 personas × 4 dimensions), each with a rationale and cited evidence.

(2) Synthetic-User Validation pass: Construct synthetic users grounded in real-world sentiment data (Trustpilot, Reddit, vendor reviews), then score the same 100 cells using persona-specific friction biases and cell-specific adjustments.

(3) Compare: Calculate Pearson r, mean absolute error (MAE), mean signed delta, and per-cell divergences between the two passes.

(4) Deliver: Generate an editorial-style HTML dashboard with a leaderboard, heatmap, per-platform deep-dives, and validation metrics, plus a companion markdown report.
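The comparison in step (3) can be sketched in a few lines of plain Python. This is a minimal illustration, not the blueprint's own code: the function name `compare_passes`, the `(platform, persona, dimension)` cell keys, and the sign convention (synthetic score minus judge score) are all assumptions made for the example.

```python
import math


def compare_passes(judge, synthetic, top_n=5):
    """Compare two score grids keyed by (platform, persona, dimension).

    Hypothetical helper: the key shape and the delta sign convention
    (synthetic minus judge) are assumptions, not taken from the blueprint.
    """
    if set(judge) != set(synthetic):
        raise ValueError("both passes must score exactly the same cells")

    keys = sorted(judge)
    n = len(keys)
    j = [judge[k] for k in keys]
    s = [synthetic[k] for k in keys]
    deltas = [sv - jv for jv, sv in zip(j, s)]

    # Mean absolute error and mean signed delta across all cells.
    mae = sum(abs(d) for d in deltas) / n
    mean_signed_delta = sum(deltas) / n

    # Pearson r from the standard covariance / standard-deviation form.
    mean_j, mean_s = sum(j) / n, sum(s) / n
    cov = sum((a - mean_j) * (b - mean_s) for a, b in zip(j, s))
    var_j = sum((a - mean_j) ** 2 for a in j)
    var_s = sum((b - mean_s) ** 2 for b in s)
    if var_j == 0 or var_s == 0:
        pearson_r = float("nan")  # undefined when one pass has no variance
    else:
        pearson_r = cov / math.sqrt(var_j * var_s)

    # Cells where the two passes disagree most, largest gap first.
    divergences = sorted(
        zip(keys, deltas), key=lambda kd: abs(kd[1]), reverse=True
    )[:top_n]

    return {
        "pearson_r": pearson_r,
        "mae": mae,
        "mean_signed_delta": mean_signed_delta,
        "top_divergences": divergences,
    }
```

Reporting all four numbers matters because Pearson r only measures linear agreement: two passes can correlate perfectly while still differing by a constant offset or scale, which MAE and the mean signed delta would expose.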

What's the biggest win?

Produces a defensible, pressure-tested audit artifact (dashboard + report) that executives can act on. The dual-pass validation catches gaps the rubric misses and confirms or sharpens recommendations. Typical agreement (Pearson r > 0.85) signals that both passes are reading the same reality.

What should I know technically?

  • Judge system prompt: requires scoring from the persona's perspective, grounding every score in 1-3 cited pieces of evidence, calibrating against behavioral anchors (defined at scores 1, 3, and 5 rather than as adjectives), and outputting strict JSON with score, rationale, and evidence per cell.

  • Synthetic-user scoring formula: `judge_score + persona_friction_bias + cell_specific_adjustment`, clamped to [1, 5] and rounded to the nearest 0.5.

  • Evidence corpus structure, per platform: Discovery Mechanics / Content Depth / Signal vs Noise / Actionability / Persona-Specific Notes.

  • Synthetic user construction: include grounded_in sources, walking_in_priors, a friction_score_bias per dimension, and a voice_library for rationales.

  • Live runs: call `run_judgment_live()` with ANTHROPIC_API_KEY set; cost is roughly $2-5 per 100-cell run on Opus.
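The synthetic-user formula above can be sketched as follows. This is a hedged illustration under stated assumptions: the `SyntheticUser` dataclass and `synthetic_score` function are hypothetical names, the field types are guesses based on the field names in the text, and ties are rounded half-up because the source does not say how ties are broken.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SyntheticUser:
    """Hypothetical container; field names follow the text, types are assumed."""
    persona: str
    grounded_in: list                # sources the persona was built from
    walking_in_priors: str           # expectations held before using a platform
    friction_score_bias: dict        # dimension name -> bias added to the judge score
    voice_library: list = field(default_factory=list)  # phrases used in rationales


def synthetic_score(judge_score, user, dimension, cell_adjustment=0.0):
    """judge_score + persona bias + cell adjustment, clamped to [1, 5], rounded to 0.5."""
    raw = judge_score + user.friction_score_bias.get(dimension, 0.0) + cell_adjustment
    clamped = max(1.0, min(5.0, raw))
    # Assumption: round half-up to the nearest 0.5. Python's built-in round()
    # uses banker's rounding, which would send 3.25 to 3.0 instead of 3.5.
    return math.floor(clamped * 2 + 0.5) / 2


# Illustrative persona; every value below is made up for the example.
example_user = SyntheticUser(
    persona="time-pressed buyer",
    grounded_in=["Trustpilot", "Reddit"],
    walking_in_priors="expects a shortlist within a few minutes",
    friction_score_bias={"Discovery Mechanics": -0.5, "Actionability": 0.25},
)

print(synthetic_score(4.0, example_user, "Discovery Mechanics"))  # 3.5
print(synthetic_score(4.5, example_user, "Actionability", 1.0))   # 5.0 (clamped)
```

Because 1 and 5 are themselves multiples of 0.5, clamping before or after rounding gives the same result; the sketch clamps first to match the order given in the text.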

What are the constraints?

The framework only works for platforms that can be observed through public web fetches. It is not suited to single-platform reviews, and it does not replace quantitative usability testing with real users. Rubric frameworks and dimension sets can be swapped, but changing dimensions mid-audit requires re-running both the judge and synthetic-user scoring passes in full, so plan for that in timelines.

Tools in this Blueprint

Claude

About This Blueprint

Industry
Computer Software