uiniverse

AI Eval Results

Does feeding an LLM 500 lines of source code beat a 30-line JSON descriptor? We tested 6 components across 2 models with 72 eval runs.

  • 6.0x fewer tokens with the descriptor
  • ±2pp average quality difference between conditions
  • 2 models tested (cross-provider)

Live Comparisons

AI-rendered components are shown side by side on the live page: raw source vs descriptor.

Model Comparison

Model              Raw Score  Descriptor Score  Raw Tokens  Desc Tokens  Token Ratio  TS Validity (Raw/Desc)
Gemini 2.5 Flash   97%        95%               5023        894          5.6x         100% / 100%
Claude Sonnet 4    94%        93%               5116        797          6.4x         100% / 100%
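
As a quick sanity check of the headline number, a minimal sketch assuming the 6.0x figure is the unweighted mean of the two per-model token ratios:

// Verify the per-model token ratios from the table and the headline 6.0x figure,
// assuming it is the unweighted mean of the two ratios.
const tokenStats = [
  { name: "Gemini 2.5 Flash", rawTokens: 5023, descTokens: 894 },
  { name: "Claude Sonnet 4", rawTokens: 5116, descTokens: 797 },
];

for (const m of tokenStats) {
  console.log(m.name, (m.rawTokens / m.descTokens).toFixed(1) + "x"); // 5.6x, 6.4x
}

const meanRatio =
  tokenStats.reduce((sum, m) => sum + m.rawTokens / m.descTokens, 0) /
  tokenStats.length;
console.log(meanRatio.toFixed(1) + "x"); // 6.0x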

Analytics

Charts (rendered on the live page):

  • Quality Score by Model
  • Token Consumption by Model
  • Per-Component Breakdown

Methodology

Components Tested

  • Counter
  • CircularGallery
  • InfiniteMenu
  • SoftAurora
  • FlowingMenu
  • ShapeGrid

Conditions

  • raw-source-only — the LLM receives the full component source code (170-2400 lines)
  • descriptor-only — the LLM receives a compact AI descriptor JSON (~30-90 lines); see the sketch below
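
For concreteness, a minimal sketch of what a compact descriptor for the Counter component might contain. The field names here are illustrative assumptions, not the project's actual schema:

// Hypothetical descriptor shape -- field names are assumptions for illustration,
// not the project's real schema.
interface ComponentDescriptor {
  name: string;
  importPath: string;
  props: Record<string, { type: string; default?: unknown; description: string }>;
  usage: string; // minimal working JSX example
}

const counterDescriptor: ComponentDescriptor = {
  name: "Counter",
  importPath: "@/components/Counter", // illustrative path
  props: {
    value: { type: "number", default: 0, description: "Current count" },
    step: { type: "number", default: 1, description: "Increment per click" },
  },
  usage: "<Counter value={10} step={2} />",
};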

Prompts

  • basic-usage — render with defaults
  • prop-customization — modify 2-3 props
  • complex-usage — realistic production scenario; the full run matrix is sketched after this list
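
These dimensions multiply out to the 72 runs in the headline. A minimal sketch of the matrix, assuming one run per combination (the model id strings are illustrative):

// 6 components x 2 conditions x 3 prompts x 2 models = 72 eval runs.
const components = [
  "Counter", "CircularGallery", "InfiniteMenu",
  "SoftAurora", "FlowingMenu", "ShapeGrid",
];
const conditions = ["raw-source-only", "descriptor-only"];
const promptKinds = ["basic-usage", "prop-customization", "complex-usage"];
const evalModels = ["gemini-2.5-flash", "claude-sonnet-4"]; // illustrative ids

const runs = components.flatMap((component) =>
  conditions.flatMap((condition) =>
    promptKinds.flatMap((prompt) =>
      evalModels.map((model) => ({ component, condition, prompt, model }))
    )
  )
);

console.log(runs.length); // 72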

Scoring Metrics

  • Prop correctness — does the generated code use only props that actually exist on the component?
  • Import correctness — is the import path correct?
  • TypeScript validity — are the imports, JSX, and exports valid?
  • Overall — the average of all metrics (see the sketch below)
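
A minimal sketch of the scoring arithmetic, assuming each metric is graded on a 0-1 scale per run; the actual grader may differ:

// Hypothetical per-run scores on a 0-1 scale; the real grader may differ.
interface RunScores {
  propCorrectness: number;   // only real props used?
  importCorrectness: number; // correct import path?
  tsValidity: number;        // valid imports, JSX, exports?
}

// Overall is the average of all metrics, as described above.
function overallScore(s: RunScores): number {
  const values = [s.propCorrectness, s.importCorrectness, s.tsValidity];
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

console.log(overallScore({ propCorrectness: 1, importCorrectness: 1, tsValidity: 0.9 })); // ~0.97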

Reproduce

pnpm eval:comparative
pnpm eval:comparative:gemini
pnpm eval:report:comparative
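
Judging by the script names, the first two commands presumably run the eval matrix against the default (Claude) and Gemini models respectively, and the third regenerates this report from the recorded results.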

Detailed Reports

Full reports include side-by-side comparisons of the generated code.