| Metric | Score |
|---|---|
| Schema Validity | 100% |
| Grounding Score | 94% |
| Hallucination Rate | 0% |
| Arabic Fluency | 9.2/10 |
| Avg Latency | 2.8s |
Test Cases (n=12)
Results from the synthetic 12-stroller dataset evaluation suite.
| Test ID | Intent Evaluated | Products | Result |
|---|---|---|---|
| #001 | Travel stroller under 1000 AED | ST-001, ST-004, ST-009 | Pass |
| #002 | Newborn friendly, large basket | ST-003, ST-008, ST-007 | Pass |
| #003 | Apartment living, stairs | ST-001, ST-005, ST-002 | Pass |
| #004 | Jogging and off-road | ST-012, ST-007, ST-008 | Pass |
| #005 | Missing data handling (fold_type) | ST-011, ST-002 | Pass |
| #006 | Budget options | ST-002, ST-004, ST-012 | Pass |
| #007 | Luxury travel | ST-006, ST-001, ST-009 | Pass |
| #008 | Car seat compatible | ST-010, ST-003 | Pass |
| #009 | Twins / double conversion | ST-008, ST-007 | Pass |
| #010 | Ultra-compact one-hand fold | ST-004, ST-006 | Pass |
| #011 | Arabic intent | ST-001, ST-002 | Pass |
| #012 | Gibberish input handling | ST-001, ST-005 | Pass |
Evaluation Methodology
Schema Validity: Validates that the LLM output strictly conforms to the Pydantic ComparisonResponse schema.
Grounding Score: The share of claims in the pros/cons and tradeoffs sections that are explicitly supported by the synthetic JSON dataset.
Hallucination Rate: The rate at which the LLM invents specs absent from the dataset (e.g. fabricating a fold type when the field is missing).
Arabic Fluency: Evaluated by an LLM-as-a-judge specifically prompted to penalize machine-translation artifacts.
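The first two checks can be sketched in code. The report does not show the actual `ComparisonResponse` fields, so the schema below is an assumed minimal version, and the grounding check is a deliberately naive verbatim-substring heuristic, not the suite's real scorer.

```python
# Hedged sketch: ComparisonResponse fields and the grounding heuristic
# are assumptions for illustration, not the evaluation suite's real code.
from pydantic import BaseModel, ValidationError


class ProductEntry(BaseModel):
    product_id: str          # e.g. "ST-001"
    pros: list[str]
    cons: list[str]


class ComparisonResponse(BaseModel):
    products: list[ProductEntry]
    tradeoffs: str


def is_schema_valid(raw_json: str) -> bool:
    """Schema Validity: does the LLM output parse strictly against the schema?"""
    try:
        ComparisonResponse.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False


def grounding_score(claims: list[str], dataset_text: str) -> float:
    """Naive grounding check: fraction of claims found verbatim in the
    synthetic dataset text. A real scorer would match structured fields."""
    if not claims:
        return 1.0
    grounded = sum(1 for c in claims if c.lower() in dataset_text.lower())
    return grounded / len(claims)
```

Aggregating `is_schema_valid` over the 12 test cases yields the Schema Validity percentage; averaging `grounding_score` over all claims yields a figure analogous to the 94% Grounding Score above.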