| Metric | Score |
|---|---|
| Schema Validity | 100% |
| Grounding Score | 94% |
| Hallucination Rate | 0% |
| Arabic Fluency | 9.2/10 |
| Avg Latency | 2.8s |
Test Cases (n=12)
Results from the synthetic 12-stroller dataset evaluation suite.
| Test ID | Intent Evaluated | Products | Result |
|---|---|---|---|
| #001 | Travel stroller under 1000 AED | ST-001, ST-004, ST-009 | Pass |
| #002 | Newborn friendly, large basket | ST-003, ST-008, ST-007 | Pass |
| #003 | Apartment living, stairs | ST-001, ST-005, ST-002 | Pass |
| #004 | Jogging and off-road | ST-012, ST-007, ST-008 | Pass |
| #005 | Missing data handling (fold_type) | ST-011, ST-002 | Pass |
| #006 | Budget options | ST-002, ST-004, ST-012 | Pass |
| #007 | Luxury travel | ST-006, ST-001, ST-009 | Pass |
| #008 | Car seat compatible | ST-010, ST-003 | Pass |
| #009 | Twins / double conversion | ST-008, ST-007 | Pass |
| #010 | Ultra-compact one-hand fold | ST-004, ST-006 | Pass |
| #011 | Arabic intent | ST-001, ST-002 | Pass |
| #012 | Gibberish input handling | ST-001, ST-005 | Pass |
Evaluation Methodology
Schema Validity: Validates that the LLM output strictly conforms to the Pydantic ComparisonResponse schema.
Grounding Score: The share of claims in the pros/cons and tradeoffs sections that are explicitly supported by the synthetic JSON dataset.
Hallucination Rate: The rate at which the LLM invents specs absent from the dataset (e.g. fabricating a fold type when the field is missing).
Arabic Fluency: Evaluated by an LLM-as-a-judge specifically prompted to penalize machine-translation artifacts.
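The first two checks can be sketched in code. The report does not show the actual `ComparisonResponse` fields, so the schema below is an assumed minimal version, and the grounding check is a deliberately naive verbatim-substring heuristic, not the suite's real scorer.

```python
# Hedged sketch: ComparisonResponse fields and the grounding heuristic
# are assumptions for illustration, not the evaluation suite's real code.
from pydantic import BaseModel, ValidationError


class ProductEntry(BaseModel):
    product_id: str          # e.g. "ST-001"
    pros: list[str]
    cons: list[str]


class ComparisonResponse(BaseModel):
    products: list[ProductEntry]
    tradeoffs: str


def is_schema_valid(raw_json: str) -> bool:
    """Schema Validity: does the LLM output parse strictly against the schema?"""
    try:
        ComparisonResponse.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False


def grounding_score(claims: list[str], dataset_text: str) -> float:
    """Naive grounding check: fraction of claims found verbatim in the
    synthetic dataset text. A real scorer would match structured fields."""
    if not claims:
        return 1.0
    grounded = sum(1 for c in claims if c.lower() in dataset_text.lower())
    return grounded / len(claims)
```

Aggregating `is_schema_valid` over the 12 test cases yields the Schema Validity percentage; averaging `grounding_score` over all claims yields a figure analogous to the 94% Grounding Score above.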