Evaluation Dashboard

Mumzworld Compare AI Pipeline Metrics

Status: Pipeline Passing

| Metric             | Value  |
|--------------------|--------|
| Schema Validity    | 100%   |
| Grounding Score    | 94%    |
| Hallucination Rate | 0%     |
| Arabic Fluency     | 9.2/10 |
| Avg Latency        | 2.8s   |
Test Cases (n=12)
Results from the evaluation suite run against the synthetic 12-stroller dataset.
| Test ID | Intent Evaluated                  | Products               | Result |
|---------|-----------------------------------|------------------------|--------|
| #001    | Travel stroller under 1000 AED    | ST-001, ST-004, ST-009 | Pass   |
| #002    | Newborn-friendly, large basket    | ST-003, ST-008, ST-007 | Pass   |
| #003    | Apartment living, stairs          | ST-001, ST-005, ST-002 | Pass   |
| #004    | Jogging and off-road              | ST-012, ST-007, ST-008 | Pass   |
| #005    | Missing data handling (fold_type) | ST-011, ST-002         | Pass   |
| #006    | Budget options                    | ST-002, ST-004, ST-012 | Pass   |
| #007    | Luxury travel                     | ST-006, ST-001, ST-009 | Pass   |
| #008    | Car seat compatible               | ST-010, ST-003         | Pass   |
| #009    | Twins / double conversion         | ST-008, ST-007         | Pass   |
| #010    | Ultra-compact one-hand fold       | ST-004, ST-006         | Pass   |
| #011    | Arabic intent handling            | ST-001, ST-002         | Pass   |
| #012    | Gibberish input handling          | ST-001, ST-005         | Pass   |
Evaluation Methodology

Schema Validity: Validates that the LLM output strictly conforms to the Pydantic ComparisonResponse schema.
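
A minimal sketch of this check, assuming a simplified ComparisonResponse; the field names below are placeholders, not necessarily the pipeline's actual schema:

```python
from pydantic import BaseModel, ValidationError


class ProductProsCons(BaseModel):
    product_id: str
    pros: list[str]
    cons: list[str]


class ComparisonResponse(BaseModel):
    # Hypothetical simplified schema; field names are assumptions.
    products: list[str]
    pros_cons: list[ProductProsCons]
    tradeoffs: str


def schema_is_valid(raw_llm_output: str) -> bool:
    """Return True if the raw JSON string parses into the schema."""
    try:
        ComparisonResponse.model_validate_json(raw_llm_output)
        return True
    except ValidationError:
        return False
```

The dashboard's 100% figure would then correspond to schema_is_valid returning True for all 12 test cases.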

Grounding Score: Measures the fraction of claims in the pros/cons and tradeoffs sections that are explicitly supported by the synthetic JSON dataset.
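
One way to approximate this check is a lexical lookup against the dataset. The sketch below assumes a strollers.json file and treats a claim as grounded if its text appears in the serialized records; this is a deliberate simplification of matching claims against structured spec fields:

```python
import json


def grounding_score(claims: list[str], dataset_path: str = "strollers.json") -> float:
    """Fraction of claims whose text appears in the source dataset.

    Simplified lexical check: a production version would match each
    claim against the structured spec fields of the cited product
    rather than substring-searching the raw JSON.
    """
    with open(dataset_path, encoding="utf-8") as f:
        corpus = json.dumps(json.load(f), ensure_ascii=False).lower()
    grounded = sum(1 for claim in claims if claim.lower() in corpus)
    return grounded / len(claims) if claims else 1.0
```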

Hallucination Rate: Measures how often the LLM invents specs (e.g. fabricating a fold_type value when the field is missing from a product record).
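
A sketch of the missing-spec check, keyed to the fold_type case exercised by test #005; the candidate values are illustrative assumptions, not the dataset's real taxonomy:

```python
import json


def invents_missing_spec(response: dict, product_record: dict,
                         field: str = "fold_type") -> bool:
    """Flag a hallucination when the response asserts a value for a spec
    that is null/absent in the source record (as in test case #005)."""
    if product_record.get(field) is not None:
        return False  # the spec exists, so a mention can be grounded
    text = json.dumps(response, ensure_ascii=False).lower()
    # Values the model might plausibly invent for fold_type
    # (illustrative list only).
    invented = ["one-hand fold", "tri-fold", "umbrella fold", "flat fold"]
    return any(value in text for value in invented)
```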

Arabic Fluency: Scored on a 10-point scale by an LLM-as-a-judge specifically prompted to penalize machine-translation artifacts.
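
A minimal sketch of the judge call, assuming an OpenAI-style client; the model name and prompt wording are placeholders, not the pipeline's actual judge configuration:

```python
from openai import OpenAI  # any chat-capable client would work similarly

JUDGE_PROMPT = """You are a native Arabic editor. Rate the following
comparison text from 1 to 10 for fluency. Penalize literal
machine-translation artifacts: calqued English word order, untranslated
product jargon, and unnatural collocations. Reply with the number only.

Text:
{text}
"""


def judge_arabic_fluency(text: str, model: str = "gpt-4o") -> float:
    """Ask an LLM judge for a 1-10 fluency score (model is a placeholder)."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```

The reported 9.2/10 would be the average of these per-case scores across the suite.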