Evaluation Method

Defining what "good" means for Journeys, and how to measure it — across the full value chain from topic selection to user outcome.

Why Evaluating Journeys Is Hard

Three structural challenges make standard eval approaches break down

Messy input, multi-stage pipeline. Real-world browsing is noisy — dozens of tabs, background pages, half-finished lookups mixed with deep research. Journeys runs a multi-stage pipeline (intent detection → grouping → trigger decision → card generation → next-step suggestion) where each stage’s quality compounds. A failure anywhere — misdetected intent, wrong grouping, premature trigger — collapses the end result.
Trust is the product. Journeys surfaces cards proactively with zero user input — it must earn attention every time. If a card promises “Compare headphone prices” and the landing shows a generic summary, trust erodes, users stop clicking, the feature dies. We need to evaluate promise-delivery alignment, not just “was the response good.”
Different scenarios need different help. Shopping needs comparison tables, learning needs synthesis, getting things done needs shortcuts. A flat “rate 1–5” collapses distinct failure modes into one useless number.

Given these challenges, we structure evaluation around the user’s decision point — the click:

Selection (Right topic?) → Card (Clear promise?) → Click (Earned attention) → Landing (Delivered value?) → Outcome (Goal achieved?)

	Before Click	After Click
What we evaluate	Did the system pick the right topic, at the right time, and present a clear, honest card?	Did the landing page deliver on the card’s promise and actually move the user forward?
Layers	L1 (Selection Quality) + L2 (Card Quality)	L3 (Landing Quality) + L4 (Outcome Quality)
Key failure	Over-triggering, wrong topic, vague card	Hallucination, broken promise, wrong help type

The Trust Equation

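Trust = Σ (Delivery − Promise) over time

Each click adds to trust when the landing meets or exceeds what the card promised, and subtracts from it when the landing falls short. If the card over-promises and the landing under-delivers, each term is negative: trust erodes, users stop clicking, and Journeys dies. Promise-Delivery Alignment is the single most important cross-layer metric.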

Layer 1 — Selection Quality

Out of everything we know about this user, did we pick the highest-value thing to surface?

Trigger Precision

Pass / Fail Offline

Should a Journey have been triggered at all — or should the system have stayed silent?

This is the binary gate before all other quality metrics apply. A Journey system must correctly decide when to fire and when to stay silent. Over-triggering (generating Journeys for completed tasks, trivial lookups, or noise) wastes user attention and erodes trust. Under-triggering (missing genuinely valuable research sessions) means the feature fails to deliver its core promise.

How to measure: Treat Journey generation as a binary classification problem. For each browsing session in the eval dataset:

True Positive = POS task correctly generates a Journey.
True Negative = NEG task correctly produces no Journey.
False Positive (over-trigger) = NEG task incorrectly generates a Journey.
False Negative (under-trigger) = POS task fails to generate a Journey.

Report standard classification metrics: Precision (of all triggered Journeys, how many should have been triggered?), Recall (of all sessions that deserved a Journey, how many actually got one?), and F1.

Target: Precision ≥ 95%, Recall ≥ 90%. False positives are weighted more heavily than false negatives: a Journey that shouldn’t exist is worse than a Journey that’s missing.
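The classification metrics above can be sketched as follows. This is a minimal illustration, assuming each eval task carries a POS/NEG ground-truth label and a boolean recording whether a Journey was generated. The function name `trigger_metrics` and the tuple layout are hypothetical, and the sample data is invented.

```python
from typing import Iterable, Tuple


def trigger_metrics(results: Iterable[Tuple[str, bool]]) -> dict:
    """Compute confusion counts, precision, recall, and F1 for the trigger decision.

    Each item is (label, triggered): label is "POS" or "NEG" (ground truth),
    triggered is True if the system generated a Journey for that session.
    """
    tp = fp = fn = tn = 0
    for label, triggered in results:
        if label == "POS" and triggered:
            tp += 1
        elif label == "POS":
            fn += 1  # under-trigger
        elif triggered:
            fp += 1  # over-trigger
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
sample = ([("POS", True)] * 9 + [("POS", False)]
          + [("NEG", False)] * 9 + [("NEG", True)])
m = trigger_metrics(sample)

# Targets from the text: Precision >= 95%, Recall >= 90%.
meets_target = m["precision"] >= 0.95 and m["recall"] >= 0.90
```

In this toy run a single over-trigger among ten triggered Journeys is enough to miss the precision target, which reflects the heavier weighting of false positives.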

Safety & Privacy

Pass / Fail Offline

Does this Journey violate user privacy by surfacing sensitive personal information?

Journeys must not surface content that violates user privacy, particularly browsing activity related to health/medical conditions, financial/banking information, adult content, legal services, or relationships/dating. Even when the browsing data is rich and shows clear intent, privacy-sensitive topics must be suppressed. Multi-layer sensitive content filtering (WCF, DLP, ML classification) is applied to detect and block such content. How to measure: Privacy-tagged NEG tasks (NEG-F9-*) must produce zero Journeys. Any privacy failure is a ship-blocker regardless of other scores.
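The zero-Journey rule for privacy-tagged tasks could be checked with a sketch like the one below. It assumes eval output is available as a mapping from task ID to the list of generated Journeys. Only the `NEG-F9-*` pattern comes from the text; the function name, the concrete task IDs, and the Journey titles are invented for illustration.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def privacy_violations(journeys_by_task: Dict[str, List[str]],
                       pattern: str = "NEG-F9-*") -> List[str]:
    """Return IDs of privacy-tagged tasks that produced at least one Journey."""
    return sorted(
        task_id
        for task_id, journeys in journeys_by_task.items()
        if fnmatchcase(task_id, pattern) and len(journeys) > 0
    )


# Hypothetical eval output: task ID -> generated Journey titles.
run_output = {
    "NEG-F9-01": [],
    "NEG-F9-02": ["Continue reading about a medical condition"],
    "POS-A1-01": ["Continue researching noise-cancelling headphones"],
}

violations = privacy_violations(run_output)
ship_blocked = bool(violations)  # any privacy failure blocks ship
```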

Relevance

30% Offline

Does each Journey correctly understand what the user was trying to do?

The generated Journey topic must match the user’s actual browsing intent. A user researching “best noise-cancelling headphones” should get a headphones comparison Journey — not a generic “audio equipment” card. How to measure: Compare Journey topic against ground-truth intent label in the eval dataset. Binary pass/fail per task.

Groundedness

25% Offline

Is the Journey linked to the right source URLs, without missing important ones?

The Journey’s relatedPages must include the most important pages from the user’s browsing session and must not hallucinate URLs the user never visited. How to measure: Precision (no hallucinated URLs) and recall (key pages included) against the ground-truth browsing history. Flag if the user’s most-dwelled primary pages are missing.
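A minimal sketch of the URL precision and recall check, assuming the ground truth provides the full list of visited URLs plus a subset marked as key pages. The function name and the `example.com` URLs are placeholders; only `relatedPages` comes from the text.

```python
from typing import Iterable


def groundedness(related_pages: Iterable[str],
                 visited_urls: Iterable[str],
                 key_pages: Iterable[str]) -> dict:
    """Score a Journey's relatedPages against the ground-truth browsing history.

    precision: share of related pages the user actually visited.
    recall: share of key pages that made it into related pages.
    """
    related, visited, key = set(related_pages), set(visited_urls), set(key_pages)
    precision = len(related & visited) / len(related) if related else 0.0
    recall = len(related & key) / len(key) if key else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "hallucinated": sorted(related - visited),  # URLs never visited
        "missing_key": sorted(key - related),       # important pages left out
    }


# Placeholder URLs, for illustration only.
visited = ["https://example.com/a", "https://example.com/b",
           "https://example.com/c", "https://example.com/d"]
key = ["https://example.com/a", "https://example.com/b"]
related = ["https://example.com/a", "https://example.com/c",
           "https://example.com/x"]

g = groundedness(related, visited, key)
```

Here one of three related pages was never visited and one of two key pages is missing, so both failure types are reported alongside the ratios.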

Helpfulness

25% Offline

Among sessions that should trigger, did we pick the highest-value topic to surface?

Given that Trigger Precision already gates whether a Journey should exist, Helpfulness evaluates value density: did we surface the topic that would help this user the most right now? A user with three active research threads should see the most urgent or highest-progress one first, not a topic they barely started. On the eval dataset's Layer 2 profiles, this measures whether the system correctly ranks competing topics by value. How to measure: On the dataset's Layer 1 tasks, verify that POS tasks produce Journeys aligned with the highest-signal browsing goal. On the dataset's Layer 2 profiles, compare the ranked order of generated Journeys against the expected priority from ground truth. Penalize surfacing low-value topics when higher-value ones exist.
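The text does not name a ranking metric, so the sketch below uses pairwise agreement (the idea behind Kendall's tau) as one possible way to compare generated order against expected priority. The function name and the topic strings are illustrative assumptions.

```python
from itertools import combinations
from typing import List


def pairwise_rank_agreement(generated: List[str], expected: List[str]) -> float:
    """Fraction of topic pairs that appear in the same relative order in both lists.

    Only topics present in both rankings are compared. Returns 1.0 when there
    are fewer than two shared topics, since no pair can be out of order.
    """
    pos_generated = {topic: i for i, topic in enumerate(generated)}
    shared = [t for t in expected if t in pos_generated]  # kept in expected order
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    # In each pair, `a` outranks `b` in the expected list; check the generated list agrees.
    agree = sum(1 for a, b in pairs if pos_generated[a] < pos_generated[b])
    return agree / len(pairs)


# Invented example: three competing research threads.
expected_priority = ["headphones", "tokyo flights", "laptop"]
generated_order = ["tokyo flights", "headphones", "laptop"]

agreement = pairwise_rank_agreement(generated_order, expected_priority)
```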

Technical Feasibility

20% Cross-Layer

Can we actually deliver on what this Journey promises?

Don’t surface a Journey with a CTA like “Compare prices” if the landing experience can’t actually compare prices. Don’t promise “pick up where you left off” if we can’t restore meaningful context. How to measure: Flag any Journey whose CTA implies capability beyond what the landing experience supports. Requires comparing L1 output against L3 capability — hence cross-layer.

Eval Dataset Coverage: Dataset Layer 1 (64 tasks) directly tests Trigger Precision, Relevance, Groundedness, Helpfulness, and Safety & Privacy with ground-truth intent labels. Dataset Layer 2 (10 profiles) adds noise-filtering, dedup, and ranking tests at production-realistic scale. Technical Feasibility requires end-to-end evaluation with the landing experience. The dataset's Layer 1 and Layer 2 are separate from the four evaluation layers (L1 to L4) defined in this section.

Layer 2 — Card Quality

Did the card earn the click — and set honest expectations for what's behind it?

Clarity

50% Offline

Can users instantly understand what this Journey is about and what they'll get if they click?

The title must be readable and specific — not jargon, not vague. The CTA must clearly communicate the next action. "Continue researching noise-cancelling headphones" is clear. "Explore audio" is not. How to measure: Human raters answer "After reading only the card title and CTA, can you predict what the landing page contains?" — binary yes/no. LLM judge can replicate at scale after calibration.

Promise Accuracy

50% Cross-Layer

Does the card accurately represent what the landing experience will deliver?

The card is a contract. If the title says "Prices dropped for Tokyo flights," the landing must show actual price changes. If the CTA says "See comparison," there must be a comparison — not a search page. Over-promising is worse than under-promising. How to measure: Compare card claims against landing content. Score promise-delivery alignment 1-5. Any score ≤ 2 is a critical failure. Requires L2 × L3 evaluation.

Dimension	1 — Fail	3 — Acceptable	5 — Excellent
Clarity	Vague or misleading — user can't predict what they'll see	Generally correct but generic title/CTA	Specific, immediately understood, matches user's mental model
Promise Accuracy	Card claims something the landing doesn't deliver	Landing partially delivers on the card's promise	Landing fully matches or exceeds what the card promised

Layer 3 — Landing Quality

After the click, did the Copilot response actually deliver value? This is where the user decides if Journeys is worth coming back to.

Correctness

30% Offline

Are the claims in the Copilot response factually grounded in the user's browsing data?

Every factual claim in the response must be traceable to a source page the user actually visited. If the response says "the Sony WH-1000XM5 was rated 4.8/5," that rating must appear in the browsed pages. Hallucinated facts, invented prices, or unsupported recommendations are critical failures. How to measure: For each factual claim in the response, verify against raw_page_bodies in the eval dataset. Calculate claim precision = (supported claims / total claims). Target: ≥ 95%.
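The claim precision ratio can be sketched as below. Deciding whether a claim is supported is the hard part and would normally fall to a human rater or LLM judge. This example substitutes a deliberately naive stand-in: each claim is paired with an evidence snippet that must appear verbatim in at least one page body. The function name, the claim format, and the sample pages are assumptions; only `raw_page_bodies` and the formula come from the text.

```python
from typing import List, Tuple


def claim_precision(claims: List[Tuple[str, str]],
                    raw_page_bodies: List[str]) -> Tuple[float, List[str]]:
    """Return (supported claims / total claims, list of unsupported claim texts).

    claims: (claim_text, evidence_snippet) pairs. A claim counts as supported
    when its snippet occurs in any page body (case-insensitive substring match).
    """
    if not claims:
        return 1.0, []
    bodies = [body.lower() for body in raw_page_bodies]
    unsupported = [
        text for text, snippet in claims
        if not any(snippet.lower() in body for body in bodies)
    ]
    precision = (len(claims) - len(unsupported)) / len(claims)
    return precision, unsupported


# Toy pages and claims, invented for illustration only.
pages = [
    "Sony WH-1000XM5 review: rated 4.8/5 by our editors.",
    "Bose QC45 costs $279 this week.",
]
response_claims = [
    ("The Sony WH-1000XM5 was rated 4.8/5", "4.8/5"),
    ("The Bose QC45 costs $279", "$279"),
    ("The Sony has 40-hour battery life", "40-hour"),
]

precision, unsupported = claim_precision(response_claims, pages)
passes = precision >= 0.95  # target from the text
```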

Completeness

25% Offline

Does the response cover the key threads of the user's research, or miss important ones?

If the user browsed 3 competing products, the response should mention all 3 — not just 1. If the user explored both budget and premium options, both should appear. Missing a major thread the user spent significant time on is a meaningful quality gap. How to measure: Compare topics/entities in the response against primary pages in the browsing history. Calculate topic recall = (covered topics / total primary topics). Weighted by dwell time — missing a high-dwell topic is worse than missing a briefly visited one.
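A sketch of dwell-weighted topic recall, assuming each primary topic comes with its dwell time in seconds and that the weighting is linear in dwell time. The text only says "weighted by dwell time", so the linear form is an assumption, as are the function name and the sample numbers.

```python
from typing import Dict, Set


def weighted_topic_recall(primary_topics: Dict[str, float],
                          covered: Set[str]) -> float:
    """Dwell-weighted recall: covered dwell time / total dwell time on primary topics.

    primary_topics maps each ground-truth topic to its dwell time in seconds.
    covered is the set of topics the response actually mentions.
    """
    total = sum(primary_topics.values())
    if total == 0:
        return 0.0
    hit = sum(dwell for topic, dwell in primary_topics.items() if topic in covered)
    return hit / total


# Invented session: three competing products with different dwell times.
topics = {"Sony WH-1000XM5": 300, "Bose QC45": 240, "Apple AirPods Max": 60}
mentioned = {"Sony WH-1000XM5", "Apple AirPods Max"}

weighted = weighted_topic_recall(topics, mentioned)
unweighted = len(mentioned & set(topics)) / len(topics)
```

Missing the high-dwell Bose page pulls the weighted score (0.6) below the unweighted one (about 0.67), which is the penalty the text describes.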

Effort Reduction

20% Offline

Does the response organize and synthesize beyond what the user already has?

The response must add structure that the raw browsing history lacks. A good response turns 10 scattered product pages into a comparison table. A bad response just lists the same URLs the user already visited. The key question: did the Copilot do work the user would otherwise have to do themselves? How to measure: Expert rating 1-5. Score 1 = "just a list of links I already visited." Score 5 = "organized my research in a way that saves me 20+ minutes." LLM judge can assess structural transformation (e.g., did it create comparisons, summaries, or synthesis from raw pages?).

Actionability

15% Offline

Are there concrete, clickable next steps — not vague suggestions?

The response should enable immediate action. For shopping: direct links to buy, with prices. For research: specific follow-up questions or sources to check next. For trip planning: bookable options with dates. "You might want to look into this further" is a failure. How to measure: Count actionable elements (links, specific recommendations, next steps with clear targets). Binary: does the response contain at least one concrete action the user can take without going back to search? Scenario-dependent — a learning Journey may have lower actionability expectations than a shopping one.

Scenario Fit

10% Offline

Is the response format appropriate for the type of task?

The modality of help must match the user's scenario. Shopping → comparison table, not essay. Learning → synthesis with sources, not a link dump. Getting things done → direct shortcut, not a long explanation. How to measure: Classify the response format (comparison, summary, link list, step-by-step, narrative) and compare against expected format for the classified scenario type. Mismatch = fail.

Scenario	✓ Good	✗ Bad
🛒 Shopping	Comparison table, real prices, buy links	Generic paragraph, stale prices, no links
📚 Learning	Synthesis, progressive depth, cited sources	List of blue links already visited
⚡ Get Things Done	Pre-filled action, one-click next step	Info dump when user needed a button
✈️ Trip Planning	Itinerary-aware options, bookable links	Random travel blog articles
🔧 Troubleshooting	Step-by-step fix, device-specific	Generic FAQ from unrelated product
🔍 Exploring	Curated inspiration, broader options	Premature narrowing to one answer

Layer 4 — Outcome Quality

End-to-end: did the full Journey — from card to landing — actually move the user forward?

Self-Sufficiency

60% Offline

Could the user complete their goal from this response alone, without going back to search?

This is the ultimate offline proxy for value delivery. If a user was comparing headphones, can they make a purchase decision from the landing content? If they were planning a trip, can they book from here? How to measure: Expert rating 1-5. "Given this landing content and the user's ground-truth goal, how likely is the user to make progress without leaving?" Score 1 = "useless, would start over." Score 5 = "could complete the task right here." We expect this to be the offline metric most predictive of real-world satisfaction; that expectation can only be confirmed against the online signals below.

End-to-End Coherence

40% Cross-Layer

Does the full chain — intent classification → card title → landing content — tell a consistent story?

Each layer must reinforce the others. If L1 identifies "shopping for headphones," the card title should say "headphones," and the landing should show headphone comparisons — not earbuds or speakers. Inconsistency anywhere in the chain confuses users. How to measure: Extract the primary topic/entity from each layer and check alignment. Any layer that introduces a different topic or contradicts another layer = coherence failure.
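The alignment check might look like the sketch below. It assumes topic extraction has already happened upstream and that a case-insensitive exact match is an adequate comparison; a real check would likely need synonym handling. The layer keys and function name are hypothetical.

```python
from typing import Dict, List


def coherence_failures(layer_topics: Dict[str, str]) -> List[str]:
    """Return the layers whose primary topic differs from the first layer's topic.

    layer_topics maps a layer name to the primary topic extracted from it,
    in chain order (intent first). Comparison ignores case and outer whitespace.
    """
    items = list(layer_topics.items())
    if not items:
        return []
    reference = items[0][1].strip().lower()
    return [layer for layer, topic in items[1:]
            if topic.strip().lower() != reference]


# Example from the text: headphones intent, but the landing drifts to earbuds.
chain = {
    "L1 intent": "headphones",
    "L2 card": "Headphones",
    "L3 landing": "earbuds",
}
failures = coherence_failures(chain)
is_coherent = not failures
```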

Engagement Signals

— Online

Did the user engage with the landing? Continue the task? Complete a transaction?

Behavioral signals like click-through from card, dwell time on landing, downstream task completion, and return visits. These are the ground truth for value delivery — but only available in production with real users. Not measurable offline. Included here to mark the boundary of what offline eval can and cannot capture.

Longitudinal Trust

— Online

Does the user click Journey cards more or less over time?

The ultimate success metric for Journeys as a product. If card CTR increases over weeks, the value chain is working. If it declines, the promise-delivery gap is eroding confidence. Not measurable offline. This is the north-star online metric that the offline eval framework aims to predict.

Failure Mode Taxonomy

Categorical failures flagged separately — any critical failure blocks ship regardless of composite score

Failure Mode	Layer	Description	Severity
Wrong Topic	L1	Surfaced an irrelevant or low-priority topic	🔴 Critical
Should Not Generate	L1	Generated a card for a completed, trivial, or expired task	🔴 Critical
Over-Triggering	L1	System generates Journeys too aggressively — surfacing cards for sessions that don't warrant them (noise, trivial, low-signal)	🔴 Critical
Under-Triggering	L1	System fails to generate a Journey for a session with clear, active research intent — user misses help they should have received	🟡 High
Privacy Violation	L1	Surfaced sensitive/embarrassing topic on NTP	🔴 Ship-Blocker
Hallucination	L3	Response claims not supported by any browsed page	🔴 Critical
Promise-Delivery Gap	L2→L3	Card says one thing, landing shows another	🔴 Critical
Over-Promise	L1→L3	CTA implies capability the landing can't deliver	🔴 Critical
Wrong Help Type	L3	Correct topic but wrong format (info dump for action task)	🟡 High
Missing Key Source	L1	Important browsed page not included in related pages	🟡 High
Stale Content	L3	Prices, availability, or facts are outdated	🟡 High
No Actionability	L3	Correct info but no concrete next step	🟡 High
Vague Card	L2	Title/CTA too generic to set expectations	🟢 Medium
Redundant	L3	Shows info user already encountered — no synthesis	🟢 Medium

Composite Score

How layer scores combine — multiplicative, not additive

Journey Quality = L1(Selection) × L2(Card) × L3(Landing | Scenario) × L4(Outcome)

Multiplicative design. If any layer is zero (critical failure), the entire Journey scores zero. A wrong-topic card with a perfect landing is still a total miss. A perfect card with a hallucinating landing destroys trust. One broken link in the chain and the value is lost.

Result	What It Means	Action
✅ Pass (≥ 75)	Journey delivers end-to-end value for this scenario	Ship-ready
⚠️ Partial (50-74)	Value exists but with meaningful gaps	Identify weakest layer, targeted fix
✗ Fail (< 50)	Journey does not deliver value	Root-cause by layer before ship
🚫 Critical	Any critical failure flag (hallucination, privacy, broken promise)	Blocks ship — fix regardless of score
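The multiplicative composite and the result bands can be sketched as follows. The text does not say how layer scores are scaled, so this sketch assumes each dimension is normalized to [0, 1], each layer score is the weighted average of its dimensions, and the product is multiplied by 100 to match the 0 to 100 thresholds. The L3 weights are taken from the Layer 3 section; the function names and sample scores are hypothetical.

```python
from math import prod
from typing import Dict, List, Tuple

# Layer 3 dimension weights as listed in the Landing Quality section.
L3_WEIGHTS = {"correctness": 0.30, "completeness": 0.25,
              "effort_reduction": 0.20, "actionability": 0.15,
              "scenario_fit": 0.10}


def layer_score(dim_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of dimension scores, each already normalized to [0, 1]."""
    return sum(weights[d] * dim_scores[d] for d in weights)


def journey_quality(layer_scores: List[float],
                    critical_flags: List[str]) -> Tuple[float, str]:
    """Multiply layer scores (assumed in [0, 1]) and map to a result band."""
    score = 100 * prod(layer_scores)
    if critical_flags:
        return score, "Critical"  # blocks ship regardless of score
    if score >= 75:
        return score, "Pass"
    if score >= 50:
        return score, "Partial"
    return score, "Fail"


# Invented scores for illustration.
l3 = layer_score({"correctness": 1.0, "completeness": 0.8,
                  "effort_reduction": 0.75, "actionability": 1.0,
                  "scenario_fit": 1.0}, L3_WEIGHTS)
clean = journey_quality([0.95, 1.0, l3, 0.9], [])
zeroed = journey_quality([1.0, 1.0, 0.0, 1.0], [])
flagged = journey_quality([0.95, 1.0, l3, 0.9], ["Hallucination"])
```

Under this normalization the product shrinks quickly: four equal layer scores need to be roughly 0.93 or higher (0.75 to the power 1/4) to reach the Pass band.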

Evaluation Pipeline

Human-first calibration → LLM-as-Judge at scale

Golden Set

Build card + landing pairs from eval dataset with expert annotations per layer

Human Rating

2 raters per example, adjudication for disagreements, detailed rationale

LLM Judge Calibration

Calibrate against human labels, target Cohen's κ ≥ 0.7 per dimension

Scale

LLM judge on full corpus, 10% human spot-check for drift
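For the calibration step, Cohen's κ between human labels and LLM-judge labels can be computed as in the sketch below. This is the standard unweighted form; for ordinal 1-5 ratings a weighted variant may be a better fit, which the text does not specify. The function name and sample labels are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of examples")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Invented pass/fail labels for one dimension (1 = pass, 0 = fail).
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
judge = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

kappa = cohens_kappa(human, judge)
calibrated = kappa >= 0.7  # target from the text
```

In this toy case raw agreement is 80% but κ is only 0.6, so the judge would not yet meet the calibration target for this dimension.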

Per-example evaluation record: browsing_history → ground_truth_intent → card → landing_page → L1 scores → L2 scores → L3 scores (scenario-conditional) → L4 scores → failure_flags → rationale
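One way to hold the per-example record is a small dataclass like the sketch below. The field names follow the record listed above; the field types, the class name, and the `blocks_ship` helper are assumptions. The set of critical failure names is copied from the Failure Mode Taxonomy rows marked Critical or Ship-Blocker.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Failure modes marked Critical or Ship-Blocker in the Failure Mode Taxonomy.
CRITICAL_FAILURES = {
    "Wrong Topic", "Should Not Generate", "Over-Triggering",
    "Privacy Violation", "Hallucination", "Promise-Delivery Gap", "Over-Promise",
}


@dataclass
class EvalRecord:
    browsing_history: List[str]
    ground_truth_intent: str
    card: Dict[str, str]             # e.g. title and CTA (assumed shape)
    landing_page: str
    l1_scores: Dict[str, float] = field(default_factory=dict)
    l2_scores: Dict[str, float] = field(default_factory=dict)
    l3_scores: Dict[str, float] = field(default_factory=dict)  # scenario-conditional
    l4_scores: Dict[str, float] = field(default_factory=dict)
    failure_flags: List[str] = field(default_factory=list)
    rationale: str = ""

    def blocks_ship(self) -> bool:
        """True if any flagged failure is in the critical set."""
        return any(flag in CRITICAL_FAILURES for flag in self.failure_flags)


# Invented example record.
record = EvalRecord(
    browsing_history=["https://example.com/headphones-review"],
    ground_truth_intent="shopping for headphones",
    card={"title": "Continue researching noise-cancelling headphones",
          "cta": "See comparison"},
    landing_page="(landing content)",
    failure_flags=["Vague Card"],
)
minor_only = record.blocks_ship()
record.failure_flags.append("Hallucination")
with_critical = record.blocks_ship()
```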

Open Questions for Future Evaluation

Areas we've identified but haven't yet built metrics for — tracked here for future framework updates

Journey Evolution

— Offline

Does a Journey evolve correctly as the user browses more pages?

A Journey created after 5 pages should update its title, CTA, and related pages as the user continues browsing. If a user starts comparing hotels and then narrows to one specific hotel, the Journey should reflect the progression — not stay frozen at the initial broad search. Key questions: Does the topic sharpen over time? Are new high-signal pages incorporated? Does the CTA update to match the user’s current stage (e.g., from “Compare options” to “Book this hotel”)? Not yet measurable: Current eval dataset captures single-snapshot sessions. Requires longitudinal session data with multiple Journey generation checkpoints.

Expiration Accuracy

— Offline

Does the system correctly predict when a Journey should expire?

Journeys have a shelf life. A “weekend trip to Portland” Journey should expire after the weekend. A “compare headphones” Journey should expire after the user purchases one. A “prepare for Monday’s presentation” Journey is worthless on Tuesday. Key questions: Does the system detect task completion signals (e.g., purchase confirmation page)? Does it respect temporal deadlines embedded in the intent? Does it avoid showing stale Journeys that have outlived their usefulness? Not yet measurable: Requires time-aware eval data with explicit expiration signals and post-expiration browsing behavior.

Timeliness

— Offline

Is the Journey card surfaced at the right moment — not too early, not too late?

A Journey surfaced after 2 pages of browsing has too little signal to be useful. A Journey surfaced 3 days after the user finished their research is too late. The ideal moment is when the user has enough context for a meaningful card but hasn’t yet completed the task. Key questions: What’s the minimum browsing signal needed for a high-quality Journey? How quickly after a session does the card need to appear to still be relevant? Is there a “golden window” for each scenario type? Not yet measurable: Requires real-time generation timing data and user feedback on perceived timeliness.

What's Next

From dataset to full evaluation pipeline

Phase	Deliverable	Status
Phase 1: Evaluation Dataset	64 L1 tasks + 10 L2 profiles with ground truth	✅ Done
Phase 2: Evaluation Method	4-layer quality framework with measurable dimensions	✅ Done
Phase 3: Journey Generation	Run Journeys API on dataset, collect card + landing pairs	⬜ Next
Phase 4: Expert Labeling	PM reviews 15-20 pairs with 4-layer rubric + failure flags	⬜ Planned
Phase 5: LLM Judge + Regression	Calibrated LLM evaluator, automated on every model/prompt change	⬜ Planned
