Why Evaluating Journeys Is Hard
Three structural challenges make standard eval approaches break down
- Messy input, multi-stage pipeline. Real-world browsing is noisy — dozens of tabs, background pages, half-finished lookups mixed with deep research. Journeys runs a multi-stage pipeline (intent detection → grouping → trigger decision → card generation → next-step suggestion) where each stage’s quality compounds. A failure anywhere — misdetected intent, wrong grouping, premature trigger — collapses the end result.
- Trust is the product. Journeys surfaces cards proactively with zero user input — it must earn attention every time. If a card promises “Compare headphone prices” and the landing shows a generic summary, trust erodes, users stop clicking, the feature dies. We need to evaluate promise-delivery alignment, not just “was the response good.”
- Different scenarios need different help. Shopping needs comparison tables, learning needs synthesis, getting things done needs shortcuts. A flat “rate 1–5” collapses distinct failure modes into one useless number.
Given these challenges, we structure evaluation around the user’s decision point — the click:
Selection (Right topic?) → Card (Clear promise?) → Click (Earned attention) → Landing (Delivered value?) → Outcome (Goal achieved?)
| | Before Click | After Click |
| What we evaluate | Did the system pick the right topic, at the right time, and present a clear, honest card? | Did the landing page deliver on the card’s promise and actually move the user forward? |
| Layers | L1 (Selection Quality) + L2 (Card Quality) | L3 (Landing Quality) + L4 (Outcome Quality) |
| Key failure | Over-triggering, wrong topic, vague card | Hallucination, broken promise, wrong help type |
The Trust Equation
Trust = Σ (Delivery − Promise) over time
If the card over-promises and the landing under-delivers, trust erodes — users stop clicking — Journeys dies. Promise-Delivery Alignment is the single most important cross-layer metric.
Layer 1 — Selection Quality
Out of everything we know about this user, did we pick the highest-value thing to surface?
Trigger Precision
Pass / Fail
Offline
Should a Journey have been triggered at all — or should the system have stayed silent?
This is the binary gate before all other quality metrics apply. A Journey system must correctly decide when to fire and when to stay silent. Over-triggering (generating Journeys for completed tasks, trivial lookups, or noise) wastes user attention and erodes trust. Under-triggering (missing genuinely valuable research sessions) means the feature fails to deliver its core promise.
How to measure: Treat Journey generation as a binary classification problem. For each browsing session in the eval dataset:
- True Positive = POS task correctly generates a Journey.
- True Negative = NEG task correctly produces no Journey.
- False Positive (over-trigger) = NEG task incorrectly generates a Journey.
- False Negative (under-trigger) = POS task fails to generate a Journey.
Report standard classification metrics: Precision (of all triggered Journeys, how many should have been triggered?), Recall (of all sessions that deserved a Journey, how many actually got one?), and F1.
Target: Precision ≥ 95%, Recall ≥ 90%. False positives are weighted more heavily than false negatives: a Journey that shouldn’t exist is worse than a Journey that’s missing.
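The classification framing above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual harness: the `TriggerResult` record and its field names are hypothetical, and only the targets (Precision ≥ 95%, Recall ≥ 90%) come from the text.

```python
from dataclasses import dataclass


@dataclass
class TriggerResult:
    """One eval task. Field names are illustrative assumptions."""
    task_id: str
    expected_journey: bool   # ground truth: True for POS tasks, False for NEG
    generated_journey: bool  # did the system produce a Journey?


def trigger_metrics(results):
    """Compute confusion counts plus precision, recall, and F1."""
    tp = sum(r.expected_journey and r.generated_journey for r in results)
    fp = sum((not r.expected_journey) and r.generated_journey for r in results)
    fn = sum(r.expected_journey and (not r.generated_journey) for r in results)
    tn = sum((not r.expected_journey) and (not r.generated_journey) for r in results)

    # Guard against empty denominators (e.g. nothing triggered at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall, "f1": f1,
        # Targets taken from the text: Precision >= 95%, Recall >= 90%.
        "meets_target": precision >= 0.95 and recall >= 0.90,
    }


sample = [
    TriggerResult("pos-1", True, True),
    TriggerResult("pos-2", True, True),
    TriggerResult("pos-3", True, False),   # under-trigger
    TriggerResult("neg-1", False, False),
    TriggerResult("neg-2", False, True),   # over-trigger
]
metrics = trigger_metrics(sample)
```

The sketch reports over-triggers and under-triggers separately so that the heavier weighting of false positives can be applied downstream; how that weighting is quantified is not specified in the text.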
Safety & Privacy
Pass / Fail
Offline
Does this Journey violate user privacy by surfacing sensitive personal information?
Journeys must not surface content that violates user privacy — particularly browsing activity related to health/medical conditions, financial/banking information, adult content, legal services, or relationship/dating. Even if the browsing data is rich and intentful, privacy-sensitive topics must be suppressed. Multi-layer sensitive content filtering (WCF, DLP, ML classification) is applied to detect and block such content. How to measure: Privacy-tagged NEG tasks (NEG-F9-*) must produce zero Journeys. Any privacy failure is a ship-blocker regardless of other scores.
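The zero-Journey rule for privacy-tagged tasks could be enforced with a check like the one below. It is a sketch under assumptions: the mapping from task ID to generated Journeys and the example IDs are hypothetical, and only the `NEG-F9-*` pattern comes from the text.

```python
import fnmatch


def privacy_failures(journeys_by_task, pattern="NEG-F9-*"):
    """Return privacy-tagged task IDs that produced at least one Journey.

    journeys_by_task: dict mapping task ID -> list of generated Journeys
    (the shape of this mapping is an assumption for illustration).
    """
    return sorted(
        task_id
        for task_id, journeys in journeys_by_task.items()
        if fnmatch.fnmatchcase(task_id, pattern) and len(journeys) > 0
    )


run_output = {
    "NEG-F9-001": [],                  # correctly suppressed
    "NEG-F9-002": ["some journey"],    # privacy failure
    "POS-A1-001": ["some journey"],    # not privacy-tagged, ignored here
}
failures = privacy_failures(run_output)
ship_blocked = len(failures) > 0  # any privacy failure blocks ship
```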
Relevance
Does each Journey correctly understand what the user was trying to do?
The generated Journey topic must match the user’s actual browsing intent. A user researching “best noise-cancelling headphones” should get a headphones comparison Journey — not a generic “audio equipment” card. How to measure: Compare Journey topic against ground-truth intent label in the eval dataset. Binary pass/fail per task.
Groundedness
Is the Journey linked to the right source URLs, without missing important ones?
The Journey’s relatedPages must include the most important pages from the user’s browsing session and must not hallucinate URLs the user never visited. How to measure: Precision (no hallucinated URLs) and recall (key pages included) against the ground-truth browsing history. Flag if the user’s most-dwelled primary pages are missing.
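One way to compute URL precision and recall is with plain set arithmetic, as sketched below. The function name and argument shapes are hypothetical, and URLs are compared by exact string match; a real pipeline would likely normalize URLs first.

```python
def url_grounding(related_pages, visited_urls, key_pages):
    """Score a Journey's relatedPages against ground-truth browsing history.

    related_pages: URLs attached to the Journey
    visited_urls:  every URL the user actually visited
    key_pages:     the most-dwelled primary pages that should be included
    """
    related = set(related_pages)
    visited = set(visited_urls)
    key = set(key_pages)

    hallucinated = related - visited   # URLs the user never visited
    missing_key = key - related        # important pages left out

    # Assumption: an empty relatedPages list has no hallucinations,
    # so precision defaults to 1.0 in that case.
    precision = len(related & visited) / len(related) if related else 1.0
    recall = len(key & related) / len(key) if key else 1.0

    return {
        "precision": precision,
        "recall": recall,
        "hallucinated": hallucinated,
        "missing_key": missing_key,
    }


result = url_grounding(
    related_pages=["https://a.example", "https://b.example", "https://x.example"],
    visited_urls=["https://a.example", "https://b.example",
                  "https://c.example", "https://d.example"],
    key_pages=["https://a.example", "https://c.example"],
)
```

Returning the offending URLs alongside the ratios makes it easy to flag the specific missing high-dwell pages the text calls out.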
Helpfulness
Among sessions that should trigger, did we pick the highest-value topic to surface?
Given that Trigger Precision already gates whether a Journey should exist, Helpfulness evaluates value density: did we surface the topic that would help this user the most right now? A user with three active research threads should see the most urgent or highest-progress one first, not a topic they barely started. In the eval dataset's Layer 2 profiles (a dataset tier, distinct from Layer 2 Card Quality), this measures whether the system correctly ranks competing topics by value. How to measure: In the dataset's Layer 1 tasks, verify that POS tasks produce Journeys aligned with the highest-signal browsing goal. In the dataset's Layer 2 profiles, compare the ranked order of generated Journeys against expected priority from ground truth. Penalize surfacing low-value topics when higher-value ones exist.
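The text does not say how ranked orders are compared. One simple option, offered here only as an assumption, is pairwise agreement: the fraction of topic pairs that the system orders the same way as the ground truth. The function name and topic labels are hypothetical.

```python
from itertools import combinations


def pairwise_rank_agreement(generated, expected):
    """Fraction of topic pairs ranked in the same order as ground truth.

    generated: topics in the order the system surfaced them
    expected:  topics in ground-truth priority order (highest value first)
    Topics missing from `generated` are skipped here; a missing
    high-value topic should be penalized separately.
    """
    position = {topic: i for i, topic in enumerate(generated)}
    shared = [t for t in expected if t in position]

    # combinations() preserves input order, so each pair (a, b)
    # has a ranked above b in the ground truth.
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0  # nothing to compare

    agreements = sum(position[a] < position[b] for a, b in pairs)
    return agreements / len(pairs)


agreement = pairwise_rank_agreement(
    generated=["topic_b", "topic_a", "topic_c"],
    expected=["topic_a", "topic_b", "topic_c"],
)
```

A score of 1.0 means every pair is ordered as expected; swapping the top two of three topics, as in the example, leaves two of three pairs in agreement.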
Technical Feasibility
20%
Cross-Layer
Can we actually deliver on what this Journey promises?
Don’t surface a Journey with a CTA like “Compare prices” if the landing experience can’t actually compare prices. Don’t promise “pick up where you left off” if we can’t restore meaningful context. How to measure: Flag any Journey whose CTA implies capability beyond what the landing experience supports. Requires comparing L1 output against L3 capability — hence cross-layer.
Eval Dataset Coverage: The eval dataset has its own two tiers, which are distinct from the four quality layers in this framework. Dataset Layer 1 (64 tasks) directly tests Trigger Precision, Relevance, Groundedness, Helpfulness, and Safety with ground-truth intent labels. Dataset Layer 2 (10 profiles) adds noise-filtering, dedup, and ranking tests at production-realistic scale. Technical Feasibility requires end-to-end evaluation with the landing experience.
Layer 2 — Card Quality
Did the card earn the click — and set honest expectations for what's behind it?
Clarity
Can users instantly understand what this Journey is about and what they'll get if they click?
The title must be readable and specific — not jargon, not vague. The CTA must clearly communicate the next action. "Continue researching noise-cancelling headphones" is clear. "Explore audio" is not. How to measure: Human raters answer "After reading only the card title and CTA, can you predict what the landing page contains?" — binary yes/no. LLM judge can replicate at scale after calibration.
Promise Accuracy
50%
Cross-Layer
Does the card accurately represent what the landing experience will deliver?
The card is a contract. If the title says "Prices dropped for Tokyo flights," the landing must show actual price changes. If the CTA says "See comparison," there must be a comparison — not a search page. Over-promising is worse than under-promising. How to measure: Compare card claims against landing content. Score promise-delivery alignment 1-5. Any score ≤ 2 is a critical failure. Requires L2 × L3 evaluation.
| Dimension | 1 — Fail | 3 — Acceptable | 5 — Excellent |
| Clarity | Vague or misleading — user can't predict what they'll see | Generally correct but generic title/CTA | Specific, immediately understood, matches user's mental model |
| Promise Accuracy | Card claims something the landing doesn't deliver | Landing partially delivers on the card's promise | Landing fully matches or exceeds what the card promised |
Layer 3 — Landing Quality
After the click, did the Copilot response actually deliver value? This is where the user decides if Journeys is worth coming back to.
Are the claims in the Copilot response factually grounded in the user's browsing data?
Every factual claim in the response must be traceable to a source page the user actually visited. If the response says "the Sony WH-1000XM5 was rated 4.8/5," that rating must appear in the browsed pages. Hallucinated facts, invented prices, or unsupported recommendations are critical failures. How to measure: For each factual claim in the response, verify against raw_page_bodies in the eval dataset. Calculate claim precision = (supported claims / total claims). Target: ≥ 95%.
Does the response cover the key threads of the user's research, or miss important ones?
If the user browsed 3 competing products, the response should mention all 3 — not just 1. If the user explored both budget and premium options, both should appear. Missing a major thread the user spent significant time on is a meaningful quality gap. How to measure: Compare topics/entities in the response against primary pages in the browsing history. Calculate topic recall = (covered topics / total primary topics). Weighted by dwell time — missing a high-dwell topic is worse than missing a briefly visited one.
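Dwell-weighted topic recall can be illustrated as follows. The text says recall is weighted by dwell time but not how, so this sketch assumes each topic's weight is its share of total dwell seconds. The function name and topic labels are hypothetical.

```python
def weighted_topic_recall(primary_topics, covered_topics):
    """Dwell-weighted recall of primary topics in the response.

    primary_topics: dict mapping topic -> dwell time in seconds
    covered_topics: set of topics that the response mentions
    Assumption: a topic's weight is its share of total dwell time.
    """
    total_dwell = sum(primary_topics.values())
    if total_dwell == 0:
        return 0.0
    covered_dwell = sum(
        dwell for topic, dwell in primary_topics.items()
        if topic in covered_topics
    )
    return covered_dwell / total_dwell


dwell = {"product_a": 300, "product_b": 120, "product_c": 80}

# Response covers two of three products, including the highest-dwell one.
recall_hit_main = weighted_topic_recall(dwell, {"product_a", "product_c"})

# Response also covers two of three, but misses the highest-dwell one.
recall_miss_main = weighted_topic_recall(dwell, {"product_b", "product_c"})
```

Both responses have the same unweighted recall (2 of 3), but the weighted score separates them: missing the topic the user spent the most time on costs far more.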
Effort Reduction
20%
Offline
Does the response organize and synthesize beyond what the user already has?
The response must add structure that the raw browsing history lacks. A good response turns 10 scattered product pages into a comparison table. A bad response just lists the same URLs the user already visited. The key question: did the Copilot do work the user would otherwise have to do themselves? How to measure: Expert rating 1-5. Score 1 = "just a list of links I already visited." Score 5 = "organized my research in a way that saves me 20+ minutes." LLM judge can assess structural transformation (e.g., did it create comparisons, summaries, or synthesis from raw pages?).
Actionability
15%
Offline
Are there concrete, clickable next steps — not vague suggestions?
The response should enable immediate action. For shopping: direct links to buy, with prices. For research: specific follow-up questions or sources to check next. For trip planning: bookable options with dates. "You might want to look into this further" is a failure. How to measure: Count actionable elements (links, specific recommendations, next steps with clear targets). Binary: does the response contain at least one concrete action the user can take without going back to search? Scenario-dependent — a learning Journey may have lower actionability expectations than a shopping one.
Is the response format appropriate for the type of task?
The modality of help must match the user's scenario. Shopping → comparison table, not essay. Learning → synthesis with sources, not a link dump. Getting things done → direct shortcut, not a long explanation. How to measure: Classify the response format (comparison, summary, link list, step-by-step, narrative) and compare against expected format for the classified scenario type. Mismatch = fail.
| Scenario | ✓ Good | ✗ Bad |
| 🛒 Shopping | Comparison table, real prices, buy links | Generic paragraph, stale prices, no links |
| 📚 Learning | Synthesis, progressive depth, cited sources | List of blue links already visited |
| ⚡ Get Things Done | Pre-filled action, one-click next step | Info dump when user needed a button |
| ✈️ Trip Planning | Itinerary-aware options, bookable links | Random travel blog articles |
| 🔧 Troubleshooting | Step-by-step fix, device-specific | Generic FAQ from unrelated product |
| 🔍 Exploring | Curated inspiration, broader options | Premature narrowing to one answer |
Layer 4 — Outcome Quality
End-to-end: did the full Journey — from card to landing — actually move the user forward?
Self-Sufficiency
60%
Offline
Could the user complete their goal from this response alone, without going back to search?
This is the ultimate offline proxy for value delivery. If a user was comparing headphones, can they make a purchase decision from the landing content? If they were planning a trip, can they book from here? How to measure: Expert rating 1-5. "Given this landing content and the user's ground-truth goal, how likely is the user to make progress without leaving?" Score 1 = "useless, would start over." Score 5 = "could complete the task right here." This is the single offline metric most predictive of real-world satisfaction.
End-to-End Coherence
40%
Cross-Layer
Does the full chain — intent classification → card title → landing content — tell a consistent story?
Each layer must reinforce the others. If L1 identifies "shopping for headphones," the card title should say "headphones," and the landing should show headphone comparisons — not earbuds or speakers. Inconsistency anywhere in the chain confuses users. How to measure: Extract the primary topic/entity from each layer and check alignment. Any layer that introduces a different topic or contradicts another layer = coherence failure.
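The alignment check can be sketched as below, assuming the primary topic of each layer has already been extracted into a string. Exact matching after trimming and lowercasing is a deliberate simplification; a real check would more plausibly use entity matching or an LLM judge. The layer keys are hypothetical.

```python
def coherence_failures(layer_topics):
    """Return the layers whose primary topic differs from the first layer's.

    layer_topics: dict mapping layer name -> extracted primary topic,
    in chain order (intent first). Matching is exact after trimming
    and lowercasing, which is a simplifying assumption.
    """
    normalized = {
        layer: topic.strip().lower() for layer, topic in layer_topics.items()
    }
    if not normalized:
        return []
    # Dicts preserve insertion order, so the first entry is the intent layer.
    reference = next(iter(normalized.values()))
    return [layer for layer, topic in normalized.items() if topic != reference]


chain = {
    "intent": "Headphones",
    "card_title": "headphones ",
    "landing": "earbuds",      # drifted to a different product category
}
broken_layers = coherence_failures(chain)
is_coherent = not broken_layers
```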
Engagement Signals
—
Online
Did the user engage with the landing? Continue the task? Complete a transaction?
Behavioral signals like click-through from card, dwell time on landing, downstream task completion, and return visits. These are the ground truth for value delivery — but only available in production with real users. Not measurable offline. Included here to mark the boundary of what offline eval can and cannot capture.
Longitudinal Trust
—
Online
Does the user click Journey cards more or less over time?
The ultimate success metric for Journeys as a product. If card CTR increases over weeks, the value chain is working. If it declines, the promise-delivery gap is eroding confidence. Not measurable offline. This is the north-star online metric that the offline eval framework aims to predict.
Composite Score
How layer scores combine — multiplicative, not additive
Journey Quality = L1(Selection) × L2(Card) × L3(Landing | Scenario) × L4(Outcome)
Multiplicative design. If any layer is zero (critical failure), the entire Journey scores zero. A wrong-topic card with a perfect landing is still a total miss. A perfect card with a hallucinating landing destroys trust. One broken link in the chain and the value is lost.
| Result | What It Means | Action |
| ✅ Pass (≥ 75) | Journey delivers end-to-end value for this scenario | Ship-ready |
| ⚠️ Partial (50-74) | Value exists but with meaningful gaps | Identify weakest layer, targeted fix |
| ✗ Fail (< 50) | Journey does not deliver value | Root-cause by layer before ship |
| 🚫 Critical | Any critical failure flag (hallucination, privacy, broken promise) | Blocks ship — fix regardless of score |
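The multiplicative composite and the result bands above can be sketched together. The text does not define the scale of each layer score, so this example assumes every layer is normalized to the range 0 to 1 and the product is scaled to 0 to 100 to match the thresholds in the table. The function and flag names are hypothetical.

```python
def journey_quality(l1, l2, l3, l4, critical_flags=()):
    """Combine layer scores multiplicatively and map to a result band.

    l1..l4: layer scores, assumed to be normalized to [0, 1]
    critical_flags: names of any critical failures
                    (e.g. hallucination, privacy, broken promise)
    Returns (score on a 0-100 scale, verdict string).
    """
    for value in (l1, l2, l3, l4):
        if not 0.0 <= value <= 1.0:
            raise ValueError("layer scores must be in [0, 1]")

    score = 100.0 * l1 * l2 * l3 * l4

    if critical_flags:
        verdict = "Critical"   # blocks ship regardless of score
    elif score >= 75:
        verdict = "Pass"
    elif score >= 50:
        verdict = "Partial"
    else:
        verdict = "Fail"
    return score, verdict


strong = journey_quality(0.95, 0.95, 0.90, 0.95)
uniform = journey_quality(0.90, 0.90, 0.90, 0.90)
zeroed = journey_quality(1.0, 1.0, 0.0, 1.0)
flagged = journey_quality(0.95, 0.95, 0.90, 0.95, critical_flags=["privacy"])
```

Under this normalization the product is strict: four layers at 0.90 each multiply to roughly 65.6, which lands in Partial, so a Pass requires most layers to score well above 0.90.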
Open Questions for Future Evaluation
Areas we've identified but haven't yet built metrics for — tracked here for future framework updates
Journey Evolution
—
Offline
Does a Journey evolve correctly as the user browses more pages?
A Journey created after 5 pages should update its title, CTA, and related pages as the user continues browsing. If a user starts comparing hotels and then narrows to one specific hotel, the Journey should reflect the progression — not stay frozen at the initial broad search. Key questions: Does the topic sharpen over time? Are new high-signal pages incorporated? Does the CTA update to match the user’s current stage (e.g., from “Compare options” to “Book this hotel”)? Not yet measurable: Current eval dataset captures single-snapshot sessions. Requires longitudinal session data with multiple Journey generation checkpoints.
Expiration Accuracy
—
Offline
Does the system correctly predict when a Journey should expire?
Journeys have a shelf life. A “weekend trip to Portland” Journey should expire after the weekend. A “compare headphones” Journey should expire after the user purchases one. A “prepare for Monday’s presentation” Journey is worthless on Tuesday. Key questions: Does the system detect task completion signals (e.g., purchase confirmation page)? Does it respect temporal deadlines embedded in the intent? Does it avoid showing stale Journeys that have outlived their usefulness? Not yet measurable: Requires time-aware eval data with explicit expiration signals and post-expiration browsing behavior.
Is the Journey card surfaced at the right moment — not too early, not too late?
A Journey surfaced after 2 pages of browsing has too little signal to be useful. A Journey surfaced 3 days after the user finished their research is too late. The ideal moment is when the user has enough context for a meaningful card but hasn’t yet completed the task. Key questions: What’s the minimum browsing signal needed for a high-quality Journey? How quickly after a session does the card need to appear to still be relevant? Is there a “golden window” for each scenario type? Not yet measurable: Requires real-time generation timing data and user feedback on perceived timeliness.