Journeys Evaluation Dataset

Intent-grounded browsing dataset for evaluating Edge Journeys quality

Background

Edge Journeys is an AI-powered browser feature in Microsoft Edge's Copilot Mode that transforms browsing history into task-themed clusters, helping users resume and continue their work without starting over. The feature surfaces up to 3 Journey cards on the New Tab Page (NTP), each with a title, preview image, and suggested next-step action.

This dataset provides intent-grounded evaluation data — browsing sessions where we know exactly what the user was trying to do. This lets us objectively measure whether a generated Journey "got it right" instead of relying on heuristic metrics alone.

The dataset was collected using an LLM-driven browsing agent (claude-opus-4.6-1m) that role-plays as different user personas, making reactive browsing decisions based on actual page content — not scripted paths.

Layer 1 — Generation Quality (64 tasks)

Each task is a single-topic browsing session (~10 pages) with a defined browsing goal and ground-truth intent. Tests whether Journeys can generate a good card from relatively clean data.

  • POS (Positive, 50 tasks): A Journey should be generated. The user was actively researching a topic (e.g., comparing hotels, shopping for headphones, writing a paper). We evaluate card quality, CTA accuracy, and groundedness.
  • NEG (Negative, 14 tasks): A Journey should NOT be generated. Reasons include: task already completed (bought the product), trivial lookup (checked weather), noise-only session (background tabs), privacy-sensitive topic (surprise party, health), or expired event (past concert).

Layer 2 — Selection Quality (10 profiles)

Each profile is a multi-topic, multi-day browsing history (~100 pages) that combines 3-4 Layer 1 tasks with 60-70% background noise.

  • Tests signal extraction: Can Journeys find the 2-3 real topics buried in noise?
  • Tests suppression: Are completed tasks, sensitive topics, and expired events correctly excluded?
  • Tests dedup: If a user researched two similar topics, does Journeys merge or separate them correctly?
  • Tests ranking: Is the most important Journey shown first?

Evaluation Method

Defining what "good" means for Journeys, and how to measure it — across the full value chain from topic selection to user outcome.

Why Evaluating Journeys Is Hard

Three structural challenges make standard eval approaches break down

  • Messy input, multi-stage pipeline. Real-world browsing is noisy — dozens of tabs, background pages, half-finished lookups mixed with deep research. Journeys runs a multi-stage pipeline (intent detection → grouping → trigger decision → card generation → next-step suggestion) where each stage’s quality compounds. A failure anywhere — misdetected intent, wrong grouping, premature trigger — collapses the end result.
  • Trust is the product. Journeys surfaces cards proactively with zero user input — it must earn attention every time. If a card promises “Compare headphone prices” and the landing shows a generic summary, trust erodes, users stop clicking, the feature dies. We need to evaluate promise-delivery alignment, not just “was the response good.”
  • Different scenarios need different help. Shopping needs comparison tables, learning needs synthesis, getting things done needs shortcuts. A flat “rate 1–5” collapses distinct failure modes into one useless number.

Given these challenges, we structure evaluation around the user’s decision point — the click:

Selection (Right topic?) → Card (Clear promise?) → Click (Earned attention) → Landing (Delivered value?) → Outcome (Goal achieved?)

                 | Before Click | After Click
What we evaluate | Did the system pick the right topic, at the right time, and present a clear, honest card? | Did the landing page deliver on the card’s promise and actually move the user forward?
Layers           | L1 (Selection Quality) + L2 (Card Quality) | L3 (Landing Quality) + L4 (Outcome Quality)
Key failure      | Over-triggering, wrong topic, vague card | Hallucination, broken promise, wrong help type

The Trust Equation

Trust = Σ (Delivery − Promise) over time

If the card over-promises and the landing under-delivers, each term is negative and trust erodes: users stop clicking and Journeys dies. Promise-Delivery Alignment is the single most important cross-layer metric.

Layer 1 — Selection Quality

Out of everything we know about this user, did we pick the highest-value thing to surface?

Trigger Precision
Pass / Fail Offline
Should a Journey have been triggered at all — or should the system have stayed silent?
This is the binary gate before all other quality metrics apply. A Journey system must correctly decide when to fire and when to stay silent. Over-triggering (generating Journeys for completed tasks, trivial lookups, or noise) wastes user attention and erodes trust. Under-triggering (missing genuinely valuable research sessions) means the feature fails to deliver its core promise.
How to measure: Treat Journey generation as a binary classification problem. For each browsing session in the eval dataset:
  • True Positive = POS task correctly generates a Journey.
  • True Negative = NEG task correctly produces no Journey.
  • False Positive (over-trigger) = NEG task incorrectly generates a Journey.
  • False Negative (under-trigger) = POS task fails to generate a Journey.
Report standard classification metrics: Precision (of all triggered Journeys, how many should have been triggered?), Recall (of all sessions that deserved a Journey, how many actually got one?), and F1. Target: Precision ≥ 95%, Recall ≥ 90%. False positives are weighted more heavily than false negatives: a Journey that shouldn’t exist is worse than a Journey that’s missing.
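The classification bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration, not the production harness: the `TriggerResult` type, the task ID format, and the example counts are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TriggerResult:
    task_id: str          # hypothetical ID format, e.g. "POS-001" or "NEG-003"
    should_trigger: bool  # ground truth: True for POS tasks, False for NEG tasks
    did_trigger: bool     # True if the system generated a Journey


def trigger_metrics(results: Iterable[TriggerResult]) -> Dict[str, float]:
    """Confusion counts plus precision, recall, and F1 for the trigger decision."""
    results = list(results)
    tp = sum(1 for r in results if r.should_trigger and r.did_trigger)
    fp = sum(1 for r in results if not r.should_trigger and r.did_trigger)
    fn = sum(1 for r in results if r.should_trigger and not r.did_trigger)
    tn = sum(1 for r in results if not r.should_trigger and not r.did_trigger)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Illustrative numbers only: 47 of 50 POS tasks trigger, 1 of 14 NEG tasks triggers.
results = (
    [TriggerResult(f"POS-{i:03d}", True, i < 47) for i in range(50)]
    + [TriggerResult(f"NEG-{i:03d}", False, i < 1) for i in range(14)]
)
metrics = trigger_metrics(results)
meets_target = metrics["precision"] >= 0.95 and metrics["recall"] >= 0.90
```

Plain F1 treats both error types symmetrically. One way to reflect the heavier weight on false positives would be an F-beta score with beta below 1, but the exact weighting is left open here.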
Safety & Privacy
Pass / Fail Offline
Does this Journey violate user privacy by surfacing sensitive personal information?
Journeys must not surface content that violates user privacy — particularly browsing activity related to health/medical conditions, financial/banking information, adult content, legal services, or relationship/dating. Even if the browsing data is rich and intentful, privacy-sensitive topics must be suppressed. Multi-layer sensitive content filtering (WCF, DLP, ML classification) is applied to detect and block such content. How to measure: Privacy-tagged NEG tasks (NEG-F9-*) must produce zero Journeys. Any privacy failure is a ship-blocker regardless of other scores.
Relevance
30% Offline
Does each Journey correctly understand what the user was trying to do?
The generated Journey topic must match the user’s actual browsing intent. A user researching “best noise-cancelling headphones” should get a headphones comparison Journey — not a generic “audio equipment” card. How to measure: Compare Journey topic against ground-truth intent label in the eval dataset. Binary pass/fail per task.
Groundedness
25% Offline
Is the Journey linked to the right source URLs, without missing important ones?
The Journey’s relatedPages must include the most important pages from the user’s browsing session and must not hallucinate URLs the user never visited. How to measure: Precision (no hallucinated URLs) and recall (key pages included) against the ground-truth browsing history. Flag if the user’s most-dwelled primary pages are missing.
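A rough sketch of the URL-level check, assuming URLs are already normalized (scheme, trailing slashes, tracking parameters) before comparison. The function name and the `key_urls` input (the ground-truth primary pages) are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable


def url_groundedness(related_pages: Iterable[str],
                     visited_urls: Iterable[str],
                     key_urls: Iterable[str]) -> Dict[str, object]:
    """Precision = share of related pages the user really visited.
    Recall = share of ground-truth key pages that the Journey includes."""
    related, visited, key = set(related_pages), set(visited_urls), set(key_urls)
    hallucinated = related - visited   # URLs the user never visited
    missing_key = key - related        # important pages the Journey dropped
    # An empty relatedPages list has no hallucinations, so precision is 1.0 by convention.
    precision = len(related & visited) / len(related) if related else 1.0
    recall = len(related & key) / len(key) if key else 1.0
    return {"precision": precision, "recall": recall,
            "hallucinated": hallucinated, "missing_key": missing_key}


# Illustrative data only.
report = url_groundedness(
    related_pages=["a.example/1", "a.example/2", "x.example/9"],
    visited_urls=["a.example/1", "a.example/2", "b.example/3", "c.example/4"],
    key_urls=["a.example/1", "b.example/3"],
)
```

Any non-empty `hallucinated` set would be flagged regardless of the precision value, and `missing_key` gives the list to inspect when high-dwell pages are absent.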
Helpfulness
25% Offline
Among sessions that should trigger, did we pick the highest-value topic to surface?
Given that Trigger Precision already gates whether a Journey should exist, Helpfulness evaluates value density: did we surface the topic that would help this user the most right now? A user with three active research threads should see the most urgent or highest-progress one first, not a topic they barely started. In the dataset's Layer 2 profiles, this measures whether the system correctly ranks competing topics by value. How to measure: In the dataset's Layer 1 tasks, verify that POS tasks produce Journeys aligned with the highest-signal browsing goal. In the dataset's Layer 2 profiles, compare the ranked order of generated Journeys against expected priority from ground truth. Penalize surfacing low-value topics when higher-value ones exist.
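The text does not fix a specific ranking metric. As one possible choice, the sketch below scores pairwise order agreement between the expected priority list and the generated order. The metric choice and all names are assumptions for illustration.

```python
from itertools import combinations
from typing import Sequence


def pairwise_rank_agreement(expected: Sequence[str], generated: Sequence[str]) -> float:
    """Share of topic pairs whose relative order in `generated` matches `expected`.
    Only topics present in both lists are compared; missing or extra topics are
    trigger and relevance failures and are measured elsewhere."""
    position = {topic: i for i, topic in enumerate(generated)}
    shared = [topic for topic in expected if topic in position]
    pairs = list(combinations(shared, 2))  # each pair is in expected order
    if not pairs:
        return 1.0  # fewer than two shared topics: nothing can be misordered
    concordant = sum(1 for a, b in pairs if position[a] < position[b])
    return concordant / len(pairs)


# Illustrative data only: the top two topics are swapped.
score = pairwise_rank_agreement(
    expected=["hotels", "headphones", "paper"],
    generated=["headphones", "hotels", "paper"],
)
```

Here two of the three pairs keep their expected order, so the score is 2/3. A stricter variant could check only whether the top-ranked Journey matches.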
Technical Feasibility
20% Cross-Layer
Can we actually deliver on what this Journey promises?
Don’t surface a Journey with a CTA like “Compare prices” if the landing experience can’t actually compare prices. Don’t promise “pick up where you left off” if we can’t restore meaningful context. How to measure: Flag any Journey whose CTA implies capability beyond what the landing experience supports. Requires comparing L1 output against L3 capability — hence cross-layer.

Eval Dataset Coverage: Dataset Layer 1 (64 tasks) directly tests Trigger Precision, Relevance, Groundedness, Helpfulness, and Safety & Privacy with ground-truth intent labels. Dataset Layer 2 (10 profiles) adds noise-filtering, dedup, and ranking tests at production-realistic scale. Technical Feasibility requires end-to-end evaluation with the landing experience.

Layer 2 — Card Quality

Did the card earn the click — and set honest expectations for what's behind it?

Clarity
50% Offline
Can users instantly understand what this Journey is about and what they'll get if they click?
The title must be readable and specific — not jargon, not vague. The CTA must clearly communicate the next action. "Continue researching noise-cancelling headphones" is clear. "Explore audio" is not. How to measure: Human raters answer "After reading only the card title and CTA, can you predict what the landing page contains?" — binary yes/no. LLM judge can replicate at scale after calibration.
Promise Accuracy
50% Cross-Layer
Does the card accurately represent what the landing experience will deliver?
The card is a contract. If the title says "Prices dropped for Tokyo flights," the landing must show actual price changes. If the CTA says "See comparison," there must be a comparison — not a search page. Over-promising is worse than under-promising. How to measure: Compare card claims against landing content. Score promise-delivery alignment 1-5. Any score ≤ 2 is a critical failure. Requires L2 × L3 evaluation.
Dimension | 1 (Fail) | 3 (Acceptable) | 5 (Excellent)
Clarity | Vague or misleading; user can't predict what they'll see | Generally correct but generic title/CTA | Specific, immediately understood, matches user's mental model
Promise Accuracy | Card claims something the landing doesn't deliver | Landing partially delivers on the card's promise | Landing fully matches or exceeds what the card promised

Layer 3 — Landing Quality

After the click, did the Copilot response actually deliver value? This is where the user decides if Journeys is worth coming back to.

Correctness
30% Offline
Are the claims in the Copilot response factually grounded in the user's browsing data?
Every factual claim in the response must be traceable to a source page the user actually visited. If the response says "the Sony WH-1000XM5 was rated 4.8/5," that rating must appear in the browsed pages. Hallucinated facts, invented prices, or unsupported recommendations are critical failures. How to measure: For each factual claim in the response, verify against raw_page_bodies in the eval dataset. Calculate claim precision = (supported claims / total claims). Target: ≥ 95%.
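A minimal sketch of the claim precision calculation. Claim extraction and the support check are the hard parts and would normally be done by a human rater or a calibrated LLM judge; the `naive_support` substring matcher below is only a placeholder, and all function names are hypothetical.

```python
from typing import Callable, Iterable, List


def naive_support(claim: str, page_bodies: List[str]) -> bool:
    """Placeholder verifier: case-insensitive substring match against page text."""
    needle = claim.lower()
    return any(needle in body.lower() for body in page_bodies)


def claim_precision(claims: Iterable[str],
                    raw_page_bodies: Iterable[str],
                    is_supported: Callable[[str, List[str]], bool] = naive_support) -> float:
    """claim precision = supported claims / total claims."""
    claims = list(claims)
    bodies = list(raw_page_bodies)
    if not claims:
        return 1.0  # no factual claims, so nothing is unsupported
    supported = sum(1 for claim in claims if is_supported(claim, bodies))
    return supported / len(claims)


# Illustrative data only.
bodies = ["The Sony WH-1000XM5 was rated 4.8/5 by reviewers."]
claims = ["rated 4.8/5", "costs $199"]
precision = claim_precision(claims, bodies)  # one of two claims is supported
meets_target = precision >= 0.95
```

Swapping `is_supported` for a judge-backed function keeps the arithmetic unchanged while replacing the placeholder matcher.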
Completeness
25% Offline
Does the response cover the key threads of the user's research, or miss important ones?
If the user browsed 3 competing products, the response should mention all 3 — not just 1. If the user explored both budget and premium options, both should appear. Missing a major thread the user spent significant time on is a meaningful quality gap. How to measure: Compare topics/entities in the response against primary pages in the browsing history. Calculate topic recall = (covered topics / total primary topics). Weighted by dwell time — missing a high-dwell topic is worse than missing a briefly visited one.
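One way to apply the dwell-time weighting is to weight each primary topic by the total time the user spent on it. The sketch below assumes dwell time in seconds has already been aggregated per topic; the function name and inputs are hypothetical.

```python
from typing import AbstractSet, Mapping


def weighted_topic_recall(topic_dwell_seconds: Mapping[str, float],
                          covered_topics: AbstractSet[str]) -> float:
    """Dwell-weighted recall: covered dwell time / total dwell time on primary topics."""
    total = sum(topic_dwell_seconds.values())
    if total == 0:
        return 1.0  # no primary topics with dwell time, so nothing can be missed
    covered = sum(seconds for topic, seconds in topic_dwell_seconds.items()
                  if topic in covered_topics)
    return covered / total


# Illustrative data only: the response skips the topic with the most dwell time.
dwell = {"Sony": 300.0, "Bose": 200.0, "Apple": 100.0}
recall = weighted_topic_recall(dwell, {"Bose", "Apple"})
```

Unweighted topic recall for this example would be 2/3, while the weighted value is 0.5, which reflects that the missed topic was the one the user spent the most time on.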
Effort Reduction
20% Offline
Does the response organize and synthesize beyond what the user already has?
The response must add structure that the raw browsing history lacks. A good response turns 10 scattered product pages into a comparison table. A bad response just lists the same URLs the user already visited. The key question: did the Copilot do work the user would otherwise have to do themselves? How to measure: Expert rating 1-5. Score 1 = "just a list of links I already visited." Score 5 = "organized my research in a way that saves me 20+ minutes." LLM judge can assess structural transformation (e.g., did it create comparisons, summaries, or synthesis from raw pages?).
Actionability
15% Offline
Are there concrete, clickable next steps — not vague suggestions?
The response should enable immediate action. For shopping: direct links to buy, with prices. For research: specific follow-up questions or sources to check next. For trip planning: bookable options with dates. "You might want to look into this further" is a failure. How to measure: Count actionable elements (links, specific recommendations, next steps with clear targets). Binary: does the response contain at least one concrete action the user can take without going back to search? Scenario-dependent — a learning Journey may have lower actionability expectations than a shopping one.
Scenario Fit
10% Offline
Is the response format appropriate for the type of task?
The modality of help must match the user's scenario. Shopping → comparison table, not essay. Learning → synthesis with sources, not a link dump. Getting things done → direct shortcut, not a long explanation. How to measure: Classify the response format (comparison, summary, link list, step-by-step, narrative) and compare against expected format for the classified scenario type. Mismatch = fail.
🛒 Shopping
✓ Comparison table, real prices, buy links
✗ Generic paragraph, stale prices, no links
📚 Learning
✓ Synthesis, progressive depth, cited sources
✗ List of blue links already visited
⚡ Get Things Done
✓ Pre-filled action, one-click next step
✗ Info dump when user needed a button
✈️ Trip Planning
✓ Itinerary-aware options, bookable links
✗ Random travel blog articles
🔧 Troubleshooting
✓ Step-by-step fix, device-specific
✗ Generic FAQ from unrelated product
🔍 Exploring
✓ Curated inspiration, broader options
✗ Premature narrowing to one answer

Layer 4 — Outcome Quality

End-to-end: did the full Journey — from card to landing — actually move the user forward?

Self-Sufficiency
60% Offline
Could the user complete their goal from this response alone, without going back to search?
This is the ultimate offline proxy for value delivery. If a user was comparing headphones, can they make a purchase decision from the landing content? If they were planning a trip, can they book from here? How to measure: Expert rating 1-5. "Given this landing content and the user's ground-truth goal, how likely is the user to make progress without leaving?" Score 1 = "useless, would start over." Score 5 = "could complete the task right here." This is the single offline metric most predictive of real-world satisfaction.
End-to-End Coherence
40% Cross-Layer
Does the full chain — intent classification → card title → landing content — tell a consistent story?
Each layer must reinforce the others. If L1 identifies "shopping for headphones," the card title should say "headphones," and the landing should show headphone comparisons — not earbuds or speakers. Inconsistency anywhere in the chain confuses users. How to measure: Extract the primary topic/entity from each layer and check alignment. Any layer that introduces a different topic or contradicts another layer = coherence failure.
Engagement Signals
Online
Did the user engage with the landing? Continue the task? Complete a transaction?
Behavioral signals like click-through from card, dwell time on landing, downstream task completion, and return visits. These are the ground truth for value delivery — but only available in production with real users. Not measurable offline. Included here to mark the boundary of what offline eval can and cannot capture.
Longitudinal Trust
Online
Does the user click Journey cards more or less over time?
The ultimate success metric for Journeys as a product. If card CTR increases over weeks, the value chain is working. If it declines, the promise-delivery gap is eroding confidence. Not measurable offline. This is the north-star online metric that the offline eval framework aims to predict.

Failure Mode Taxonomy

Categorical failures flagged separately — any critical failure blocks ship regardless of composite score

Failure Mode | Layer | Description | Severity
Wrong Topic | L1 | Surfaced an irrelevant or low-priority topic | 🔴 Critical
Should Not Generate | L1 | Generated a card for completed, trivial, or expired task | 🔴 Critical
Over-Triggering | L1 | System generates Journeys too aggressively, surfacing cards for sessions that don't warrant them (noise, trivial, low-signal) | 🔴 Critical
Under-Triggering | L1 | System fails to generate a Journey for a session with clear, active research intent; user misses help they should have received | 🟡 High
Privacy Violation | L1 | Surfaced sensitive/embarrassing topic on NTP | 🔴 Ship-Blocker
Hallucination | L3 | Response claims not supported by any browsed page | 🔴 Critical
Promise-Delivery Gap | L2→L3 | Card says one thing, landing shows another | 🔴 Critical
Over-Promise | L1→L3 | CTA implies capability the landing can't deliver | 🔴 Critical
Wrong Help Type | L3 | Correct topic but wrong format (info dump for action task) | 🟡 High
Missing Key Source | L1 | Important browsed page not included in related pages | 🟡 High
Stale Content | L3 | Prices, availability, or facts are outdated | 🟡 High
No Actionability | L3 | Correct info but no concrete next step | 🟡 High
Vague Card | L2 | Title/CTA too generic to set expectations | 🟢 Medium
Redundant | L3 | Shows info user already encountered, with no synthesis | 🟢 Medium

Composite Score

How layer scores combine — multiplicative, not additive

Journey Quality = L1(Selection) × L2(Card) × L3(Landing | Scenario) × L4(Outcome)

Multiplicative design. If any layer is zero (critical failure), the entire Journey scores zero. A wrong-topic card with a perfect landing is still a total miss. A perfect card with a hallucinating landing destroys trust. One broken link in the chain and the value is lost.
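A sketch of how the multiplicative composite could be computed from the per-dimension weights listed in each layer. Several things here are assumptions rather than part of the framework: dimension scores are normalized to [0, 1], each layer score is the weighted sum of its dimensions, the pass/fail gates (Trigger Precision, Safety & Privacy) zero the result when they fail, and the product is scaled by 100 to sit on the same range as the result thresholds.

```python
import math
from typing import Mapping

# Weights copied from the layer definitions; dimension key names are illustrative.
LAYER_WEIGHTS = {
    "L1": {"relevance": 0.30, "groundedness": 0.25,
           "helpfulness": 0.25, "technical_feasibility": 0.20},
    "L2": {"clarity": 0.50, "promise_accuracy": 0.50},
    "L3": {"correctness": 0.30, "completeness": 0.25, "effort_reduction": 0.20,
           "actionability": 0.15, "scenario_fit": 0.10},
    "L4": {"self_sufficiency": 0.60, "end_to_end_coherence": 0.40},
}


def layer_score(layer: str, dims: Mapping[str, float]) -> float:
    """Weighted sum of a layer's dimension scores, each assumed to be in [0, 1]."""
    return sum(weight * dims[name] for name, weight in LAYER_WEIGHTS[layer].items())


def journey_quality(scores: Mapping[str, Mapping[str, float]],
                    gates_passed: bool = True) -> float:
    """Product of the four layer scores, scaled to 0-100 (scaling is an assumption)."""
    if not gates_passed:
        return 0.0  # a failed pass/fail gate zeroes the whole Journey
    product = math.prod(layer_score(layer, scores[layer]) for layer in LAYER_WEIGHTS)
    return 100.0 * product


# Illustrative data only: every dimension scores 0.9.
uniform = {layer: {dim: 0.9 for dim in dims} for layer, dims in LAYER_WEIGHTS.items()}
quality = journey_quality(uniform)  # 0.9 ** 4 * 100, roughly 65.6

# A landing layer that scores zero wipes out the composite.
broken = dict(uniform, L3={dim: 0.0 for dim in LAYER_WEIGHTS["L3"]})
broken_quality = journey_quality(broken)
```

Under this scaling, four layers at 0.9 each multiply out to about 65.6, so the product penalizes uniformly "good but not great" chains much more than an average would.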

Result | What It Means | Action
✅ Pass (≥ 75) | Journey delivers end-to-end value for this scenario | Ship-ready
⚠️ Partial (50-74) | Value exists but with meaningful gaps | Identify weakest layer, targeted fix
✗ Fail (< 50) | Journey does not deliver value | Root-cause by layer before ship
🚫 Critical | Any critical failure flag (hallucination, privacy, broken promise) | Blocks ship, fix regardless of score
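The result bands can be expressed as a small decision function. This sketch assumes a composite score on a 0-100 scale and treats any non-empty list of critical failure flags as overriding the score. The function name and flag strings are illustrative.

```python
from typing import Sequence


def verdict(score: float, critical_flags: Sequence[str] = ()) -> str:
    """Map a 0-100 composite score and critical failure flags to a result band."""
    if critical_flags:
        return "critical"  # blocks ship regardless of score
    if score >= 75:
        return "pass"
    if score >= 50:
        return "partial"   # covers fractional scores between 74 and 75 as well
    return "fail"


# Illustrative usage only.
examples = [
    verdict(80.0),
    verdict(74.5),
    verdict(49.9),
    verdict(90.0, ["hallucination"]),
]
```

Checking the flags first mirrors the rule that a critical failure blocks ship even when the composite score would otherwise pass.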

Evaluation Pipeline

Human-first calibration → LLM-as-Judge at scale

1. Golden Set: Build card + landing pairs from eval dataset with expert annotations per layer
2. Human Rating: 2 raters per example, adjudication for disagreements, detailed rationale
3. LLM Judge Calibration: Calibrate against human labels, target Cohen's κ ≥ 0.7 per dimension
4. Scale: LLM judge on full corpus, 10% human spot-check for drift
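For the calibration step, agreement between human labels and LLM judge labels can be checked with Cohen's κ. The sketch below implements the standard unweighted form, κ = (p_o − p_e) / (1 − p_e), for categorical labels such as pass/fail. The example labels are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa between two raters over the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of examples")
    n = len(labels_a)
    # Observed agreement: share of examples where both raters gave the same label.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: 1 = pass, 0 = fail, one disagreement out of ten.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(human, judge)
calibrated = kappa >= 0.7
```

In this example observed agreement is 0.9 and chance agreement is 0.5, giving κ = 0.8. For the 1-5 ordinal dimensions, a weighted variant of κ may be a better fit, since it distinguishes near misses from large disagreements.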

Per-example evaluation record: browsing_history → ground_truth_intent → card → landing_page → L1 scores → L2 scores → L3 scores (scenario-conditional) → L4 scores → failure_flags → rationale
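The record above could be represented as a simple data structure. The field names follow the chain listed in the text, while the field types, the `scenario` field, and the example values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EvalRecord:
    browsing_history: List[str]      # visited URLs, in order
    ground_truth_intent: str
    card: Dict[str, str]             # e.g. title and CTA text
    landing_page: str                # landing response text
    scenario: str                    # assumed field: conditions the L3 scores
    l1_scores: Dict[str, float] = field(default_factory=dict)
    l2_scores: Dict[str, float] = field(default_factory=dict)
    l3_scores: Dict[str, float] = field(default_factory=dict)
    l4_scores: Dict[str, float] = field(default_factory=dict)
    failure_flags: List[str] = field(default_factory=list)
    rationale: str = ""


# Illustrative record only; URLs and scores are made up.
record = EvalRecord(
    browsing_history=["https://shop.example/headphones-a", "https://shop.example/headphones-b"],
    ground_truth_intent="Compare noise-cancelling headphones",
    card={"title": "Noise-cancelling headphones", "cta": "Continue researching"},
    landing_page="Comparison of the two models the user viewed...",
    scenario="shopping",
    l2_scores={"clarity": 1.0, "promise_accuracy": 0.75},
    failure_flags=["vague_card"],
    rationale="CTA is generic relative to the comparison intent.",
)
serialized = json.dumps(asdict(record), ensure_ascii=False)
```

Keeping one flat, serializable record per example makes it straightforward to store human and LLM judge ratings side by side for the spot-check step.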

Open Questions for Future Evaluation

Areas we've identified but haven't yet built metrics for — tracked here for future framework updates

Journey Evolution
Offline
Does a Journey evolve correctly as the user browses more pages?
A Journey created after 5 pages should update its title, CTA, and related pages as the user continues browsing. If a user starts comparing hotels and then narrows to one specific hotel, the Journey should reflect the progression — not stay frozen at the initial broad search. Key questions: Does the topic sharpen over time? Are new high-signal pages incorporated? Does the CTA update to match the user’s current stage (e.g., from “Compare options” to “Book this hotel”)? Not yet measurable: Current eval dataset captures single-snapshot sessions. Requires longitudinal session data with multiple Journey generation checkpoints.
Expiration Accuracy
Offline
Does the system correctly predict when a Journey should expire?
Journeys have a shelf life. A “weekend trip to Portland” Journey should expire after the weekend. A “compare headphones” Journey should expire after the user purchases one. A “prepare for Monday’s presentation” Journey is worthless on Tuesday. Key questions: Does the system detect task completion signals (e.g., purchase confirmation page)? Does it respect temporal deadlines embedded in the intent? Does it avoid showing stale Journeys that have outlived their usefulness? Not yet measurable: Requires time-aware eval data with explicit expiration signals and post-expiration browsing behavior.
Timeliness
Offline
Is the Journey card surfaced at the right moment — not too early, not too late?
A Journey surfaced after 2 pages of browsing has too little signal to be useful. A Journey surfaced 3 days after the user finished their research is too late. The ideal moment is when the user has enough context for a meaningful card but hasn’t yet completed the task. Key questions: What’s the minimum browsing signal needed for a high-quality Journey? How quickly after a session does the card need to appear to still be relevant? Is there a “golden window” for each scenario type? Not yet measurable: Requires real-time generation timing data and user feedback on perceived timeliness.

What's Next

From dataset to full evaluation pipeline

Phase | Deliverable | Status
Phase 1: Evaluation Dataset | 64 L1 tasks + 10 L2 profiles with ground truth | ✅ Done
Phase 2: Evaluation Method | 4-layer quality framework with measurable dimensions | ✅ Done
Phase 3: Journey Generation | Run Journeys API on dataset, collect card + landing pairs | ⬜ Next
Phase 4: Expert Labeling | PM reviews 15-20 pairs with 4-layer rubric + failure flags | ⬜ Planned
Phase 5: LLM Judge + Regression | Calibrated LLM evaluator, automated on every model/prompt change | ⬜ Planned