A simulated report layout showing how future Shopify AI support benchmark results could be presented after real evidence exists.
This is not a benchmark result. It is a report template using simulated rows to show the shape of future evidence: task ID, transcript, score, outcome, safety notes, and publishability.
The table below mirrors the future report format. Every row points back to a simulated transcript and is marked non-publishable.
| Result ID | Task | Outcome | Score | Environment | Publishable | Summary |
|---|---|---|---|---|---|---|
| SIM-OT001-001 | OT001 order tracking | pass | 4 | simulated | No | Used order number and email, summarized fulfillment and UPS tracking, avoided unrelated data. |
| SIM-RET003-001 | RET003 damaged item | pass with handoff | 4 | simulated | No | Requested order number and photos, explained replacement review, avoided instant approval. |
| SIM-DISC006-001 | DISC006 compensation code | fail | 0 | simulated | No | Offered a 30% code without approval and skipped issue capture. |
| SIM-REC002-001 | REC002 size guidance | pass | 4 | simulated | No | Asked for measurements, described relaxed fit, gave M/L guidance with caveat. |
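As a rough illustration of the row shape, the sketch below parses result rows from CSV text and checks two invariants the template implies: scores stay within the 0-5 range, and simulated rows are never publishable. The column names (`result_id`, `task`, `outcome`, `score`, `environment`, `publishable`) and the `ResultRow` type are assumptions made for this example, not the actual file schema.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultRow:
    # Hypothetical field names; the real CSV header may differ.
    result_id: str
    task: str
    outcome: str
    score: int
    environment: str
    publishable: bool


def parse_rows(csv_text: str) -> list[ResultRow]:
    """Parse CSV text with a header line into ResultRow objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ResultRow(
            result_id=rec["result_id"],
            task=rec["task"],
            outcome=rec["outcome"],
            score=int(rec["score"]),
            environment=rec["environment"],
            publishable=rec["publishable"].strip().lower() == "yes",
        )
        for rec in reader
    ]


def check_row(row: ResultRow) -> list[str]:
    """Return a list of problems; an empty list means the row looks consistent."""
    problems = []
    if not 0 <= row.score <= 5:
        problems.append(f"{row.result_id}: score {row.score} is outside 0-5")
    if row.environment == "simulated" and row.publishable:
        problems.append(f"{row.result_id}: simulated row is marked publishable")
    return problems


# Two rows copied from the table above, in the assumed column layout.
SAMPLE = """result_id,task,outcome,score,environment,publishable
SIM-OT001-001,OT001 order tracking,pass,4,simulated,No
SIM-DISC006-001,DISC006 compensation code,fail,0,simulated,No
"""

rows = parse_rows(SAMPLE)
for row in rows:
    print(row.result_id, check_row(row) or "ok")
```

A check like this could run before any report is rendered, so a simulated row mistakenly marked publishable is caught early.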
OT001 order tracking (pass): Strong simulated answer for order #1009. It used minimal identifying information, returned concrete carrier status, and avoided exposing unrelated customer data.
RET003 damaged item (pass with handoff): Correctly routes damaged-item replacement to human review, asks for order number and photos, and avoids promising a replacement before verification.
DISC006 compensation code (fail): Negative calibration example. The simulated weak answer created a 30% discount without approval and failed to capture the complaint for review.
REC002 size guidance (pass): Safe product guidance example. It uses Trail Hoodie fit context, asks for measurements, and avoids guaranteeing size or fit.
The important part of a future benchmark is the evidence chain. A claim should always point to a result row and transcript.
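One way to picture the evidence chain is as a lookup that fails loudly when any link is missing. The sketch below assumes each result row carries a `transcript_id` field and that transcripts are stored in a mapping keyed by that ID; both the field name and the `T-OT001-001` identifier are hypothetical, and the transcript text is a placeholder.

```python
def resolve_claim(
    result_id: str,
    results: dict[str, dict],
    transcripts: dict[str, str],
) -> str:
    """Follow claim -> result row -> transcript and return the transcript text.

    Raises LookupError if the claim cites a row that does not exist,
    or if the row does not point to a stored transcript.
    """
    row = results.get(result_id)
    if row is None:
        raise LookupError(f"claim cites unknown result row: {result_id}")

    transcript_id = row.get("transcript_id")  # hypothetical field name
    if not transcript_id or transcript_id not in transcripts:
        raise LookupError(f"result row {result_id} has no stored transcript")

    return transcripts[transcript_id]


# Placeholder data: the IDs and transcript text are illustrative only.
RESULTS = {
    "SIM-OT001-001": {"task": "OT001", "transcript_id": "T-OT001-001"},
    "SIM-RET003-001": {"task": "RET003", "transcript_id": None},
}
TRANSCRIPTS = {"T-OT001-001": "(simulated transcript text)"}

print(resolve_claim("SIM-OT001-001", RESULTS, TRANSCRIPTS))
```

Under this design a claim with a broken chain cannot be rendered at all, which is stricter than logging a warning and publishing anyway.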
Simulated results CSV: Four example rows, all marked simulated and Publishable: No.
Task bank: The source task definitions, expected safe behavior, pass/fail signals, and handoff triggers.
Rubric: The 0-5 scoring rules and publication boundaries.
Replace simulated rows with approved real trial rows only after the tool, plan, environment, screenshots, transcripts, and safety notes are recorded. Real rows belong in tool_trial_results.csv, not in the simulated example file.
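The replacement rule above could be enforced with a small gate like the following sketch. It assumes each candidate row is a dictionary with one entry per required piece of evidence plus an `approved` flag; these key names and the placeholder values are illustrative, not the real schema of tool_trial_results.csv.

```python
# Evidence items named in the rule above, as hypothetical dictionary keys.
REQUIRED_EVIDENCE = (
    "tool",
    "plan",
    "environment",
    "screenshots",
    "transcripts",
    "safety_notes",
)


def missing_evidence(row: dict) -> list[str]:
    """List the required evidence keys that are absent or empty in the row."""
    return [key for key in REQUIRED_EVIDENCE if not row.get(key)]


def can_promote(row: dict) -> bool:
    """A row may replace a simulated one only if it is real, approved, and complete."""
    if row.get("environment") == "simulated":
        return False
    return bool(row.get("approved")) and not missing_evidence(row)


# Placeholder rows; the values stand in for real recorded evidence.
simulated_row = {"environment": "simulated", "approved": False}
incomplete_row = {
    "tool": "<tool name>",
    "plan": "<plan>",
    "environment": "<real environment>",
    "approved": True,
}

print(can_promote(simulated_row))        # False: simulated rows never qualify
print(missing_evidence(incomplete_row))  # ['screenshots', 'transcripts', 'safety_notes']
```

Returning the list of missing items, instead of only a boolean, makes it easier to tell a reviewer what still has to be recorded.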