AI-Assisted Review Workstation
Case Study

Designing AI-Assisted Review for High-Stakes Exam Integrity

How I designed a human-in-the-loop review workstation that helped a global enterprise shift from costly live proctoring to scalable AI-assisted record-and-review - cutting a 3-hour manual review to under 10 minutes per session while maintaining 98% decision agreement.

AI/ML UX · Human-in-the-Loop · Decision Support · Explainability · Usability Testing · Figma
Role & Responsibilities
Senior Product Designer - end-to-end UX strategy and interaction design
Product discovery, workflow design, prototyping, usability testing, cross-functional collaboration
Partnered with Product, Engineering, AI/ML, Operations, QA, and review teams
Users & Scope
Primary: Session reviewers - the core users making high-stakes decisions daily
Secondary: Review managers, QA trainers, and security investigators
Scope: Dashboard, risk-ranked queue, video review workstation, evidence packaging, escalation workflows, manager views
High-Level Impact
Enabled a strategic shift from costly live proctoring to a scalable record-and-review model
Reduced time to decision from 3+ hours to under 10 minutes per session
Achieved 4.5/5 user satisfaction in usability testing with reviewers, managers, and investigators
10×
Faster review vs. live proctoring
98%
Reviewer-investigator agreement
4.5 / 5
Usability satisfaction rating

The shift

A strategic product shift that required a new kind of review experience.

The business was shifting from a high-cost live proctoring model - where human proctors watched every exam in real time - to a scalable record-and-review approach where AI would analyze session recordings, flag potential infractions, and rank sessions by risk for human review.

This wasn't simply "adding AI to the workflow." It was designing a fundamentally new decision-support system where reviewers needed to understand, trust, and act on model-assisted signals - while maintaining the integrity standards required for high-stakes professional certifications.

I led product design for the end-to-end review workstation: from the risk-ranked dashboard through evidence review, decision workflows, escalation paths, and manager oversight views.

AI-assisted review workstation overview.

Where the model alone falls short

Manual review doesn't scale. AI flags alone aren't sufficient for high-stakes decisions.

The existing model had reviewers watching hours of exam video manually - an approach that couldn't scale with growing volume and created significant cost pressure. But the alternative (fully automated decisions) was unacceptable for high-stakes certifications where a wrong call could end someone's career or compromise program integrity.

The design challenge was finding the right balance: how do you make AI signals actionable without making the reviewer a rubber stamp? Raw AI flags without supporting evidence can introduce automation bias - where reviewers defer to the model instead of exercising judgment. The system needed to increase throughput dramatically while actually strengthening decision quality and auditability.

Understanding the reviewer

Understanding how reviewers make decisions - and where the workflow breaks.

The discovery process centered on understanding the reviewer's mental model: how they evaluate evidence, what signals give them confidence in a decision, where manual review creates cognitive fatigue, and what "audit-ready" actually means in practice.

Key discovery areas included:

  • Decision anatomy: What evidence do reviewers need to reach a clear/escalate/revoke decision? What makes a decision feel defensible?
  • Signal trust: When AI flags an infraction (candidate not present, third party detected, prohibited object), what context does a reviewer need to agree or disagree?
  • Role differences: Reviewers, managers, and investigators each need different views of the same data - triaging, oversight, and deep investigation respectively.
  • Escalation patterns: What happens when a reviewer finds genuine misconduct? How does the decision chain work from flag to case to outcome?

Investigator workflow screens.

Principles I held to

Five principles that shaped every decision.

  • Evidence first, not score first: Surface the "why" behind every AI flag - timestamped video, screenshots, event descriptions - so reviewers evaluate proof, not just confidence numbers.
  • Keep the reviewer in control: Structured decision options (did not occur / occurred, not escalated / occurred, escalated) that require explicit human judgment at every step.
  • Reduce cognitive load without removing judgment: Risk-ranked queues and pre-surfaced evidence so reviewers spend attention where it matters most.
  • Design for auditability: Every decision traceable - who reviewed, when, what evidence was available, what outcome was chosen.
  • Support multiple roles: The same underlying data, presented differently for reviewers (triage and decide), managers (oversight and activity reporting), and investigators (deep evidence review).

Triage-based review workflow.

Six decisions that shaped the workstation

1. Risk-ranked dashboard with session queue. Sessions are ranked by AI risk score (highest first), so reviewers address the most critical cases first. The dashboard also surfaces data deletion timelines, enabling operational flexibility - reviewers can pull a lower-risk session if its data is expiring soon.

2. Evidence-first video review page. The review page presents video alongside AI-flagged events with timestamps, event descriptions, and screenshot evidence - all on a single screen. Reviewers praised the layout: everything needed to make a decision is visible without navigation. Clicking a flagged event timestamp jumps directly to that moment in the video.

3. Structured decision workflow. Each AI-flagged event requires an explicit decision: "did not occur," "occurred, not escalated," or "occurred, escalated." These structured options replace free-text judgment, supporting consistency across reviewers and enabling downstream telemetry.

4. Visual decision state tracking. Usability testing revealed that reviewers wanted to see at a glance which events had been reviewed, which had not, and which had triggered an escalation. I designed a distinct visual state for each decision outcome, so review progress can be scanned without re-reading event details.

5. Escalation and completion workflow. The mark-complete flow confirms session status (no escalation needed vs. escalation required), handles case number assignment, and manages artifact downloads for investigators.

6. Manager and investigator views. Managers see activity across their review team, with filtering by client, reviewer, and date range. Investigators access the full evidence package for escalated sessions.
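
To make the queue behaviour in decision 1 concrete, here is a minimal sketch of risk-ranked ordering with a separate "expiring soon" view. It is an illustration rather than the production logic: the `Session` fields, the 0-1 score range, and the two-day window are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Session:
    session_id: str
    risk_score: float       # assumed 0.0-1.0 model score (hypothetical field)
    delete_after: datetime  # assumed data-retention deadline (hypothetical field)


def rank_queue(sessions):
    """Order sessions by risk score, highest first."""
    return sorted(sessions, key=lambda s: s.risk_score, reverse=True)


def expiring_soon(sessions, now, window=timedelta(days=2)):
    """Return sessions whose data is deleted within the window, soonest first."""
    due = [s for s in sessions if now <= s.delete_after <= now + window]
    return sorted(due, key=lambda s: s.delete_after)


now = datetime(2024, 1, 10, 9, 0)
sessions = [
    Session("A", 0.35, now + timedelta(days=1)),
    Session("B", 0.92, now + timedelta(days=20)),
    Session("C", 0.61, now + timedelta(days=9)),
]

queue_ids = [s.session_id for s in rank_queue(sessions)]
expiring_ids = [s.session_id for s in expiring_soon(sessions, now)]
print(queue_ids)     # ['B', 'C', 'A']
print(expiring_ids)  # ['A']
```

Keeping the expiry view separate from the ranking leaves the default order purely risk-driven, while still letting a reviewer pull session A early because its data is about to be deleted.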

Decision workflow with structured event decisions.
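
The structured decision and completion logic described in decisions 3 to 5 can be sketched as a small data model. This is a hedged illustration, not the shipped implementation: the three decision labels come from the workflow above, while `FlaggedEvent`, `decide`, `session_status`, and the "in review" label are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Decision(Enum):
    DID_NOT_OCCUR = "did not occur"
    OCCURRED_NOT_ESCALATED = "occurred, not escalated"
    OCCURRED_ESCALATED = "occurred, escalated"


@dataclass
class FlaggedEvent:
    event_id: str
    description: str
    video_offset_s: int                    # where the flag sits in the recording
    decision: Optional[Decision] = None    # None means "not yet reviewed"
    reviewer: Optional[str] = None
    decided_at: Optional[datetime] = None

    def decide(self, decision, reviewer):
        """Record an explicit decision plus who made it and when (audit trail)."""
        if not isinstance(decision, Decision):
            raise ValueError("decision must be one of the structured options")
        self.decision = decision
        self.reviewer = reviewer
        self.decided_at = datetime.now(timezone.utc)


def session_status(events):
    """Derive the mark-complete status from the per-event decisions."""
    if any(e.decision is None for e in events):
        return "in review"
    if any(e.decision is Decision.OCCURRED_ESCALATED for e in events):
        return "escalation required"
    return "no escalation needed"


events = [
    FlaggedEvent("e1", "candidate not present", 754),
    FlaggedEvent("e2", "third party detected", 3120),
]
status_before = session_status(events)
events[0].decide(Decision.DID_NOT_OCCUR, reviewer="reviewer-01")
events[1].decide(Decision.OCCURRED_ESCALATED, reviewer="reviewer-01")
status_after = session_status(events)
print(status_before)  # in review
print(status_after)   # escalation required
```

Because each event stores exactly one of three enumerated outcomes, the same field can drive the visual state on screen, the session-level status, and any downstream reporting, and free-text input is rejected rather than silently stored.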

Validating with real reviewers

Usability testing with real reviewers, managers, and investigators.

I ran remote usability testing with six participants - experienced session reviewers, review managers, and investigators. Participants completed core tasks: viewing the dashboard, reviewing AI-flagged events, adding session events, revoking sessions, clearing sessions, and viewing completed reviews.

Results:

  • Dashboard: 4.6 / 5 satisfaction - participants found the layout clean, clear, and containing the information they needed.
  • Video Review Page: 4.3 / 5 satisfaction - users were positive about the flagged event layout, evidence accessibility, and single-page review experience.
  • Overall: 4.5 / 5 satisfaction - participants looked forward to having this functionality in their daily workflow.

The testing surfaced actionable refinements: clearer visual states for reviewed vs. unreviewed events, video skip controls, timestamp-to-video linking, selective photo downloads for teams with PII restrictions, and manager-specific columns for reviewer attribution.

What it unlocked

From 3-hour sessions to 10-minute reviews - with stronger decisions.

The workstation enabled the business to shift from a costly live-proctoring-only model to a scalable record-and-review approach - reducing operational costs while maintaining the integrity standards required for high-stakes certifications.

  • 10× throughput improvement compared with manual review, making the record-and-review model operationally viable at scale.
  • 98% agreement between reviewers and investigators, demonstrating that the evidence-first design supported consistent, defensible decisions.
  • 4.5/5 usability satisfaction from reviewers, managers, and investigators during validation testing.
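
For readers unfamiliar with agreement metrics, one simple way to read a figure like the 98% above is as percent agreement: the share of sessions where the reviewer and the investigator reached the same outcome. The sketch below assumes that definition and uses made-up toy data; it does not reproduce the project's actual measurement or dataset.

```python
def agreement_rate(reviewer_decisions, investigator_decisions):
    """Share of jointly decided sessions where both roles chose the same outcome."""
    shared = reviewer_decisions.keys() & investigator_decisions.keys()
    if not shared:
        raise ValueError("no sessions were decided by both roles")
    matches = sum(
        reviewer_decisions[s] == investigator_decisions[s] for s in shared
    )
    return matches / len(shared)


# Toy data for illustration only: five sessions, one disagreement.
reviewer = {"s1": "clear", "s2": "escalate", "s3": "clear", "s4": "clear", "s5": "escalate"}
investigator = {"s1": "clear", "s2": "escalate", "s3": "escalate", "s4": "clear", "s5": "escalate"}

rate = agreement_rate(reviewer, investigator)
print(f"{rate:.0%}")  # 80%
```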

The system also unlocked new business flexibility: clients could configure how sessions were reviewed - percentage-based sampling, risk-threshold-based review, or full review - enabling participation in markets that required a lower-cost solution without compromising security.
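
The three client-configurable review modes can be illustrated with a short sketch. Everything here is an assumption made for the example: the `ReviewPolicy` fields, the mode names, and representing a session as an `(id, risk_score)` pair.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPolicy:
    mode: str                    # "full", "risk_threshold", or "sample" (hypothetical names)
    risk_threshold: float = 0.0  # used only by "risk_threshold"
    sample_rate: float = 1.0     # fraction between 0 and 1, used only by "sample"


def select_for_review(sessions, policy, rng=None):
    """Pick session ids for human review; sessions is a list of (id, risk_score)."""
    if policy.mode == "full":
        return [sid for sid, _ in sessions]
    if policy.mode == "risk_threshold":
        return [sid for sid, score in sessions if score >= policy.risk_threshold]
    if policy.mode == "sample":
        rng = rng or random.Random()
        k = round(len(sessions) * policy.sample_rate)
        return [sid for sid, _ in rng.sample(sessions, k)]
    raise ValueError(f"unknown review mode: {policy.mode}")


sessions = [("s1", 0.91), ("s2", 0.12), ("s3", 0.67), ("s4", 0.40)]

full = select_for_review(sessions, ReviewPolicy("full"))
risky = select_for_review(sessions, ReviewPolicy("risk_threshold", risk_threshold=0.6))
sampled = select_for_review(
    sessions, ReviewPolicy("sample", sample_rate=0.5), rng=random.Random(7)
)
print(full)          # ['s1', 's2', 's3', 's4']
print(risky)         # ['s1', 's3']
print(len(sampled))  # 2
```

Treating the mode as per-client configuration means the reviewer-facing workstation stays the same; only the set of sessions that reaches the queue changes.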

What I took from this

The design contract in AI systems is explainability - not the model.

The model does the detection. The design makes it trustworthy. The hardest part of this project wasn't the interface - it was earning reviewer trust in a system where wrong decisions have real consequences for people's careers.

What I'd explore further: more sophisticated visual confidence calibration, richer escalation handoff context between reviewers and investigators, and longitudinal patterns across sessions, so reviewers can see whether the same flags recur across a candidate's multiple exams.

This project reinforced my belief that in human-AI systems, the senior design work isn't making the AI seem smart - it's making the human feel confident, informed, and supported enough to make the right call.

Outcomes
10×
Review throughput vs. manual
98%
Reviewer-investigator agreement
4.5 / 5
Usability satisfaction