TL;DR
An NYU instructor deployed an ElevenLabs-based voice agent to run two-part oral exams after discovering students relied on LLMs for take-home work. The automated system handled authentication, structured questioning, and multi-model grading for 36 students at a total platform cost of about $15, but required prompt and workflow fixes to address an intimidating voice tone, stacked multi-part questions, and nonrandom case selection.
What happened
After noticing unusually polished pre-case submissions, the instructor began cold-calling students and found that many could not defend their own work. To recreate real-time assessment at scale, the instructor then built a conversational examiner using ElevenLabs Conversational AI and organized the exam into a two-part oral format: a project walkthrough and a case discussion. The deployment used a workflow of smaller, focused sub-agents (authentication, project probing, case questioning) with per-student parameters injected into their prompts. Thirty-six students were examined over nine days, with exams averaging 25 minutes and roughly 65 agent-student messages each. A three-model grading council (Claude, Gemini, and OpenAI models) produced grades, and an iterative deliberation step significantly increased inter-model agreement. Total platform cost was about $15 (≈ $0.42 per student). Several operational issues surfaced and were fixed through prompt rules, timing changes, and deterministic randomization of case selection.
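The post describes the sub-agent workflow only at a high level. A minimal sketch of that pattern, with per-student parameters injected into prompt templates before the exam starts, might look like the following; the Student fields, prompt wording, and the build_exam_prompts helper are illustrative assumptions, not the author's actual configuration.

```python
# Illustrative sketch (not the author's code): a workflow of focused sub-agents,
# each with its own system prompt, and per-student parameters injected into
# those prompts before the exam begins.

from dataclasses import dataclass

@dataclass
class Student:
    student_id: str
    name: str
    project_title: str
    assigned_case: str

# Each sub-agent handles one narrow task instead of one monolithic prompt.
SUB_AGENT_PROMPTS = {
    "authentication": (
        "Greet the student, ask for their student ID, and confirm it matches "
        "{student_id} for {name} before handing off to the next agent."
    ),
    "project": (
        "Ask {name} to walk through their project '{project_title}'. Ask one "
        "question at a time and probe the reasoning behind each design choice."
    ),
    "case": (
        "Discuss the case '{assigned_case}'. Ask one question at a time; never "
        "stack multi-part questions or repeat a question in paraphrased form."
    ),
}

def build_exam_prompts(student: Student) -> dict[str, str]:
    """Inject this student's parameters into every sub-agent prompt."""
    params = vars(student)
    return {role: template.format(**params) for role, template in SUB_AGENT_PROMPTS.items()}

# Example for one hypothetical student:
prompts = build_exam_prompts(
    Student("N12345678", "Alex", "Demand forecasting dashboard", "Case 7: Pricing at Acme")
)
print(prompts["authentication"])
```

Keeping each sub-agent's prompt short and single-purpose is what lets rules like "ask one question at a time" stay enforceable, rather than getting lost in one long monolithic prompt.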
Why it matters
- Oral exams administered by voice agents can recreate real-time evaluation that take-home assessments no longer reliably provide.
- Automated oral exams can be run at a fraction of the human time and cost required for equivalent live assessments.
- Careful prompt engineering, workflow design, and model calibration are critical to usable, fair student experiences.
- Model-based grading benefits from structured deliberation to reduce disagreement among automated graders.
Key facts
- 36 students were examined over a nine-day period.
- Average exam length was 25 minutes (range: 9–64 minutes); shortest exam (9 min) produced the highest score (19/20).
- Average of 65 messages per conversation.
- Total platform cost reported: $15 (about $0.42 per student).
- Cost breakdown: ~$8 for Claude, $2 for Gemini, $0.30 for OpenAI, and ~$5 for ElevenLabs voice minutes.
- Exam structure: two parts — a project walkthrough and a case discussion drawn from class topics.
- System architecture used multiple focused sub-agents (authentication, project, case) rather than one monolithic agent.
- Initial grading by the three models showed wide disagreement; after a deliberation round, agreement improved markedly, with perfect agreement rising from 0% to 21% (a sketch of the deliberation pattern follows this list).
- Operational issues included intimidating voice tone, stacked multi-part questions, paraphrased repeats, premature interruption during silence, and nonrandom case selection.
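As a rough illustration of the deliberation step reported above, the sketch below has each grading model re-score the exam after reading the other models' scores and rationales. The function names, data shapes, and the generic grader callable are assumptions for illustration, not the author's implementation.

```python
# Illustrative deliberation loop (assumed structure, not the author's code).
# Each "grader" is a callable that takes the exam transcript plus the other
# graders' previous assessments and returns (score, rationale).

from typing import Callable

Grader = Callable[[str, list[tuple[str, float, str]]], tuple[float, str]]

def grade_with_deliberation(
    transcript: str,
    graders: dict[str, Grader],   # e.g. {"claude": ..., "gemini": ..., "openai": ...}
    rounds: int = 1,
) -> dict[str, tuple[float, str]]:
    # Round 0: each model grades independently, seeing only the transcript.
    assessments = {name: grader(transcript, []) for name, grader in graders.items()}

    # Deliberation: each model re-grades after reading the other models' scores
    # and rationales, the step the post credits with improving agreement.
    for _ in range(rounds):
        revised = {}
        for name, grader in graders.items():
            peers = [
                (other, score, rationale)
                for other, (score, rationale) in assessments.items()
                if other != name
            ]
            revised[name] = grader(transcript, peers)
        assessments = revised
    return assessments
```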
What to watch next
- Integration with institutional single sign-on (SSO) for more robust authentication (the author planned this for a productized version).
- Addition of retrieval-augmented generation (RAG) over student submissions and case material so agents can quote and probe exact text.
- A/B testing of voice selection and personality tuning to reduce student anxiety and improve comprehension.
- Not confirmed in the source: whether broader institutional policy or faculty governance responses will shape adoption beyond this pilot.
Quick glossary
- Oral exam: A live, spoken assessment where examinees answer questions in real time to demonstrate understanding and defend decisions.
- Voice agent: A conversational AI system that uses speech-to-text and text-to-speech to interact with users by voice.
- Retrieval-augmented generation (RAG): A technique that supplements a language model's outputs with information retrieved from external documents or databases (a minimal sketch follows this glossary).
- Large language model (LLM): A neural network trained on large corpora of text capable of generating or evaluating human-like language.
- Workflow / sub-agent: A design pattern that decomposes a conversation into specialized modules or agents, each handling a focused task.
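The post proposes RAG over student submissions only as a future improvement and gives no implementation. As a minimal sketch of the idea, the snippet below retrieves the most relevant passages from a submission and prepends them to the probing prompt so the agent can quote exact text; the toy keyword-overlap scorer stands in for the embeddings and vector store a real system would use, and all names are hypothetical.

```python
# Toy RAG sketch over a student's submission (illustrative assumption only).

def split_into_chunks(submission: str, size: int = 400) -> list[str]:
    """Break the submission into fixed-size text chunks."""
    return [submission[i:i + size] for i in range(0, len(submission), size)]

def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    """Rank chunks by naive keyword overlap with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: len(q_terms & set(c.lower().split())), reverse=True)
    return scored[:k]

def build_probing_prompt(submission: str, question: str) -> str:
    """Prepend the most relevant passages so the agent can quote exact text."""
    passages = retrieve(split_into_chunks(submission), question)
    context = "\n---\n".join(passages)
    return (
        "You are conducting an oral exam. Quote from the excerpts below when probing.\n"
        f"Excerpts from the student's own submission:\n{context}\n\n"
        f"Question to explore: {question}"
    )
```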
Reader FAQ
How many students were examined and over what timeframe?
36 students were examined over nine days.
What did the pilot cost?
Total reported cost was about $15, roughly $0.42 per student; line-item costs for models and voice minutes were provided.
How was student identity checked?
An authentication sub-agent asked for student IDs and checked them against a list; the author noted a future integration with NYU SSO as a productized improvement.
Did students find the setup stressful?
Some students reported the cloned voice heightened anxiety and affected performance.
Were humans involved in grading alongside the AI models?
Not confirmed in the source.
Sources
- Fighting Fire with Fire: Scalable Oral Exams
- Oral Exams
- voice ai assessment – UMU
- AI Simulation of Oral Exams, LLM Revolutionizing STEM …