Speaker Diarization Training Data: A 2026 Annotation Guide

Inside the annotation methodology behind speaker diarization training data: RTTM format, overlap handling, VAD handoff, DER targets, and multi-tier QA.

Speaker diarization training data has become a procurement line item for ASR teams shipping anything beyond single-speaker dictation, since meeting summarization and call analytics turned "who spoke when" into a billable feature. Production systems target a Diarization Error Rate (DER) under 10% in 2026. This guide covers how that data is built: annotation methodology, overlapped speech handling, RTTM format, DER measurement, and multi-tier QA. Examples come from AIxBlock's audio and speech data.

What is speaker diarization training data?

Speaker diarization training data is multi-speaker audio paired with time-stamped labels that mark which speaker is talking at every moment in the recording. The labels identify speaker turns (one speaker speaking continuously), speaker change boundaries (where the floor switches), and overlapping speech (where two or more speakers talk simultaneously).

Diarization sits between transcription and speaker identification in the speech pipeline. Transcription asks what was said. Diarization asks who said it, when, and for how long, using anonymous labels such as Speaker A and Speaker B. Speaker identification asks which known person each voice belongs to, which requires either enrolled voiceprints or a closed set of identities.

Production diarization systems use the training data to learn three signals: voice activity (which segments contain speech versus silence), speaker change boundaries (the time points where speaker identity shifts), and speaker embeddings (the acoustic fingerprint that distinguishes one speaker from another across non-contiguous turns).

The data needed for each signal is different. Voice activity needs simple speech-versus-silence labels at frame level. Speaker change needs precise time boundaries at turn switches. Speaker embeddings need enough audio per speaker (typically 5 to 30 seconds) for the model to learn a stable voiceprint.
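As a rough illustration of the frame-level voice activity labels described above, the sketch below expands annotated speech intervals into one 0/1 label per frame. The function name, the 10 ms frame size, and the rule that a frame counts as speech when its start time falls inside an interval are all assumptions made for this example, not a fixed convention.

```python
def intervals_to_frames(intervals_ms, total_ms, frame_ms=10):
    """Expand speech intervals into one 0/1 label per frame.

    intervals_ms: list of (start_ms, end_ms) speech intervals (integers).
    A frame is labeled 1 when its start time lies in [start_ms, end_ms).
    This labeling rule is an assumption for the sketch.
    """
    n_frames = total_ms // frame_ms
    labels = [0] * n_frames
    for start, end in intervals_ms:
        first = -(-start // frame_ms)              # ceil(start / frame_ms)
        last = min(n_frames, -(-end // frame_ms))  # exclusive upper bound
        for i in range(first, last):
            labels[i] = 1
    return labels


# Two speech intervals in a 100 ms clip, labeled in 10 ms frames.
print(intervals_to_frames([(0, 30), (55, 80)], total_ms=100))
```

Integer milliseconds keep the frame arithmetic exact, which avoids off-by-one frames caused by floating-point rounding.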

How is "who spoke when" data actually labeled?

Multi-speaker audio gets labeled in three passes. The first pass marks speech-versus-silence boundaries (VAD). The second pass assigns speaker IDs to each speech segment. The third pass marks overlapped speech regions with multiple speaker tags. Each pass is verified by an independent reviewer before the labels ship.

Pass one

An annotator listens to the recording and marks every speech segment with start and end timestamps in milliseconds. Background noise, silence, and non-speech sounds (laughter, coughs) stay unlabeled. The output is a sequence of speech intervals.

Pass two

The annotator listens again and assigns a speaker ID to each interval. The convention is Speaker_00, Speaker_01, Speaker_02 in order of first appearance. When the same voice returns later, the annotator reuses the existing ID. This is where the difficulty starts: identifying that a voice 47 minutes into a call is the same as Speaker_00 from the opening requires listening attention that most annotators sustain for roughly 20 to 30 minutes before accuracy slides.
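The order-of-first-appearance convention can be enforced mechanically once segments carry any consistent raw label. The sketch below is a minimal illustration; the function name and the (start, end, label) tuple layout are assumptions for this example. It does not solve the hard part described above, which is deciding by ear that two distant segments belong to the same voice.

```python
def relabel_by_first_appearance(segments):
    """Rename raw speaker labels to Speaker_00, Speaker_01, ... in order of first appearance.

    segments: list of (start_s, end_s, raw_label) tuples.
    Returns a new list sorted by start time with normalized labels.
    """
    mapping = {}
    relabeled = []
    for start, end, label in sorted(segments, key=lambda s: (s[0], s[1])):
        if label not in mapping:
            mapping[label] = f"Speaker_{len(mapping):02d}"
        relabeled.append((start, end, mapping[label]))
    return relabeled


raw = [(5.0, 7.0, "agent"), (0.0, 4.0, "customer"), (8.0, 9.0, "customer")]
for segment in relabel_by_first_appearance(raw):
    print(segment)
```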

Pass three

The annotator scans for overlapped speech. Where two speakers talk at the same time, both speaker IDs get assigned to the overlap region. Overlap detection is the hardest part of diarization annotation because the overlapping voices mask each other acoustically, and short overlaps (under 500 ms) are easy to miss.

Production-grade pipelines split these three passes across different annotators rather than asking one person to do all three. The handoff catches errors that single-pass annotation hides.

What's the difference between speaker diarization and speaker identification?

Speaker diarization clusters speakers in a recording without knowing who they are beforehand. Speaker identification matches a voice against a known set of enrolled identities. Diarization outputs Speaker_00, Speaker_01, Speaker_02. Identification outputs "this is Alice, this is Bob."

The distinction matters because the training data requirements are different. Diarization data needs many speakers per recording (typically 2 to 10 in meeting or call audio) and does not require known identity. Identification data needs enrolled voiceprints, typically 30 to 60 seconds of clean speech per known speaker, with metadata linking the voiceprint to the person.

Diarization is the harder problem for unknown audio because the model has to figure out how many speakers exist, when they switch, and which segments belong together. Identification is the harder problem for known sets because the model has to discriminate between similar voices, often across different acoustic conditions.

Production speech systems use both. Diarization runs first on raw audio to segment by speaker. Identification then maps each diarized speaker to a known identity if there is an enrolled voiceprint. Without diarization, identification cannot tell where one speaker stops and another starts. Production-ready call center audio typically arrives diarized at delivery, with optional speaker identification added when the buyer needs known-identity labels.

How is overlapped speech handled in diarization training data?

Overlapped speech (two or more speakers talking simultaneously) is labeled with multiple speaker IDs assigned to the same time region. Production diarization training data typically contains 5 to 15% overlapped speech in call center audio, 10 to 25% in meeting audio. Models that do not see overlapped training data degrade sharply on real-world conversations.
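To check where a dataset falls in those ranges, the share of overlapped speech can be computed from the segment labels. The sketch below uses a simple sweep over segment start and end events. The function name and tuple layout are assumptions for this example, and it assumes a single speaker's own segments never overlap each other.

```python
def overlap_stats(segments_ms):
    """Return (speech_ms, overlap_ms) for a list of (start_ms, end_ms, speaker) segments.

    speech_ms counts time with at least one active speaker,
    overlap_ms counts time with two or more active speakers.
    """
    events = []
    for start, end, _speaker in segments_ms:
        events.append((start, 1))
        events.append((end, -1))
    # At equal times, -1 sorts before +1, so back-to-back turns are not counted as overlap.
    events.sort()

    speech = overlap = active = 0
    prev = None
    for time, delta in events:
        if prev is not None and active >= 1:
            speech += time - prev
            if active >= 2:
                overlap += time - prev
        active += delta
        prev = time
    return speech, overlap


segments = [(0, 4000, "Speaker_00"), (3000, 6000, "Speaker_01"), (6000, 7000, "Speaker_00")]
speech_ms, overlap_ms = overlap_stats(segments)
print(f"overlap share: {overlap_ms / speech_ms:.1%}")
```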

Overlap is the rule, not the exception. People interrupt, agree mid-sentence, finish each other's thoughts, and react. The off-the-shelf (OTS) call center audio catalog at AIxBlock includes both real overlapped speech from call center recordings and simulated overlap from H2H (human-to-human) role-play sessions, for training models that need controlled overlap conditions.

Annotation of overlapped regions follows specific conventions. Each overlapping speaker gets their own time-stamped segment that spans the overlap region. The labels do not merge into a single "crosstalk" tag, because the model needs to learn which voice is which during the overlap. Production diarization toolkits like pyannote use multi-label outputs at each time frame to handle overlap natively.

The 500 ms overlap floor is operationally important. Overlaps shorter than 500 ms are hard for annotators to catch and hard for models to predict. Production teams have a choice: tag every overlap including the brief ones (high annotation cost, marginal model gain) or set a 500 ms threshold and document the choice in the dataset card.
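One way to apply and document such a threshold is to derive overlap regions from the per-speaker segments and drop anything shorter than the floor. This is a minimal sketch under that assumption; the function name and the `min_overlap_ms` parameter are hypothetical, and the pairwise comparison is fine for a single recording but is not optimized for very large segment lists.

```python
from itertools import combinations


def overlap_regions(segments_ms, min_overlap_ms=500):
    """List overlap regions between different speakers that last at least min_overlap_ms.

    segments_ms: list of (start_ms, end_ms, speaker) tuples.
    Returns sorted (start_ms, end_ms, (speaker_a, speaker_b)) tuples.
    """
    regions = []
    for (s1, e1, spk1), (s2, e2, spk2) in combinations(segments_ms, 2):
        if spk1 == spk2:
            continue  # a speaker cannot overlap with themselves
        start, end = max(s1, s2), min(e1, e2)
        if end - start >= min_overlap_ms:
            regions.append((start, end, tuple(sorted((spk1, spk2)))))
    return sorted(regions)


segments = [
    (0, 4000, "Speaker_00"),
    (3000, 6000, "Speaker_01"),
    (5800, 7000, "Speaker_00"),
]
print(overlap_regions(segments))                    # only the 1000 ms overlap survives
print(overlap_regions(segments, min_overlap_ms=1))  # also includes the 200 ms overlap
```

Recording the chosen `min_overlap_ms` value in the dataset card makes the trade-off visible to whoever trains on the data.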

Real-world speech datasets document the overlap patterns in actual call center recordings, which differ noticeably from the artificial overlap rates in research benchmarks like the AMI Meeting Corpus.

What annotation format does diarization use (RTTM)?

Speaker diarization labels ship in Rich Transcription Time Marked (RTTM) format, a space-delimited text format defined by NIST for diarization scoring. Each line marks one speaker segment with file ID, channel, start time, duration, speaker ID, and confidence fields.

A production RTTM line looks like this:

SPEAKER recording_001 1 12.450 3.200 <NA> <NA> Speaker_00 <NA> <NA>

After the SPEAKER type tag, the fields encode the file name (recording_001), channel number (1), start time in seconds (12.450), segment duration in seconds (3.200), and speaker label (Speaker_00). The <NA> placeholders stay for legacy compatibility with NIST scoring tools. A recording produces multiple lines, one per speaker segment.
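A minimal parser for this kind of line might look like the sketch below. The example line is reconstructed from the field values given above using the common ten-field RTTM layout; the `Turn` type and `parse_rttm_line` function are hypothetical names for this illustration, not part of any scoring toolkit.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    file_id: str
    channel: int
    start: float     # seconds
    duration: float  # seconds
    speaker: str


def parse_rttm_line(line):
    """Parse one SPEAKER line of an RTTM file into a Turn."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"not a SPEAKER line: {line!r}")
    # Field positions: 1 = file ID, 2 = channel, 3 = onset, 4 = duration, 7 = speaker name.
    return Turn(fields[1], int(fields[2]), float(fields[3]), float(fields[4]), fields[7])


example = "SPEAKER recording_001 1 12.450 3.200 <NA> <NA> Speaker_00 <NA> <NA>"
turn = parse_rttm_line(example)
print(turn.speaker, turn.start, turn.start + turn.duration)
```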

RTTM survived as the de facto format because the NIST diarization scoring scripts assume it. Custom JSON formats exist but require buyer-side conversion before running standard evaluation tools. Production datasets ship RTTM as the primary format with optional JSON or CTM (Conversation Time Marked) for buyer-specific pipelines.

Overlapped speech in RTTM is encoded as multiple lines that share an overlapping time range. A 2-second overlap between Speaker_00 and Speaker_01 produces two RTTM lines covering the same interval with different speaker IDs.
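The overlap encoding can be seen by writing two intersecting turns out as RTTM. In this sketch the `to_rttm` helper, the file ID, and the timestamps are made up for illustration. Each speaker keeps a full line for their own turn, and the overlap is simply the stretch where the two time ranges intersect (12.0 to 14.0 seconds here).

```python
def to_rttm(file_id, segments, channel=1):
    """Format (start_s, end_s, speaker) segments as RTTM SPEAKER lines."""
    lines = []
    for start, end, speaker in sorted(segments):
        lines.append(
            f"SPEAKER {file_id} {channel} {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


# Speaker_01 starts talking 2 seconds before Speaker_00 finishes.
segments = [(10.0, 14.0, "Speaker_00"), (12.0, 16.0, "Speaker_01")]
for line in to_rttm("recording_001", segments):
    print(line)
```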

Worth noting: speakers should be numbered consistently across the recording. A common annotation error is assigning Speaker_03 to a voice that already appeared as Speaker_00 earlier. Multi-tier QA catches these consistency errors before delivery.

How does Voice Activity Detection relate to diarization labels?

Voice Activity Detection (VAD) finds the speech regions in audio. Diarization assigns speaker IDs to those regions. VAD is a prerequisite step that defines what diarization operates on. Errors in VAD propagate forward: false negatives (missed speech) become missing diarization labels, false positives (silence labeled as speech) become spurious speaker assignments.

VAD operates at frame level (typically 10 to 30 ms windows), with speech-versus-silence as the classification target. Production VAD systems target false-accept and false-reject rates below 5% in call-center conditions, with degradation in noisy environments, over music backgrounds, and on quiet or whispered speech.

The handoff from VAD to diarization is where most pipeline errors enter. If VAD misses a 200 ms backchannel utterance ("uh-huh," "mm-hmm"), the speaker never gets that segment attributed. If VAD marks 800 ms of music as speech, the diarization model has to invent a speaker label for it.
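The two failure modes above can be quantified by comparing frame-level VAD output against reference labels. The sketch below is an illustration only: the function name is hypothetical, both inputs are assumed to be equal-length 0/1 frame lists at 10 ms per frame, and the example clip is constructed to mirror the 200 ms backchannel and 800 ms music cases.

```python
def vad_errors(reference, predicted, frame_ms=10):
    """Return (missed_ms, false_alarm_ms) from two equal-length 0/1 frame label lists."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same number of frames")
    missed = sum(1 for ref, pred in zip(reference, predicted) if ref and not pred)
    false_alarm = sum(1 for ref, pred in zip(reference, predicted) if pred and not ref)
    return missed * frame_ms, false_alarm * frame_ms


# 2-second clip: a 200 ms backchannel the VAD misses, then 800 ms of music it flags as speech.
reference = [0] * 50 + [1] * 20 + [0] * 130
predicted = [0] * 50 + [0] * 20 + [0] * 50 + [1] * 80
print(vad_errors(reference, predicted))
```

Missed time becomes missing speaker labels downstream, while false-alarm time forces the diarization stage to assign a speaker to non-speech.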

Production diarization training data includes the original audio plus both VAD and diarization labels, so models can be trained jointly or independently depending on the architecture. End-to-end diarization systems like pyannote.audio handle VAD internally. Pipeline diarization systems use a separate VAD stage that can be retrained for specific acoustic conditions.

The cleaner approach for enterprise teams is to ship VAD-aware labels in the training data, with explicit speech and non-speech boundaries marked. This lets buyers tune their VAD and diarization stages independently without re-annotation.

How is diarization quality measured (DER, JER)?

Diarization quality is measured by Diarization Error Rate (DER), which combines three error types: false alarm (silence labeled as speech), missed speech, and speaker confusion. Lower is better. Production systems target DER under 10% on telephony audio, under 15% on meeting audio.

The DER formula sums total error time across the three categories and divides by total speech time. A DER of 8% means roughly 8% of speech time is incorrectly labeled in some way. The breakdown matters: a system with 2% false alarm, 2% missed speech, and 4% speaker confusion behaves differently from one with 0.5% false alarm, 0.5% missed speech, and 7% speaker confusion, even at the same total DER.
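The arithmetic can be made concrete with a small sketch. The function name and the one-hour example durations are assumptions for illustration; the two systems mirror the breakdowns described above and land on the same total.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total reference speech time.

    All arguments are durations in the same unit (seconds here).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


total = 3600.0  # one hour of reference speech, in seconds

# System A: 2% false alarm, 2% missed speech, 4% speaker confusion.
der_a = diarization_error_rate(72.0, 72.0, 144.0, total)
# System B: 0.5% false alarm, 0.5% missed speech, 7% speaker confusion.
der_b = diarization_error_rate(18.0, 18.0, 252.0, total)

print(f"System A DER: {der_a:.1%}, System B DER: {der_b:.1%}")
```

Both systems report the same headline number, but System B's errors are almost entirely misattributed speech, which is why the per-component breakdown belongs next to the total in any evaluation report.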

Jaccard Error Rate (JER) is a complementary metric that penalizes systematic speaker errors more than DER does. JER computes the per-speaker Jaccard similarity and averages, which surfaces models that consistently confuse two specific speakers across a recording.
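A simplified version of the per-speaker computation is sketched below. It assumes each speaker's speech is represented as a set of frame indices and that system speakers have already been mapped to reference labels. Standard scoring tools compute that mapping themselves, so this only illustrates the averaging step, and the function name is hypothetical.

```python
def jaccard_error_rate(reference, system):
    """Average of (1 - Jaccard similarity) over reference speakers.

    reference, system: dicts mapping speaker label -> set of frame indices.
    Assumes system labels are already mapped onto reference labels.
    """
    if not reference:
        raise ValueError("reference must contain at least one speaker")
    errors = []
    for speaker, ref_frames in reference.items():
        sys_frames = system.get(speaker, set())
        union = ref_frames | sys_frames
        intersection = ref_frames & sys_frames
        errors.append(1.0 - len(intersection) / len(union) if union else 0.0)
    return sum(errors) / len(errors)


# Speaker B talks for only 20 frames, and half of them are attributed to Speaker A.
reference = {"A": set(range(0, 100)), "B": set(range(100, 120))}
system = {"A": set(range(0, 110)), "B": set(range(110, 120))}
print(f"JER: {jaccard_error_rate(reference, system):.1%}")
```

In this example only 10 of 120 frames are confused, so a time-weighted metric stays low, while the per-speaker average exposes that the minority speaker is half wrong.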

Production target ranges vary by domain. Telephony (clean two-speaker calls) typically targets DER under 8%. Call center audio with three or more parties (customer, agent, supervisor monitoring) targets DER under 12%. Meeting recordings with 4 to 10 speakers and significant overlap target DER 10 to 20%. Broadcast audio (interviews, talk shows) typically targets DER 5 to 10% because the speaker count is small and the acoustic conditions are controlled.

Scoring with a 250 ms collar (ignoring errors within 250 ms of a reference speaker boundary) is a long-standing convention from NIST evaluations and remains common in production benchmarks. Stricter benchmarks, including the DIHARD challenges, score with no collar and include overlapped speech. Without the collar, scores look worse by 1 to 3 percentage points, so always check which scoring convention the model card uses.
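To see what a collar removes from scoring, the sketch below builds the unscored windows around each reference boundary and merges the ones that touch. The function name and segment layout are assumptions for this illustration; real scoring tools apply the collar internally.

```python
def collar_regions(reference_segments, collar=0.25):
    """Return merged (start_s, end_s) windows excluded from scoring.

    reference_segments: list of (start_s, end_s, speaker) reference turns.
    Each turn boundary gets a window of +/- collar seconds around it.
    """
    windows = []
    for start, end, _speaker in reference_segments:
        for boundary in (start, end):
            windows.append((max(0.0, boundary - collar), boundary + collar))
    windows.sort()

    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


reference = [(1.0, 3.0, "Speaker_00"), (3.0, 5.0, "Speaker_01")]
print(collar_regions(reference))
```

Each boundary hides half a second of audio from the scorer, which is why recordings with many short turns benefit most from a collar.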

What QA processes apply to multi-speaker audio annotation?

Multi-speaker audio annotation requires multi-tier QA because the error types are diverse. Production pipelines run three layers: primary annotation, QA review for time-boundary accuracy and speaker ID consistency, and senior QC2 review for overlap handling and ambiguous cases. Inter-annotator agreement on speaker boundaries typically targets 85% or higher before paid annotation begins.

The pyannote framework, the open-source toolkit that defines much of modern diarization and is documented in Bredin et al. 2020, illustrates why multi-tier review matters. Speaker boundary errors compound. If the primary annotator places a turn boundary 200 ms too late, the wrong speaker is credited with that 200 ms and the embeddings for both speakers are contaminated, yet downstream evaluation can still look fine because each individual segment is mostly correct.

Worth noting: pyannote co-founder Hervé Bredin has characterized speaker diarization as "the foundational layer of conversational AI," which captures why label quality matters in diarization more than in many adjacent annotation tasks. Errors propagate into every downstream task (transcription attribution, speaker analytics, meeting summarization) and cannot be cleaned up afterward without re-listening.

Production QA tiers handle this through structured passes. Layer one (annotation) produces labels. Layer two (QA) checks time boundaries, speaker ID consistency, and overlap completeness. Layer three (QC2, senior reviewers) handles ambiguous cases: short backchannels, overlapping speech where one voice is barely audible, recordings where one speaker dominates and the second appears briefly.

Inter-annotator agreement on diarization is measured against a gold-standard reference, with thresholds set at 85% IAA on speaker boundaries and 90% on speaker ID assignments before paid annotation begins. Below these thresholds, recalibration is required before continuing. AIxBlock's enterprise speech data collection applies this QA structure across diarization and downstream annotation tasks.
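The text does not specify how boundary agreement is computed, so the sketch below shows one plausible definition as an assumption: the share of gold-standard boundaries that the annotator matched within a tolerance, using greedy one-to-one matching. The function name, the 250 ms tolerance, and the example timestamps are all hypothetical.

```python
def boundary_agreement(gold_ms, candidate_ms, tolerance_ms=250):
    """Fraction of gold boundaries matched by a candidate boundary within tolerance_ms.

    Each candidate boundary can match at most one gold boundary (greedy matching).
    """
    if not gold_ms:
        raise ValueError("gold_ms must contain at least one boundary")
    remaining = sorted(candidate_ms)
    matched = 0
    for gold in sorted(gold_ms):
        hit = next((c for c in remaining if abs(c - gold) <= tolerance_ms), None)
        if hit is not None:
            matched += 1
            remaining.remove(hit)
    return matched / len(gold_ms)


gold = [1000, 5000, 9000, 12000]
annotator = [1100, 5400, 9050, 12010]
score = boundary_agreement(gold, annotator)
print(f"boundary agreement: {score:.0%}, recalibrate: {score < 0.85}")
```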

Closing the loop

Speaker diarization training data has matured from research-grade RTTM files to production-grade annotation pipelines with multi-tier QA, overlap handling, and VAD-aware labels. The teams shipping production diarization in 2026 treat it as a labeled-data engineering discipline, not an annotation afterthought.

If you're scoping a multi-speaker audio annotation project for ASR or call analytics, talk to the AIxBlock data team about diarization-ready data, RTTM delivery, and the QA tiers that help bring DER under 10%.

FAQs About Speaker Diarization Training Data 

How much speaker diarization training data is needed for production? 

Production speaker diarization systems typically need 100 to 1,000 hours of diarized multi-speaker audio for fine-tuning, depending on deployment domain. Call center systems need 100 to 300 hours of diarized telephony audio. Meeting diarization systems need 300 to 1,000 hours of multi-party meeting recordings. Pyannote's pretrained pipeline was trained on roughly 2,500 hours.

What's the typical DER for production diarization systems? 

Production diarization systems target DER under 10% on telephony audio, under 12% on call center audio with three or more parties, and 10 to 20% on meeting audio. Lower DER is achievable on broadcast audio (5 to 10%) where speaker counts are small and conditions are controlled. Scoring with a 250 ms collar is a common evaluation convention, so confirm which convention a reported score uses.

How long does diarization annotation take per hour of audio? 

Diarization annotation typically takes 4 to 8 hours of annotator time per hour of audio when delivered with verbatim transcription. Diarization-only labels (without transcription) run 2 to 4 hours per audio-hour. Overlapped speech sections take longer because each overlap needs two or more speaker tags, and short overlaps under 500 ms are easy to miss.

Can Whisper or other foundation models do diarization? 

Whisper alone does not produce diarization labels. Foundation-model-based pipelines like WhisperX combine Whisper for transcription with pyannote.audio for diarization in a two-stage architecture. End-to-end speaker-aware ASR is an active research area, but production deployments in 2026 still use separate transcription and diarization models.