REAL-TSE Challenge

Real-world Target Speaker Extraction Challenge

Moving target speaker extraction research from simulated benchmarks toward real-world conversational applications

A satellite challenge of IEEE SLT 2026


News

  • Jun 18, 2026 (AoE)

    Evaluation output submission deadline extended: The deadline for both official tracks has been extended from June 20, 2026 (AoE) to June 25, 2026 (AoE).

    The system description paper deadline remains unchanged: July 1, 2026 (AoE). The SLT 2026 Challenge Track paper deadline also remains unchanged: July 8, 2026.

  • Jun 11, 2026 (AoE) Rules updated: The submission quota is now one submission per track per day, but on the final deadline day (Jun 25, 2026, AoE), each team may submit up to three submissions per track.
  • Jun 5, 2026 (AoE) The public REAL-TSE leaderboard is now open. Submission instructions, track links, team approval details, and fallback submission instructions have been sent to all participating teams via email.
  • May 31, 2026 (AoE) The official evaluation set has been released, comprising EVAL-1 (Seen, 2,000 pairs) and EVAL-2 (Unseen, 3,000 pairs), for a total of 5,000 mix-enroll pairs. Download links and password have been sent to all registered teams via email; newly registered teams will receive them within 24 hours of registration. The audio submission deadline is Jun 25, 2026 (AoE), extended from Jun 20, 2026. If you have not received the email after registration, please contact us.
  • May 31, 2026 (AoE) Team registration is now closed. No further teams can register for the REAL-TSE Challenge. Teams that registered prior to this date will continue to receive dataset access via email within 24 hours of their registration.
  • Apr 10, 2026 (AoE) The development set and baseline checkpoints have been sent to all registered teams via email. Teams that register afterwards will also receive them (within 24 hours of registration). If you have not received them after registration, please contact us.
  • Apr 7, 2026 (AoE) Challenge registration is now open. Teams can register via the Google Form.

Introduction

The REAL-TSE Challenge, organized as a satellite challenge of IEEE SLT 2026, focuses on Target Speaker Extraction (TSE) in real conversational environments. Given a multi-speaker mixture recorded in real-life settings such as meetings and dinner-party interactions, together with enrollment utterance(s) from a target speaker, the task is to recover the target speaker’s speech from the mixture. In contrast to conventional benchmarks built on simulated or read-speech data, REAL-TSE emphasizes naturally occurring overlap, authentic reverberation, ambient noise, and conversational dynamics in Mandarin and English, providing a more practical testbed for real-world TSE research.

Data

The REAL-TSE Challenge provides an official development set and evaluation set, while model training is restricted to eligible external open-source data under the rules summarized below.

Dataset Distribution: The challenge datasets are available exclusively to registered teams. Only teams that have completed registration via the Google Form will receive access. The dataset download link and password have been sent to registered teams, and will continue to be sent to newly registered teams. If you have not received the email after registration, please contact us.

Training Set

The challenge does not define an official training set. Participants may use any open-source datasets for training, and pre-trained models are allowed, provided that all data sources and checkpoints are clearly documented.

Usage: Every dataset and pre-trained model must be reported in the system description with citations or links. The development and test sets of AliMeeting, AISHELL-4, AMI, DipCo, and CHiME6 must not be used at any stage, including pre-training, training, or data augmentation. However, the official training splits of these corpora are permitted. Note that some corpora (e.g., AMI) have multiple partition schemes — please refer to the FAQ: What data can I use for training? for details on which sessions are excluded.

Development Set

The development set is derived from REAL-T, a real-world conversational dataset for target speaker extraction (to be released soon). The data originates from five speaker diarization corpora—AISHELL-4, AliMeeting, AMI, DipCo, and CHiME6—covering Mandarin and English in scenarios such as meetings and dinner parties. Each sample includes a multi-speaker mixture, enrollment utterance(s) for the target speaker, and the clean target reference.

The data is constructed through an automated pipeline that extracts naturally overlapping segments as mixtures and selects non-overlapping speech segments of at least 5 seconds as enrollment utterances. Unlike synthetic mixtures, it captures realistic overlap patterns, reverberation, ambient noise, and conversational turn-taking. The development set is intended for validation, model comparison, and hyper-parameter tuning only, and must not be used for training or fine-tuning.
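The selection logic described above can be illustrated with a minimal sketch. It assumes the diarization annotations are available as (speaker, start, end) segments and that a single speaker's own segments never overlap each other. The `Segment` class and both function names are hypothetical; this is not the official REAL-T pipeline, only one way the two selection rules could be expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap_regions(segments):
    """Return (start, end) intervals in which at least two segments are active."""
    events = []
    for seg in segments:
        events.append((seg.start, 1))
        events.append((seg.end, -1))
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back turns are not counted as overlap.
    events.sort(key=lambda e: (e[0], e[1]))
    regions, active, region_start = [], 0, None
    for time, delta in events:
        previous = active
        active += delta
        if previous < 2 <= active:
            region_start = time
        elif previous >= 2 > active and time > region_start:
            regions.append((region_start, time))
    return regions


def enrollment_candidates(segments, min_duration=5.0):
    """Return segments of at least min_duration seconds that no other speaker overlaps."""
    candidates = []
    for seg in segments:
        if seg.end - seg.start < min_duration:
            continue
        clean = all(
            other.speaker == seg.speaker
            or other.end <= seg.start
            or other.start >= seg.end
            for other in segments
        )
        if clean:
            candidates.append(seg)
    return candidates
```

With the 5-second threshold from the text as the default, `overlap_regions` yields candidate mixture intervals and `enrollment_candidates` yields candidate enrollment utterances for each speaker.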

Usage: The development set may be used for hyper-parameter selection, model comparison, and validation, but must not be used for model training or fine-tuning.

Evaluation Set

The evaluation set has been released and consists of two subsets that together assess both in-domain robustness and real-world generalization, with a total of 5,000 mix-enroll pairs:

EVAL-1: Seen Set (2,000 pairs)

Derived from the same open-source corpora as the development set, sharing similar acoustic conditions and conversation styles. There is no overlap between EVAL-1 and the development set. EVAL-1 measures system performance under familiar, in-domain settings.

EVAL-2: Unseen Set (3,000 pairs)

Newly collected real-world conversational recordings specifically captured for the REAL-TSE Challenge. EVAL-2 covers diverse acoustic scenarios — including meeting rooms, cafés, home environments, and in-vehicle conversations — recorded under a variety of conditions such as near-field microphones, far-field microphones, and mobile devices. EVAL-2 provides a comprehensive assessment of model generalization to previously unseen acoustic environments and interaction patterns.

Independent processing required: To ensure a fair evaluation, metadata such as scenario labels, speaker identities, and recording device information has been removed from the released evaluation set. Participants are only allowed to process each provided mix-enroll pair independently. Detailed statistics and analysis of the evaluation data will be published after the challenge.

Access: The evaluation set is available exclusively to registered teams. Download links and password have been sent via email to registered teams, and will continue to be sent to newly registered teams (within 24 hours of registration). If you have not received the email after registration, please contact us.

Usage: The evaluation set must not be used for training, fine-tuning, hyper-parameter tuning, or any form of model optimization. It is reserved exclusively for final evaluation.

Tracks

Track 1: Online Target Speaker Extraction
Low-Latency TSE

Scenario: Target speaker extraction in latency-sensitive applications such as assistive hearing, real-time communication, and interactive voice systems, where the system must produce temporally responsive outputs under overlapping speech and background noise, enabling continuous and natural listening.

Task: Evaluate target speaker extraction under strict end-to-end latency constraints. Effective latency is measured via the temporal response of outputs to localized input perturbations. Models must ensure input changes are reflected within a bounded delay (≤ 100 ms), balancing extraction quality and responsiveness.

Track 2: Offline Target Speaker Extraction
Full-Context TSE

Scenario: Target speaker extraction in offline scenarios such as speech transcription, meeting analysis, and audio post-processing, where the full utterance is available prior to inference.

Task: Evaluate target speaker extraction with full-utterance access and unconstrained inference. Systems may exploit global temporal dependencies and full-context modeling to maximize separation quality. This track targets the upper performance bounds of TSE systems, focusing on speech quality, interference suppression, and robustness without latency constraints.

Baselines

We provide four pretrained baseline models, all based on the BSRNN architecture [1] and trained on the Libri2Mix-100 dataset [2]. These models differ in the type of speaker representation and whether low-latency constraints are considered.

Two models use speaker embeddings extracted from a pretrained ECAPA-TDNN model as speaker conditioning [3][4]. The other two adopt a combination of TF-Map and contextual embeddings [5]. For each type of speaker representation, both an offline and an online (low-latency) variant are provided. All models are trained on 16 kHz audio for 150 epochs with an exponential decay learning rate schedule from 0.001 to 2.5e-5.
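As a rough illustration of the schedule, the sketch below assumes the decay is applied once per epoch, so that the first epoch uses 0.001 and the 150th uses 2.5e-5. The text does not say whether the baselines decay per epoch or per step, and `exponential_lr` is a hypothetical helper rather than code from the baseline recipe.

```python
def exponential_lr(epoch, num_epochs=150, lr_start=1e-3, lr_end=2.5e-5):
    """Learning rate at a 0-based epoch index under per-epoch exponential decay."""
    if not 0 <= epoch < num_epochs:
        raise ValueError("epoch must be in [0, num_epochs)")
    # Constant multiplicative factor chosen so that the last epoch reaches lr_end.
    gamma = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return lr_start * gamma ** epoch
```

Under this assumption the learning rate shrinks by the same factor every epoch, which is what distinguishes an exponential schedule from a linear or step-wise one.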

Provided Checkpoints

Speaker Embedding

ECAPA-TDNN speaker embeddings as conditioning.

  • spk_emb_100 — Offline
  • spk_emb_causal_100 — Online

TF-Map + Context

TF-Map and contextual embeddings as multi-level speaker representation.

  • tfmap_context_100 — Offline
  • tfmap_context_causal_100 — Online

The baseline systems can be run using wesep-real-tse, and have been integrated into the REAL-TSE Challenge repo for automated inference and evaluation.

Pretrained baseline checkpoints have been sent via email to registered teams, and will continue to be sent to newly registered teams.

Baseline results updated: Baseline numbers in the official repository have been updated to reflect the new ASR backbones. Results on the development set, EVAL-1, and EVAL-2 are available on the REAL-TSE-Challenge results page.

A note on baseline limitations: These baselines are trained on Libri2Mix (synthetic mixtures with full speaker overlap and fixed 3-second segments), which differs substantially from REAL-TSE's real conversational data. Baseline performance should therefore be interpreted as a lower bound rather than a representative ceiling. We encourage participants to explore training strategies, data simulation pipelines, and model architectures better suited to real-world conversational TSE.

References

[1] Y. Luo and J. Yu, “Music source separation with band-split RNN,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1893–1901, 2023.

[2] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” 2020.

[3] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech 2020, pp. 3830–3834.

[4] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “WeSpeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP 2023.

[5] K. Zhang, J. Li, S. Wang, Y. Wei, Y. Wang, Y. Wang, and H. Li, “Multi-level speaker representation for target speaker extraction,” in ICASSP 2025.

Evaluation Metrics

Composite Score

The four metrics described below are computed by the official evaluation pipeline. For each metric, valid submissions are ranked independently using dense ranking, and the final challenge ranking is determined by averaging the dense rankings across the four metrics.
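The ranking rule can be sketched as follows. This is a minimal illustration, not the official scoring toolkit: the function names, the short metric keys, and the example scores are all hypothetical, and how ties in the averaged rank are broken is not specified in the text.

```python
def dense_rank(scores, higher_is_better):
    """Map each team to its dense rank (1 = best; ties share a rank, no gaps)."""
    distinct = sorted(set(scores.values()), reverse=higher_is_better)
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {team: rank_of[value] for team, value in scores.items()}


def average_dense_rank(results, higher_is_better):
    """results: {metric: {team: score}}; returns {team: mean dense rank}."""
    per_metric = {
        metric: dense_rank(scores, higher_is_better[metric])
        for metric, scores in results.items()
    }
    teams = next(iter(results.values())).keys()
    return {
        team: sum(ranks[team] for ranks in per_metric.values()) / len(per_metric)
        for team in teams
    }


# Hypothetical scores for three teams; only WER is "lower is better".
results = {
    "WER": {"A": 0.20, "B": 0.25, "C": 0.25},
    "SpkSim": {"A": 0.70, "B": 0.80, "C": 0.60},
    "DNSMOS": {"A": 3.0, "B": 3.0, "C": 2.5},
    "F1": {"A": 0.90, "B": 0.85, "C": 0.80},
}
direction = {"WER": False, "SpkSim": True, "DNSMOS": True, "F1": True}
final = average_dense_rank(results, direction)
```

With dense ranking, teams B and C share rank 2 on WER and no rank is skipped, so a tie on one metric does not push other teams further down on that metric.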

Each team may submit one submission per track per day; on the final deadline day (Jun 25, 2026, AoE), this quota increases to three submissions per track. Only the best-performing submission from each team appears on the final leaderboard and is considered for the official ranking. Detailed formulations for each sub-metric will be specified in the challenge description paper and released with the official scoring toolkit.

Intelligibility — WER

Word Error Rate (WER) measures how accurately the extracted speech can be transcribed. A lower WER indicates better intelligibility.

Official ASR backbones: Zipformer-EN / Zipformer-ZH, chosen for stable transcription with reduced hallucinations.
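For reference, WER is the word-level edit distance between the ASR transcript and the reference transcript, divided by the number of reference words. The sketch below shows only this formula on whitespace-separated tokens; the official pipeline's text normalization and tokenization (for example, character-level units for Mandarin) are not described here, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the extracted audio leads the recognizer to output many extra words.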

Speaker Consistency — SpkSim

Cosine similarity between speaker embeddings of the extracted and reference speech. Higher similarity indicates better speaker fidelity.
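A minimal sketch of the similarity computation, assuming the two embeddings have already been extracted as equal-length numeric vectors. The speaker embedding model used by the official pipeline is not named in this section, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

The value lies in [-1, 1] and is independent of the scale of either embedding, so only the direction of the vectors matters.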

Perceptual Quality — DNSMOS

Neural Mean Opinion Score estimator (DNSMOS) evaluating the overall perceptual quality of the extracted speech.

Target Speaker Presence Rate — F1

Measures how accurately the system outputs speech only in regions where the target speaker is active, using temporal precision, recall, and their harmonic mean (F1).

All audio signals are normalized using a unified scaling procedure before this metric is computed.
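One way to read the temporal precision, recall, and F1 is at the frame level. The sketch below assumes both signals have already been converted to per-frame binary activity masks; how the official toolkit derives activity from audio (normalization, thresholds, frame size) is not specified here, and `timing_prf` is a hypothetical name.

```python
def timing_prf(reference_active, estimated_active):
    """Frame-level precision, recall, and F1 of estimated target-speaker activity."""
    if len(reference_active) != len(estimated_active):
        raise ValueError("masks must have the same number of frames")
    pairs = [(bool(r), bool(e)) for r, e in zip(reference_active, estimated_active)]
    tp = sum(r and e for r, e in pairs)        # output active while target active
    fp = sum((not r) and e for r, e in pairs)  # output active while target silent
    fn = sum(r and (not e) for r, e in pairs)  # target active but output silent
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this reading, leaking interfering speech into target-silent regions lowers precision, while suppressing the target speaker lowers recall.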

Timeline

Key dates for the REAL-TSE Challenge at IEEE SLT 2026. All times follow the official challenge announcement unless stated otherwise.

7 April 2026 (AoE)

Registration Opens (Completed)

Team registration for the challenge opens.

10 April 2026 (AoE)

Development Set & Baselines Released (Completed)

The development dataset and baseline systems are released. Dataset access is restricted to registered teams only; download links and passwords have been sent to registered teams, and will continue to be sent to newly registered teams (within 24 hours of registration).

31 May 2026 (AoE)

Evaluation Set Released & Registration Closes (Completed)

The evaluation set has been released to registered teams via email, comprising EVAL-1 (Seen, 2,000 pairs) and EVAL-2 (Unseen, 3,000 pairs) — 5,000 mix-enroll pairs in total. Team registration is now closed.

5 June 2026 (AoE)

Public Leaderboard Opens (Completed)

The public leaderboard is open for participating teams to monitor and compare submissions. Track links, team approval details, and submission instructions have been sent to all participating teams via email.

25 June 2026 (AoE; extended from 20 June 2026)

Leaderboard Freeze / Submission Deadline

The leaderboard is frozen and evaluation output submissions close.

1 July 2026 (AoE)

System Report Submission Deadline

Deadline for submitting the official system description paper. This deadline remains unchanged.

8 July 2026 (AoE)

SLT Challenge Paper Track: Paper Submission Deadline

Deadline for submitting challenge-related papers to the SLT Challenge Paper Track. Please follow the official SLT 2026 authors' instructions.

1 September 2026 (AoE, tentative)

SLT Challenge Paper Track: Notification of Acceptance

Acceptance decisions for papers submitted to the SLT Challenge Paper Track are announced.

Registration

Please register via this form. We will send you a registration confirmation email within three days of your registration. If you have not received it, please contact us.

Team registration is now closed (as of May 31, 2026, AoE). No further teams can register for the REAL-TSE Challenge.

Leaderboard account registration is open for participating teams. Please create and activate an account on the leaderboard website, complete your profile if prompted, open the corresponding track page, and click Participate to submit your team application.

  • Please make sure your team name, affiliation, and member information match the information previously submitted through the Google Form.
  • Wait for organizer approval. After approval, you will be able to submit results on the corresponding track page.

Rules for Participation

Tracks

Participants may submit results for one or both tracks of the Challenge.

Online Track — Low-Latency TSE

Key Requirements

  1. Latency constraint: The end-to-end algorithmic latency must not exceed 100 ms. Systems may use limited future context (lookahead) as long as this latency budget is satisfied.
  2. Streaming output: Models should support frame-wise or chunk-wise input with continuous output generation, without relying on full-utterance buffering.

Latency

Algorithmic Latency

The offset introduced by the whole processing chain including STFT, iSTFT, overlap-add, additional lookahead frames, etc., compared to just passing the signal through without modification. This does not include buffering latency.

  • Ex.1: STFT-based processing with window length = 20 ms and hop length = 10 ms introduces an algorithmic delay of window length − hop length = 10 ms.
  • Ex.2: STFT-based processing with window length = 32 ms and hop length = 8 ms introduces an algorithmic delay of window length − hop length = 24 ms. For reference, the online BSRNN baselines we provide (tfmap_context_causal_100 and spk_emb_causal_100) use this STFT configuration.
  • Ex.3: An overlap-save based processing algorithm introduces no additional algorithmic latency.
  • Ex.4: A time-domain convolution with a filter kernel size = 16 samples introduces an algorithmic latency of kernel size − 1 = 15 samples. Using one-sided (left) padding with kernel size − 1 samples, the operation introduces no additional algorithmic latency.
  • Ex.5: STFT-based processing with window length = 20 ms and hop length = 10 ms that uses information from 2 future frames introduces an algorithmic latency of (window length − hop length) + 2 × hop length = 30 ms.
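The STFT examples above follow a single formula, which the sketch below turns into a small calculator for checking a configuration against the 100 ms budget. The function names are hypothetical, and the sketch covers only the STFT-with-lookahead case, not every latency-inducing component a system may contain.

```python
def stft_algorithmic_latency_ms(window_ms, hop_ms, lookahead_frames=0):
    """(window length - hop length) + lookahead frames * hop length, in ms."""
    if hop_ms <= 0 or window_ms < hop_ms or lookahead_frames < 0:
        raise ValueError("expected 0 < hop_ms <= window_ms and lookahead_frames >= 0")
    return (window_ms - hop_ms) + lookahead_frames * hop_ms


def within_budget(latency_ms, budget_ms=100.0):
    """True if the latency satisfies the online track's stated limit."""
    return latency_ms <= budget_ms


print(stft_algorithmic_latency_ms(20, 10))     # Ex.1
print(stft_algorithmic_latency_ms(32, 8))      # Ex.2 (online baseline configuration)
print(stft_algorithmic_latency_ms(20, 10, 2))  # Ex.5
```

For the baseline configuration (32 ms window, 8 ms hop), the formula implies that at most 9 lookahead frames fit within the budget, since 24 + 9 × 8 = 96 ms while 10 frames would give 104 ms.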

Buffering Latency

The latency introduced by block-wise processing, often referred to as hop-size, frame-shift, or temporal stride.

  • Ex.1: STFT-based processing has a buffering latency corresponding to the hop size.
  • Ex.2: Overlap-save processing has a buffering latency corresponding to the frame size.
  • Ex.3: A time-domain convolution with stride 1 introduces a buffering latency of 1 sample.

Algorithmic and buffering latency definitions and examples above follow the ICASSP 2023 Deep Noise Suppression Challenge (Microsoft Research).

In addition to the reported algorithmic latency, the effective system latency will be evaluated by measuring the output response delay to a controlled input perturbation. This evaluation is designed to reflect real streaming behavior without requiring participants to explicitly implement a streaming inference pipeline.
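The official measurement procedure is not detailed in this section, so the following is only one possible interpretation of a perturbation test: change a single input sample and find the earliest output sample that reacts. If the output changes before the perturbed position, the system has used future context. `toy_system` is a hypothetical stand-in for a TSE model (a moving average with a configurable lookahead), and the 16 kHz sample rate matches the baselines' training audio.

```python
import math


def toy_system(x, lookahead, past=2):
    """Hypothetical model: output[i] averages x[i - past : i + lookahead + 1]."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - past), min(n, i + lookahead + 1)
        out.append(sum(x[lo:hi]) / (past + lookahead + 1))
    return out


def first_affected_output(system, x, perturb_at, tol=1e-9):
    """Index of the earliest output sample that changes when x[perturb_at] changes."""
    perturbed = list(x)
    perturbed[perturb_at] += 1.0
    baseline, response = system(x), system(perturbed)
    for i, (a, b) in enumerate(zip(baseline, response)):
        if abs(a - b) > tol:
            return i
    raise ValueError("output did not react to the perturbation")


def measured_lookahead_ms(system, x, perturb_at, sample_rate=16000):
    """How far ahead of the perturbed sample the output starts reacting, in ms."""
    first = first_affected_output(system, x, perturb_at)
    return max(0, perturb_at - first) * 1000.0 / sample_rate


signal = [math.sin(0.01 * i) for i in range(2000)]
lookahead_ms = measured_lookahead_ms(lambda s: toy_system(s, lookahead=160), signal, 1000)
```

A test of this kind treats the model as a black box, which is consistent with the statement that participants need not implement a streaming inference pipeline.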

Offline Track — Full-Context TSE

Key Requirements

  1. Full context access: Systems may utilize the entire utterance and global context information.
  2. No restriction on architecture: There is no limitation on model type, inference duration, or computational cost.
  3. Encouraged exploration: Participants are encouraged to explore high-performance architectures, long-context modeling, and generation-based methods.

Training Data & Pre-trained Models

Detailed data usage constraints—including allowed training sources, prohibited corpora, and the usage scope of the development and evaluation sets—are specified in the Data section above. All participants must comply with those requirements.

Submission Rules

Leaderboard Submission

We recommend using the public leaderboard for official submissions. Track pages are available for Track 1: Online Target Speaker Extraction and Track 2: Offline Target Speaker Extraction.

  1. Package contents: For official evaluation, submit one .zip package containing outputs for both EVAL-1 and EVAL-2. EVAL-1-only or EVAL-2-only submissions are not accepted.
  2. Evaluation target: On the submission page, select EVAL1+EVAL2 as the evaluation target and follow the Zip structure instructions shown on the page.
  3. Brief system description: Each leaderboard submission must include a brief system description in the submission form. For Track 1, also include a brief latency description covering algorithmic latency, look-ahead, buffering, or any other latency-inducing components.
  4. Official system description: For a submission to be considered valid for the final challenge, a detailed official system description must be submitted to the organizers by 1 July 2026 (AoE).
  5. Evaluation status: Evaluation runs asynchronously after upload and may take around 2-3 hours, depending on queue and system load. You can monitor progress in the submission history.
  6. Submission quota: Each team may submit one submission per track per day; on the final deadline day (Jun 25, 2026, AoE), each team may submit up to three submissions per track. Only successful submissions count toward this quota, and only the best-performing valid submission from each team will appear on the final leaderboard and be considered for the official ranking.
  7. Final ranking: The leaderboard uses the official REAL-TSE metrics: TER, Timing F1, SIM, and DNSMOS OVRL. The final ranking is computed by averaging dense ranks across the metrics.
  8. Encouraged open source: Teams are encouraged to make their models or scripts publicly available after the challenge.

The organizers reserve the right to further clarify, update, or refine the evaluation protocol if necessary.

FAQ

Can a team participate in both tracks?

Yes, participants may submit results for one or both tracks of the Challenge. The submission quota is one submission per track per day, but on the final deadline day (Jun 25, 2026, AoE), each team may submit up to three submissions per track.

What data can I use for training?

Teams may use only openly available (open-source) datasets for model training. The official training splits of AliMeeting, AISHELL-4, AMI, DipCo, and CHiME6 are permitted, but their development and test sets must not be used at any stage, including pre-training, training, or data augmentation. Pre-trained models are allowed but must be clearly documented.

Note on AMI corpus splits: The AMI corpus has multiple partition schemes (see AMI dataset page). We follow the Full-corpus-ASR partition of meetings. Under this partition, the following sessions are considered development or test data and must not be used:

  • Dev (SB): ES2011, IS1008, TS3004, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011
  • Test (SC / Unseen Eval): ES2004, IS1009, TS3003, EN2002
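When preparing AMI training data, the exclusion list above can be applied as a simple filter. The sketch assumes meeting IDs follow the usual AMI form of a session name plus a letter suffix (for example, ES2011a); the constant and function names are hypothetical.

```python
# Sessions excluded under the Full-corpus-ASR partition, copied from the list above.
EXCLUDED_AMI_SESSIONS = {
    # Dev (SB)
    "ES2011", "IS1008", "TS3004", "IB4001", "IB4002",
    "IB4003", "IB4004", "IB4010", "IB4011",
    # Test (SC / Unseen Eval)
    "ES2004", "IS1009", "TS3003", "EN2002",
}


def is_allowed_ami_meeting(meeting_id):
    """True if the meeting does not belong to an excluded dev/test session."""
    normalized = meeting_id.upper()
    return not any(normalized.startswith(s) for s in EXCLUDED_AMI_SESSIONS)


def filter_ami_meetings(meeting_ids):
    """Keep only the meetings that may be used for training."""
    return [m for m in meeting_ids if is_allowed_ami_meeting(m)]
```

For example, `filter_ami_meetings(["ES2002a", "ES2011a", "EN2002d"])` keeps only `ES2002a`.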

Can I use the evaluation set for training or tuning?

No. The official test set must not be used for training, fine-tuning, hyper-parameter tuning, or any form of model optimization. It is reserved exclusively for final evaluation. The development set may be used for hyper-parameter tuning and validation.

Do I need to apply separately on the leaderboard website?

Yes. After creating and activating your leaderboard account, open the corresponding track page, click Participate, and submit your team application. Please use the same team name, affiliation, and member information as in the original Google Form. Submissions become available after organizer approval.

Can I submit outputs for only EVAL-1 or only EVAL-2?

No. Official leaderboard submissions must use one .zip package containing both EVAL-1 and EVAL-2 outputs. Please select EVAL1+EVAL2 on the submission page and follow the Zip structure instructions shown there.

What should I do if I cannot submit through the leaderboard?

Please contact realtse.challenge@gmail.com with your team name, track, and a brief description of the issue. We recommend using the leaderboard whenever possible; if technical issues persist, the organizers can accept a Google Drive link to your submission package for manual evaluation, but the result may be delayed by 1-2 days.

Organizers

Shuai Wang

Nanjing University

Ke Zhang

Chinese University of Hong Kong, Shenzhen

Zihan Qian

Nanjing University

Zikai Liu

Northwestern Polytechnical University

Haoyu Li

Nanjing University

Marc Delcroix

NTT, Inc.

Zhaokai Sun

Northwestern Polytechnical University

Jiangyu Han

Brno University of Technology

Kai Yu

Shanghai Jiao Tong University

Lei Xie

Northwestern Polytechnical University

Ming Li

Chinese University of Hong Kong, Shenzhen

Haizhou Li

Chinese University of Hong Kong, Shenzhen

Contact

For any questions or inquiries, please feel free to reach out to us at:

realtse.challenge@gmail.com