PHE-Net: Envelope-Guided Speaker Extraction with Unlimited Speaker Scalability via WavLM-Based Discovery

waris, dariush


Preprint

Data sources: ZENODO


Publication: Preprint. Status: Under curation. Publisher: Zenodo

Authors: waris, dariush;

doi: 10.5281/zenodo.19675768



Abstract

We present PHE-Net, a modular voice extraction system that separates individual speakers from single-channel mixtures of 2 to 20 simultaneous talkers. With oracle guidance, the system achieves +18.27 dB SI-SNRi and scales from N=2 to N=20 with zero degradation. In fully blind evaluation, with no enrollment audio, it reaches +8.20 dB SI-SNRi at N=10 speakers. Through systematic ablation, we discover that the spectral envelope channel alone determines extraction quality: speaker embeddings are provably ignored (cosine 0.50 = cosine 1.00), and F0 pitch contributes nothing when the envelope is sufficient (zero-F0 ceiling = +16.25 dB at N=10). This finding reduces the research problem to a single well-defined challenge: improving blind spectral envelope estimation from multi-speaker mixtures.
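The results above are reported as SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the extracted signal over the unprocessed mixture. The abstract does not give the evaluation code, so the following is only a minimal sketch of the standard definition. The function names, the `eps` stabiliser, and the toy signals are assumptions made for illustration and are not taken from PHE-Net.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between a 1-D estimate and its reference."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the reference counts as noise.
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy data (illustrative only): a sine "speaker" plus white-noise interference.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
target = np.sin(2.0 * np.pi * 220.0 * t)
interferer = rng.standard_normal(t.size)
mixture = target + interferer
# A hypothetical extractor that attenuates the interference by a factor of 10.
estimate = target + 0.1 * interferer

improvement = si_snr_improvement(estimate, mixture, target)
print(f"SI-SNRi: {improvement:.2f} dB")
```

In this toy case the interference amplitude is cut by a factor of 10, so the improvement comes out close to 20 dB. Because the metric is scale-invariant, multiplying `estimate` by any nonzero constant leaves the score essentially unchanged.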
