
We present PHE-Net, a modular voice extraction system that separates individual speakers from single-channel mixtures of 2 to 20 simultaneous talkers. With oracle guidance, the system achieves +18.27 dB SI-SNRi and scales from N=2 to N=20 speakers with zero degradation. In fully blind evaluation, with no enrollment audio, it achieves +8.20 dB SI-SNRi at N=10. Through systematic ablation, we find that the spectral envelope channel alone determines extraction quality: speaker embeddings are demonstrably ignored (results at cosine similarity 0.50 match those at 1.00), and F0 pitch contributes nothing when the envelope is sufficient (the zero-F0 ceiling is +16.25 dB at N=10). This finding reduces the research problem to a single well-defined challenge: improving blind spectral envelope estimation from multi-speaker mixtures.
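For readers unfamiliar with the metric, the following is a minimal sketch of how SI-SNRi (scale-invariant signal-to-noise ratio improvement) is commonly computed. It follows the standard definition rather than PHE-Net's own evaluation code, which is not shown here; the function names `si_snr` and `si_snr_improvement` and the `eps` stabiliser are assumptions made for illustration.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample lists.

    Standard definition; `eps` is an illustrative stabiliser, not a value
    taken from the paper.
    """
    n = len(target)
    # Remove the mean so the measure ignores any DC offset.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: the part that is "signal".
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]
    # Everything left over counts as noise or interference.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


if __name__ == "__main__":
    # Toy signals (hypothetical): a target tone plus an equal-power interferer.
    n = 1000
    target = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
    interferer = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
    mixture = [a + b for a, b in zip(target, interferer)]
    # A pretend extractor that attenuates the interferer by a factor of 10.
    estimate = [a + 0.1 * b for a, b in zip(target, interferer)]
    print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged. The reported figures are improvements, so they measure how much the extracted signal gains over simply listening to the mixture.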
