Speech Watermarking with Discrete Intermediate Representations

Name: Speech Watermarking with Discrete Intermediate Representations
Keywords: Signal Processing (eess.SP), FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)

Ji, Shengpeng; Jiang, Ziyue; Zuo, Jialong; Fang, Minghui; Chen, Yifu; Jin, Tao; Zhao, Zhou

arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

Proceedings of the AAAI Conference on Artificial Intelligence

Article . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2024

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

Publication: Article, Preprint. 11 Apr 2025. Embargo end date: 01 Jan 2024.

Publisher: Association for the Advancement of Artificial Intelligence (AAAI)

Journal: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24239-24247 (ISSN: 2159-5399, eISSN: 2374-3468)

doi: 10.1609/aaai.v39i23.34600 , 10.48550/arxiv.2412.13917

arXiv: 2412.13917

Abstract

Speech watermarking can proactively mitigate the potentially harmful consequences of instant voice cloning. It inserts signals into speech that are imperceptible to humans but detectable by algorithms. Previous approaches typically embed watermark messages into a continuous space. Intuitively, however, embedding watermark information into a robust discrete latent space can significantly improve the robustness of watermarking systems. In this paper, we propose DiscreteWM, a novel speech watermarking framework that injects watermarks into the discrete intermediate representations of speech. Specifically, we map speech into a discrete latent space with a vector-quantized autoencoder and inject watermarks by changing the modular arithmetic relation of discrete IDs. To ensure the imperceptibility of watermarks, we also propose a manipulator model that selects the candidate tokens for watermark embedding. Experimental results demonstrate that our framework achieves state-of-the-art performance in robustness and imperceptibility simultaneously. Moreover, our flexible frame-wise approach can serve as an efficient solution for both voice cloning detection and information hiding. Additionally, DiscreteWM can encode 1 to 150 bits of watermark information within a 1-second speech clip, which demonstrates its encoding capacity.
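
The abstract does not spell out the exact modular relation, so the following is only a toy sketch of the general idea of hiding bits in the residues of discrete token IDs. It assumes a modulus of 2 (one bit per frame) and replaces the manipulator model with a hypothetical, precomputed list of acceptable replacement IDs per frame. The names `MODULUS`, `embed_bits`, `extract_bits`, and all token values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence

# Assumption: one bit per frame, carried by the parity of the token ID.
MODULUS = 2


def embed_bits(
    token_ids: Sequence[int],
    bits: Sequence[int],
    candidates: Sequence[Sequence[int]],
) -> List[int]:
    """Return token IDs whose residue modulo MODULUS equals the bit for each frame.

    `candidates[i]` is a hypothetical ranked list of replacement IDs for frame i,
    standing in for whatever a learned manipulator model would propose.
    """
    watermarked = []
    for tok, bit, cands in zip(token_ids, bits, candidates):
        if tok % MODULUS == bit:
            # The original token already encodes the bit, so keep it.
            watermarked.append(tok)
            continue
        # Otherwise take the first candidate with the required residue.
        match = next((c for c in cands if c % MODULUS == bit), None)
        if match is None:
            raise ValueError(f"no candidate encodes bit {bit} for token {tok}")
        watermarked.append(match)
    return watermarked


def extract_bits(token_ids: Sequence[int]) -> List[int]:
    """Read the hidden bits back as the residues of the token IDs."""
    return [tok % MODULUS for tok in token_ids]


# Made-up token IDs for four frames and a 4-bit message.
tokens = [17, 42, 8, 301]
message = [1, 1, 0, 0]
frame_candidates = [[17, 90, 5], [42, 43, 7], [8, 12], [301, 300, 12]]

watermarked = embed_bits(tokens, message, frame_candidates)
recovered = extract_bits(watermarked)
```

In this sketch, frames whose token already has the right parity are left untouched, which keeps the number of modified frames small. A real system would decode the modified tokens back to audio and re-encode before extraction, a step this example omits.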

Related Organizations

Zhejiang University
Zhejiang Ocean University
China (People's Republic of)

Impact by BIP!

Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
