CodeCipher: Learning to Obfuscate Source Code Against LLMs

Keywords: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)

Yalan Lin; Chengcheng Wan 0001; Yixiong Fang; Xiaodong Gu 0002

arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

https://dx.doi.org/10.48550/ar...

Article . 2024

License: CC BY

Data sources: Datacite

DBLP

Article

Data sources: DBLP

Publication: Article, Preprint. 01 Jan 2024. Embargo end date: 01 Jan 2024. Publisher: arXiv. Journal: CoRR, volume abs/2410.05797

doi: 10.48550/arxiv.2410.05797

arXiv: 2410.05797

Abstract

While large code language models have made significant strides in AI-assisted coding tasks, there are growing concerns about privacy. User code is transparent to the cloud LLM service provider, which creates risks of unauthorized training on, reading of, and execution of that code. In this paper, we propose CodeCipher, a novel method that perturbs privacy-sensitive information in code while preserving the original response from LLMs. CodeCipher transforms the LLM's embedding matrix so that each row corresponds to a different word in the original matrix, forming a token-to-token confusion mapping for obfuscating source code. The new embedding matrix is optimized by minimizing the task-specific loss function. To tackle the discrete and sparse nature of word vector spaces, CodeCipher adopts a discrete optimization strategy that aligns the updated vector to the nearest valid token in the vocabulary before each gradient update. We demonstrate the effectiveness of our approach on three AI-assisted coding tasks: code completion, summarization, and translation. Results show that our model successfully obscures private information in source code while preserving the original LLM's performance.
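
The nearest-token alignment and the resulting token-to-token confusion mapping described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the four-word vocabulary, the 2-D embeddings, the hand-written `updated` rows (standing in for rows after an optimization step), and the helper names `nearest_token`, `confusion_mapping`, and `obfuscate` are all assumptions made for illustration, and the task-specific loss and gradient updates are omitted entirely.

```python
import math

def nearest_token(vector, embeddings):
    """Index of the row in `embeddings` closest to `vector` (Euclidean distance)."""
    return min(range(len(embeddings)), key=lambda j: math.dist(vector, embeddings[j]))

def confusion_mapping(updated, embeddings):
    """Snap every updated row to its nearest valid token: token id -> token id."""
    return [nearest_token(row, embeddings) for row in updated]

def obfuscate(token_ids, mapping):
    """Replace each token id with the id it is mapped to."""
    return [mapping[t] for t in token_ids]

# Hypothetical vocabulary and original embedding matrix (one row per token).
vocab = ["user", "password", "foo", "bar"]
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical rows after an update; in the paper these would come from
# minimizing a task-specific loss, which this sketch does not model.
updated = [[0.1, 0.9], [0.9, 0.8], [0.1, 0.1], [0.9, 0.2]]

mapping = confusion_mapping(updated, embeddings)
obfuscated_ids = obfuscate([0, 1], mapping)
obfuscated_words = [vocab[i] for i in obfuscated_ids]
print(mapping, obfuscated_words)
```

In this toy setup each perturbed row lands on a different valid token, so the sequence "user password" is rewritten with other vocabulary words. Whether the LLM's output is preserved depends on the optimization step, which is outside the scope of the sketch.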

1 Research products, page 1 of 1

google-java-format software on GitHub
IsRelatedTo

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green
