How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Keywords: Machine Learning, FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Artificial Intelligence, Cryptography and Security, Computation and Language, Computation and Language (cs.CL), Cryptography and Security (cs.CR), Machine Learning (cs.LG)

Mostafa, Ahmed; Nahid, Raisul Arefin; Mulder, Samuel


arXiv.org e-Print Archive

Preprint . 2025

Data sources: arXiv.org e-Print Archive

https://doi.org/10.14722/bar.2...

Article . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2025

License: arXiv Non-Exclusive Distribution

Data sources: Datacite


Publication: Article, Preprint, 01 Jan 2025
Embargo end date: 01 Jan 2025
Publisher: Internet Society
Journal: Proceedings 2025 Workshop on Binary Analysis Research


doi: 10.14722/bar.2025.23013 , 10.48550/arxiv.2511.03825

arXiv: 2511.03825


Abstract

Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction -- a critical problem in binary code analysis. To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
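To make the intrinsic measures named in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical regex-based pre-tokenization rule for x86-style assembly text and computes two simple intrinsic statistics: vocabulary size and average tokens per instruction. The names `TOKEN_RE`, `pre_tokenize`, and `intrinsic_stats`, as well as the sample instructions, are illustrative assumptions and do not come from the paper.

```python
import re
from collections import Counter

# Hypothetical pre-tokenization rule (an assumption, not taken from the paper):
# keep hex immediates, decimal immediates, and identifiers (mnemonics, registers)
# whole, and emit every other non-space character as its own token.
TOKEN_RE = re.compile(
    r"0x[0-9a-fA-F]+"             # hex immediate, e.g. 0x8
    r"|\d+"                       # decimal immediate, e.g. 1
    r"|[A-Za-z_.][A-Za-z0-9_.]*"  # mnemonic, register, or label
    r"|[^\sA-Za-z0-9_]"           # single punctuation mark, e.g. , [ ] -
)


def pre_tokenize(instruction: str) -> list[str]:
    """Split one assembly instruction into pre-tokens."""
    return TOKEN_RE.findall(instruction)


def intrinsic_stats(instructions: list[str]) -> dict[str, float]:
    """Compute vocabulary size and mean tokens per instruction."""
    vocab = Counter()
    total_tokens = 0
    for instruction in instructions:
        tokens = pre_tokenize(instruction)
        vocab.update(tokens)
        total_tokens += len(tokens)
    return {
        "vocab_size": len(vocab),
        "tokens_per_instruction": total_tokens / len(instructions),
    }


# Illustrative input only; not data from the study.
sample = ["mov rax, [rbp-0x8]", "add rax, 1", "ret"]

if __name__ == "__main__":
    print(pre_tokenize(sample[0]))
    print(intrinsic_stats(sample))
```

A learned subword tokenizer (for example BPE or Unigram) would normally run on top of such pre-tokens. Changing the pre-tokenization rule changes both statistics, which is one way the trade-offs described in the abstract can arise.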

Publication Notice. This paper was published in the BAR 2025 Workshop (with NDSS 2025) and is for research and educational use. Copyright © 2025 Internet Society. All rights reserved. Personal/classroom reproduction is permitted with this notice and full paper citation. All other uses, including commercial, require prior written permission from the Internet Society.


Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Related to Research communities

Knowmad Institut