
arXiv: 2410.15696
Abstract: Tokenization is the first step in modern neural language model pipelines, in which input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework that can encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, are also efficiently representable by simple finite-state transducers. For BPE this is particularly surprising, given that it does not tokenize strings from left to right and requires a notion of priority. We also discuss an application of subword-level pattern promotion to guided generation, where the outputs of a language model are constrained to match a specified pattern, and how tokenization-aware promotion offers a theoretical benefit to modeling.
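For readers unfamiliar with the two schemes named in the abstract, the following is a minimal illustrative sketch of MaxMatch (greedy longest-match, WordPiece-style) and BPE inference in Python. The vocabulary and merge list are hypothetical examples, and this plain string-based implementation is not the paper's finite-state construction; it only highlights the contrast the abstract draws: MaxMatch consumes the string strictly left to right, while BPE applies prioritized merges anywhere in the sequence.

```python
def maxmatch_tokenize(text, vocab):
    """Greedy left-to-right longest-match tokenization (MaxMatch / WordPiece-style)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token in the vocabulary covers position {i}")
    return tokens


def bpe_tokenize(text, merges):
    """BPE inference: repeatedly apply the highest-priority merge found
    anywhere in the sequence -- note this is *not* left to right."""
    seq = list(text)
    rank = {pair: r for r, pair in enumerate(merges)}  # lower rank = higher priority
    while True:
        candidates = [(rank[(a, b)], k)
                      for k, (a, b) in enumerate(zip(seq, seq[1:]))
                      if (a, b) in rank]
        if not candidates:
            return seq
        _, k = min(candidates)                 # best-ranked merge wins
        seq[k:k + 2] = [seq[k] + seq[k + 1]]   # merge the adjacent pair


if __name__ == "__main__":
    # Hypothetical vocabulary and merge list, for illustration only.
    vocab = {"un", "unhappy", "happy", "h", "a", "p", "y", "u", "n"}
    print(maxmatch_tokenize("unhappy", vocab))  # ['unhappy']

    merges = [("h", "a"), ("ha", "p"), ("hap", "p"), ("happ", "y"), ("u", "n")]
    print(bpe_tokenize("unhappy", merges))      # ['un', 'happy']
```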
FOS: Computer and information sciences. Subjects: Computation and Language (cs.CL), Formal Languages and Automata Theory (cs.FL)
| Indicator | Description | Value |
|-----------|-------------|-------|
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article based on the underlying citation network (diachronically). | 0 |
| Popularity | Reflects the "current" impact/attention (the "hype") of the article in the research community at large, based on the underlying citation network. | Average |
| Influence | Reflects the overall/total impact of the article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | Reflects the initial momentum of the article directly after its publication, based on the underlying citation network. | Average |
