A method for compressing lexicons

Publication: Article, Conference object, 25 Jun 2003, France
Publisher: IEEE Comput. Soc
Journal: Proceedings DCC 2002. Data Compression Conference

Authors: Ristov, Strahil; Laporte, Eric

doi: 10.1109/dcc.2002.1000013

Abstract

Summary form only given. Lexicon lookup is an essential part of almost every natural language processing system. A natural language lexicon is a set of strings where each string consists of a word and the associated linguistic data. Its computer representation is a structure that returns the appropriate linguistic data for a given input word. It should be small and fast. We propose a method for lexicon compression based on a very efficient trie compression method and the inverted file paradigm. The method was applied to a 664,000-string, 18 Mbyte French phonetic and grammatical electronic dictionary for spelling-to-phonetics conversion. Entries in the lexicon are strings consisting of a word, its phonetic transcription, and some additional codes.
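The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it names: a trie indexes the words, and the linguistic data is kept in a separate table where each distinct data string is stored once, loosely in the spirit of an inverted file. It is not the authors' compression method (in particular, the trie here is not compressed), and the class name `TrieLexicon`, its methods, and the sample entries are all hypothetical.

```python
from typing import Dict, List


class TrieLexicon:
    """Toy lexicon: an uncompressed trie over words plus a table of
    distinct data strings. Illustrative only, not the paper's method."""

    def __init__(self) -> None:
        # Node i has outgoing edges children[i] (character -> node id)
        # and payload[i], a list of indices into data_table.
        self.children: List[Dict[str, int]] = [{}]
        self.payload: List[List[int]] = [[]]
        # Each distinct data string is stored once and referenced by index.
        self.data_table: List[str] = []
        self._data_index: Dict[str, int] = {}

    def _intern(self, data: str) -> int:
        """Return the table index of `data`, adding it if it is new."""
        idx = self._data_index.get(data)
        if idx is None:
            idx = len(self.data_table)
            self.data_table.append(data)
            self._data_index[data] = idx
        return idx

    def add(self, word: str, data: str) -> None:
        """Insert one (word, data) entry, creating trie nodes as needed."""
        node = 0
        for ch in word:
            nxt = self.children[node].get(ch)
            if nxt is None:
                nxt = len(self.children)
                self.children.append({})
                self.payload.append([])
                self.children[node][ch] = nxt
            node = nxt
        idx = self._intern(data)
        if idx not in self.payload[node]:
            self.payload[node].append(idx)

    def lookup(self, word: str) -> List[str]:
        """Return all data strings stored for `word` (empty if unknown)."""
        node = 0
        for ch in word:
            nxt = self.children[node].get(ch)
            if nxt is None:
                return []
            node = nxt
        return [self.data_table[i] for i in self.payload[node]]


# Hypothetical entries: a word mapped to made-up grammatical codes.
lexicon = TrieLexicon()
lexicon.add("chat", "N:ms")
lexicon.add("rat", "N:ms")    # same data string as "chat": stored once
lexicon.add("chats", "N:mp")
```

In this sketch, words with a common prefix share trie nodes, and words with identical data share one table entry, so `lexicon.data_table` holds two strings for three entries. A real system would also need to compress the trie itself, which is the part the abstract attributes to the authors' method.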

Country

France

Related Organizations

French National Centre for Scientific Research, France
University of Marne la Vallée, France
Laboratoire d'Informatique Gaspard-Monge, France
Ruđer Bošković Institute, Croatia
UNIVERSITE GUSTAVE EIFFEL, France

Keywords

[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]

Impact by BIP!

Selected citations: 1
Popularity: Average
Influence: Top 10%
Impulse: Average