A framework for text categorization

The field of automatic Text Categorization (TC) concerns the creation of categorizer functions, usually involving Machine Learning techniques, to assign labels from a pre-defined set of categories to documents based on the documents' content. Because there are many ways to achieve this and many kinds of applications in which it can be employed, creating specific TC applications is often a difficult task. This thesis concerns the design, implementation, and testing of an Object-Oriented Application Framework for Text Categorization. By encoding expertise in the architecture of the framework, many of the barriers to creating TC applications are eliminated. Developers can focus on the domain-specific aspects of their applications, leaving the generic aspects of categorization to the framework. This allows significant code and design reuse when building new applications.

Chapter 1 provides an introduction to automatic Text Categorization, Object-Oriented Application Frameworks, and Design Patterns. Some common application areas and benefits of using automatic TC are discussed. Frameworks are defined and their advantages compared to other software engineering strategies are presented. Design Patterns are defined and placed in the context of framework development. An overview of three related products in the TC space (Weka, Autonomy, and Teragram) follows.

Chapter 2 contains a detailed presentation of Text Categorization. TC is formally defined, followed by a detailed account of the main functional areas that a modern TC framework must provide. These include document tokenizing, feature selection and reduction, Machine Learning techniques, and categorization runtime behavior. Four Machine Learning techniques (Naïve Bayes categorizers, k-Nearest-Neighbor categorizers, Support Vector Machines, and Decision Trees) are presented, with discussions of their core algorithms and the computational complexity involved. Several measures for evaluating the quality of a categorizer are then defined, including precision, recall, and the F_β measure.

The design of a framework that addresses the functional areas from Chapter 2 is presented in Chapter 3. This design is motivated by consideration of the framework's audience and some expected usage scenarios. The core architectural classes in the framework are then presented, and Design Patterns are employed in a detailed discussion of the cooperative relationships among framework classes. This is the first known use of Design Patterns in an academic work on Text Categorization software. Following the presentation of the framework design, some possible design limitations are discussed.

The design in Chapter 3 has been implemented as the AI::Categorizer Perl package. Chapter 4 is a short discussion of implementation issues, including considerations in choosing the programming language. Special consideration is given to the implementation of constructor methods in the framework, since they are responsible for enforcing the structural relationships among framework classes. Three data structure issues within the framework are then discussed: feature vectors, sets of document or category objects, and the serialized representation of a framework object.

Chapter 5 evaluates the framework from several different perspectives on two corpora. The first corpus is the standard Reuters-21578 benchmark corpus, and the second is assembled from messages sent to an educational ask-an-expert service. Using these corpora, the framework is evaluated on the measures introduced in Chapter 2. The performance on the first corpus is compared to the well-known results in [50]. The Naïve Bayes categorizer is found to be competitive with standard implementations in the literature, while the Support Vector Machine and k-Nearest-Neighbor implementations are outperformed by comparable systems from other researchers. The framework is then evaluated in terms of its resource usage, and several applications using AI::Categorizer are presented in order to show the framework's ability to function in the usage scenarios discussed in Chapter 3.
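As an illustration of the evaluation measures named above (precision, recall, and F_β), the following is a minimal Python sketch for a single category. It assumes the usual textbook definitions, with F_β = (1 + β²)·P·R / (β²·P + R); the thesis may state them differently. The function name, parameter names, and document ids are hypothetical and are not taken from AI::Categorizer.

```python
def precision_recall_fbeta(assigned, correct, beta=1.0):
    """Score one category (hypothetical helper, not part of AI::Categorizer).

    assigned: set of document ids the categorizer placed in the category
    correct:  set of document ids that truly belong to the category
    beta:     weight of recall relative to precision (beta=1 gives F1)
    """
    true_positives = len(assigned & correct)

    # Precision: fraction of assigned documents that are correct.
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: fraction of correct documents that were assigned.
    recall = true_positives / len(correct) if correct else 0.0

    # Assumed standard form: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        fbeta = 0.0
    else:
        fbeta = (1 + beta ** 2) * precision * recall / denominator
    return precision, recall, fbeta


if __name__ == "__main__":
    # Made-up example data: four documents assigned, three truly relevant.
    assigned = {"d1", "d2", "d3", "d4"}
    correct = {"d2", "d3", "d5"}
    p, r, f1 = precision_recall_fbeta(assigned, correct)
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Setting `beta` above 1 weights recall more heavily, and setting it below 1 weights precision more heavily.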

Country

Australia

Related Organizations

University of Sydney
Australia

Keywords

Information storage and retrieval systems, 005, Computational linguistics, Text processing (Computer science)

