
A framework for text categorization

Authors: Williams, Ken

Abstract

The field of automatic Text Categorization (TC) concerns the creation of categorizer functions, usually involving Machine Learning techniques, to assign labels from a pre-defined set of categories to documents based on the documents' content. Because of the many variations on how this can be achieved and the diversity of applications in which it can be employed, creating specific TC applications is often a difficult task.

This thesis concerns the design, implementation, and testing of an Object-Oriented Application Framework for Text Categorization. By encoding expertise in the architecture of the framework, many of the barriers to creating TC applications are eliminated. Developers can focus on the domain-specific aspects of their applications, leaving the generic aspects of categorization to the framework. This allows significant code and design reuse when building new applications.

Chapter 1 provides an introduction to automatic Text Categorization, Object-Oriented Application Frameworks, and Design Patterns. Some common application areas and benefits of using automatic TC are discussed. Frameworks are defined and their advantages compared to other software engineering strategies are presented. Design patterns are defined and placed in the context of framework development. An overview of three related products in the TC space, Weka, Autonomy, and Teragram, follows.

Chapter 2 contains a detailed presentation of Text Categorization. TC is formally defined, followed by a detailed account of the main functional areas in Text Categorization that a modern TC framework must provide. These include document tokenizing, feature selection and reduction, Machine Learning techniques, and categorization runtime behavior. Four Machine Learning techniques (Naïve Bayes categorizers, k-Nearest-Neighbor categorizers, Support Vector Machines, and Decision Trees) are presented, with discussions of their core algorithms and the computational complexity involved. Several measures for evaluating the quality of a categorizer are then defined, including precision, recall, and the Fβ measure.

The design of a framework that addresses the functional areas from Chapter 2 is presented in Chapter 3. This design is motivated by consideration of the framework's audience and some expected usage scenarios. The core architectural classes in the framework are then presented, and Design Patterns are employed in a detailed discussion of the cooperative relationships among framework classes. This is the first known use of Design Patterns in an academic work on Text Categorization software. Following the presentation of the framework design, some possible design limitations are discussed.

The design in Chapter 3 has been implemented as the AI::Categorizer Perl package. Chapter 4 is a short discussion of implementation issues, including considerations in choosing the programming language. Special consideration is given to the implementation of constructor methods in the framework, since they are responsible for enforcing the structural relationships among framework classes. Three data structure issues within the framework are then discussed: feature vectors, sets of document or category objects, and the serialized representation of a framework object.

Chapter 5 evaluates the framework from several different perspectives on two corpora. The first corpus is the standard Reuters-21578 benchmark corpus, and the second is assembled from messages sent to an educational ask-an-expert service. Using these corpora, the framework is evaluated on the measures introduced in Chapter 2. The performance on the first corpus is compared to the well-known results in [50]. The Naïve Bayes categorizer is found to be competitive with standard implementations in the literature, and the Support Vector Machine and k-Nearest-Neighbor implementations are outperformed by comparable systems by other researchers. The framework is then evaluated in terms of its resource usage, and several applications using AI::Categorizer are presented in order to show the framework's ability to function in the usage scenarios discussed in Chapter 3.
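
The abstract names precision, recall, and the Fβ measure as the evaluation measures but does not reproduce their definitions. As a rough illustration only, the sketch below uses the standard textbook definitions for a single category's binary decisions; the thesis may define or average them differently (for example, micro- or macro-averaging across categories). The function name `precision_recall_fbeta` and its parameters are hypothetical and are not part of AI::Categorizer, which is a Perl package; Python is used here purely for illustration.

```python
def precision_recall_fbeta(true_labels, predicted_labels, beta=1.0):
    """Return (precision, recall, F-beta) for one category.

    Assumes the standard definitions (not taken from the thesis):
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      F_beta    = (1 + beta^2) * P * R / (beta^2 * P + R)
    Each label is truthy if the document belongs to the category.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # correctly assigned
    fp = sum(1 for t, p in pairs if not t and p)    # wrongly assigned
    fn = sum(1 for t, p in pairs if t and not p)    # missed

    # Convention chosen for this sketch: undefined ratios become 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    b2 = beta * beta
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


if __name__ == "__main__":
    # Five documents: three truly in the category, three predicted in it.
    truth = [1, 1, 1, 0, 0]
    predicted = [1, 1, 0, 1, 0]
    print(precision_recall_fbeta(truth, predicted, beta=1.0))
```

With beta greater than 1 the measure weights recall more heavily than precision, and with beta less than 1 it weights precision more heavily; beta = 1 gives the familiar F1 harmonic mean.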

Country
Australia
Keywords

Information storage and retrieval systems, 005, Computational linguistics, Text processing (Computer science)
