Evaluating two methods for Treebank grammar compaction

Article English OPEN
Krotov, A. ; Hepple, M. ; Gaizauskas, R. ; Wilks, Y. (1999)
  • Publisher: Cambridge University Press
  • Subject:
    acm: Theory of Computation / Mathematical Logic and Formal Languages

Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar.

In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
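
As a rough illustration of the two techniques named in the abstract, the following Python sketch reads rules off a toy tree, thresholds rules by occurrence count, and then drops any rule whose right-hand side can be parsed back to its left-hand side by the remaining rules. The tree encoding, the function names (`read_off_rules`, `threshold`, `parsable`, `compact`) and the rule counts are all assumptions made for this example and are not taken from the paper. In particular, the greedy single pass in `compact` is only one plausible reading of the non-probabilistic rule-parsing variant, and the probabilistic variant is not shown.

```python
from collections import Counter

# Assumed encoding: a tree is (label, [children]); a leaf is a POS-tag string.
def read_off_rules(tree, counts):
    """Count one context-free rule per internal node of the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            read_off_rules(child, counts)
    return counts

def threshold(counts, min_count):
    """Keep only rules seen at least min_count times."""
    return Counter({r: n for r, n in counts.items() if n >= min_count})

def parsable(rule, rules):
    """True if the rule's RHS can be parsed to its LHS using the other rules."""
    lhs, rhs = rule
    others = [r for r in rules if r != rule]
    n = len(rhs)
    # chart[(i, j)] holds every symbol that can span RHS positions i..j.
    chart = {(i, i + 1): {rhs[i]} for i in range(n)}

    def matches(seq, i, j):
        # Can the symbols of seq cover positions i..j, left to right?
        if not seq:
            return i == j
        first, rest = seq[0], seq[1:]
        for k in range(i + 1, j - len(rest) + 1):
            if first in chart.get((i, k), ()) and matches(rest, k, j):
                return True
        return False

    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = chart.setdefault((i, j), set())
            changed = True
            while changed:  # iterate to a fixed point so unary chains are found
                changed = False
                for x, beta in others:
                    if x not in cell and len(beta) <= length and matches(beta, i, j):
                        cell.add(x)
                        changed = True
    return lhs in chart[(0, n)]

def compact(rules):
    """Greedy pass: drop each rule that the currently kept rules can parse."""
    kept = list(rules)
    for rule in list(rules):
        if parsable(rule, kept):
            kept.remove(rule)
    return kept

# Reading rules off a single toy tree.
tree = ("S", [("NP", ["DT", "NN"]), ("VP", ["VBD", ("NP", ["DT", "NN"])])])
tree_counts = read_off_rules(tree, Counter())

# Made-up counts, purely for illustration.
counts = Counter({
    ("NP", ("DT", "NN")): 50,
    ("NP", ("NP", "PP")): 20,
    ("NP", ("DT", "NN", "PP")): 5,
    ("NP", ("DT", "JJ", "JJ", "NN")): 1,
})
frequent = threshold(counts, min_count=2)   # drops the rule seen once
compacted = compact(list(frequent))
print(compacted)  # [('NP', ('DT', 'NN')), ('NP', ('NP', 'PP'))]
```

Here `NP -> DT NN PP` is removed because `NP -> DT NN` followed by `NP -> NP PP` already derives the same sequence. The result of a greedy pass like this can depend on the order in which rules are visited, and `matches` is written for clarity rather than speed.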