publication . Article . Preprint . 2011

LexRank: Graph-based Lexical Centrality as Salience in Text Summarization

Erkan, G.; Radev, D. R.;
Open Access English
  • Published: 09 Sep 2011
Abstract
We introduce a stochastic graph-based method for computing relative importance of textual units for Natural Language Processing. We test the technique on the problem of Text Summarization (TS). Extractive TS relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Our system, based on LexRank ranked in first place in more than one task in the recent DUC 2004 evaluation. In this paper we present a detailed analysis of our approach and apply it to a larger data set including data from earlier DUC evaluations. We discuss several methods to compute centrality using the similarity graph. The results show that degree-based methods (including LexRank) outperform both centroid-based methods and other systems participating in DUC in most of the cases. Furthermore, the LexRank with threshold method outperforms the other degree-based techniques including continuous LexRank. We also show that our approach is quite insensitive to the noise in the data that may result from an imperfect topical clustering of documents.
Persistent Identifiers
Subjects
ACM Computing Classification System: ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
free text keywords: Computer Science - Computation and Language, Artificial Intelligence, Adjacency matrix, Sentence, Graph (abstract data type), Cluster analysis, Computer science, Graph, Centrality, Ranking, Natural language processing, computer.software_genre, computer, Cosine similarity, Artificial intelligence, business.industry, business, Eigenvector centrality, Automatic summarization
Related Organizations
Communities
Communities with gateway
OpenAIRE Connect image
Funded by
NSF| Probabilistic and link-based Methods for Exploiting Very Large Textual Repositories
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 0329043
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Information and Intelligent Systems
37 references, page 1 of 3

Barzilay, R., & Elhadad, M. (1999). Using Lexical Chains for Text Summarization. In Mani, I., & Maybury, M. T. (Eds.), Advances in Automatic Text Summarization, pp. 111-121. The MIT Press.

Barzilay, R., & Lee, L. (2003). Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of HLT-NAACL.

Baxendale, P. (1958). Man-made index for technical litterature - an experiment. IBM J. Res. Dev., 2 (4), 354-361.

Brandow, R., Mitze, K., & Rau, L. F. (1995). Automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31 (5), 675-685. [OpenAIRE]

Brew, C., & im Walde, S. S. (2002). Spectral clustering for german verbs. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30 (1-7), 107-117.

Carbonell, J. G., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Research and Development in Information Retrieval, pp. 335-336.

Collins, M. (1997). Three generative, lexicalised models for statistical parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.

Daum´e III, H., & Marcu, D. (2004). A phrase-based hmm approach to document/abstract alignment. In Lin, D., & Wu, D. (Eds.), Proceedings of EMNLP 2004, pp. 119-126 Barcelona, Spain. Association for Computational Linguistics.

Edmundson, H. (1969). New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 16 (2), 264-285. [OpenAIRE]

Erkan, G., & Radev, D. R. (2004a). Lexpagerank: Prestige in multi-document text summarization. In Lin, D., & Wu, D. (Eds.), Proceedings of EMNLP 2004, pp. 365-371 Barcelona, Spain. Association for Computational Linguistics.

Erkan, G., & Radev, D. R. (2004b). The University of Michigan at DUC 2004. In Proceedings of the Document Understanding Conferences Boston, MA.

Hatzivassiloglou, V., Klavans, J., Holcombe, M., Barzilay, R., Kan, M., & McKeown, K. (2001). Simfinder: A flexible clustering tool for summarization.. [OpenAIRE]

Jing, H. (2002). Using hidden markov modeling to decompose Human-Written summaries. CL, 28 (4), 527-543.

Knight, K., & Marcu, D. (2000). Statistics-based summarization - step one: Sentence compression. In Proceeding of The 17th National Conference of the American Association for Artificial Intelligence (AAAI-2000), pp. 703-710.

37 references, page 1 of 3
Abstract
We introduce a stochastic graph-based method for computing relative importance of textual units for Natural Language Processing. We test the technique on the problem of Text Summarization (TS). Extractive TS relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Our system, based on LexRank ranked in first place in more than one task in the recent DUC 2004 evaluation. In this paper we present a detailed analysis of our approach and apply it to a larger data set including data from earlier DUC evaluations. We discuss several methods to compute centrality using the similarity graph. The results show that degree-based methods (including LexRank) outperform both centroid-based methods and other systems participating in DUC in most of the cases. Furthermore, the LexRank with threshold method outperforms the other degree-based techniques including continuous LexRank. We also show that our approach is quite insensitive to the noise in the data that may result from an imperfect topical clustering of documents.
Persistent Identifiers
Subjects
ACM Computing Classification System: ComputingMethodologies_DOCUMENTANDTEXTPROCESSING
free text keywords: Computer Science - Computation and Language, Artificial Intelligence, Adjacency matrix, Sentence, Graph (abstract data type), Cluster analysis, Computer science, Graph, Centrality, Ranking, Natural language processing, computer.software_genre, computer, Cosine similarity, Artificial intelligence, business.industry, business, Eigenvector centrality, Automatic summarization
Related Organizations
Communities
Communities with gateway
OpenAIRE Connect image
Funded by
NSF| Probabilistic and link-based Methods for Exploiting Very Large Textual Repositories
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 0329043
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Information and Intelligent Systems
37 references, page 1 of 3

Barzilay, R., & Elhadad, M. (1999). Using Lexical Chains for Text Summarization. In Mani, I., & Maybury, M. T. (Eds.), Advances in Automatic Text Summarization, pp. 111-121. The MIT Press.

Barzilay, R., & Lee, L. (2003). Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of HLT-NAACL.

Baxendale, P. (1958). Man-made index for technical litterature - an experiment. IBM J. Res. Dev., 2 (4), 354-361.

Brandow, R., Mitze, K., & Rau, L. F. (1995). Automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31 (5), 675-685. [OpenAIRE]

Brew, C., & im Walde, S. S. (2002). Spectral clustering for german verbs. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30 (1-7), 107-117.

Carbonell, J. G., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Research and Development in Information Retrieval, pp. 335-336.

Collins, M. (1997). Three generative, lexicalised models for statistical parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.

Daum´e III, H., & Marcu, D. (2004). A phrase-based hmm approach to document/abstract alignment. In Lin, D., & Wu, D. (Eds.), Proceedings of EMNLP 2004, pp. 119-126 Barcelona, Spain. Association for Computational Linguistics.

Edmundson, H. (1969). New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 16 (2), 264-285. [OpenAIRE]

Erkan, G., & Radev, D. R. (2004a). Lexpagerank: Prestige in multi-document text summarization. In Lin, D., & Wu, D. (Eds.), Proceedings of EMNLP 2004, pp. 365-371 Barcelona, Spain. Association for Computational Linguistics.

Erkan, G., & Radev, D. R. (2004b). The University of Michigan at DUC 2004. In Proceedings of the Document Understanding Conferences Boston, MA.

Hatzivassiloglou, V., Klavans, J., Holcombe, M., Barzilay, R., Kan, M., & McKeown, K. (2001). Simfinder: A flexible clustering tool for summarization.. [OpenAIRE]

Jing, H. (2002). Using hidden markov modeling to decompose Human-Written summaries. CL, 28 (4), 527-543.

Knight, K., & Marcu, D. (2000). Statistics-based summarization - step one: Sentence compression. In Proceeding of The 17th National Conference of the American Association for Artificial Intelligence (AAAI-2000), pp. 703-710.

37 references, page 1 of 3
Any information missing or wrong?Report an Issue