Named Entity Recognition Using Support Vector Machine: A Language Independent Approach

Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is nowadays considered fundamental to many Natural Language Processing (NLP) tasks, such as information retrieval, machine translation, information extraction, and question answering. This paper reports on the development of an NER system for Bengali and Hindi using a Support Vector Machine (SVM). Although this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its application to Indian languages (ILs) is very new. The system uses the contextual information of the words, along with a variety of features that are helpful in predicting four named entity (NE) classes: Person name, Location name, Organization name, and Miscellaneous name. We have used annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi, tagged with the twelve NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). In addition, we have manually annotated 150K wordforms of a Bengali news corpus developed from the web archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm to generate lexical context patterns from a part of the unlabeled Bengali news corpus. These lexical patterns are used as SVM features to improve system performance. The NER system has been tested on gold standard test sets of 35K tokens for Bengali and 60K tokens for Hindi. Evaluation yields recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. The use of context patterns improves the f-score by 5.13%. A statistical analysis (ANOVA) is also performed to compare the performance of the proposed NER system with that of the existing HMM-based system for both languages.
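
The abstract does not state how the f-score is computed. As a minimal sketch, assuming it is the balanced F-measure (the harmonic mean of precision and recall), the reported figures can be checked as follows; the function name `f_score` and the `reported` dictionary are illustrative, not taken from the paper.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: this is the definition behind the reported f-scores;
    the abstract itself does not give the formula. Inputs are percentages.
    """
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recall and precision (in percent) as reported in the abstract.
reported = {
    "Bengali": (88.61, 80.12),
    "Hindi": (80.23, 74.34),
}

for language, (recall, precision) in reported.items():
    print(f"{language}: f-score = {f_score(recall, precision):.2f}%")
```

Under this assumption the sketch gives 84.15% for Bengali and 77.17% for Hindi, which agrees with the values reported above.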

Keywords

Named Entity Recognition (NER), Named Entity (NE), Support Vector Machine (SVM), Bengali, Hindi
