publication . Conference object . Preprint . 2015

Splitting Compounds by Semantic Analogy

Daiber, Joachim; Quiroz, Lautaro; Wechsler, Roger; Frank, Stella
Open Access English
  • Published: 15 Sep 2015
  • Publisher: Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
  • Country: Germany
Abstract
Compounding is a highly productive word-formation process in some languages that is often problematic for natural language processing applications. In this paper, we investigate whether distributional semantics in the form of word embeddings can enable a deeper, i.e., more knowledge-rich, processing of compounds than the standard string-based methods. We present an unsupervised approach that exploits regularities in the semantic vector space (based on analogies such as "bookshop is to shop as bookshelf is to shelf") to produce compound analyses of high quality. A subsequent compound splitting algorithm based on these analyses is highly effective, particularly fo...
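The analogy named in the abstract ("bookshop is to shop as bookshelf is to shelf") can be illustrated with a small vector-offset sketch. This is not the authors' implementation. The three-dimensional vectors are invented for illustration, the function names (offset, cosine, analogy_score) are hypothetical, and the word "bishop" is an added example of a string that ends in "shop" without being a compound. The sketch only shows the general idea that a candidate split can be scored by how closely its embedding offset matches the offset of a known compound pair.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# Real embeddings would be trained on a large corpus.
TOY_VECTORS = {
    "shop":      [1.0, 0.0, 0.0],
    "shelf":     [0.0, 1.0, 0.0],
    "bookshop":  [1.0, 0.0, 1.0],
    "bookshelf": [0.0, 1.0, 1.0],
    "bishop":    [0.2, 0.1, -0.9],  # ends in "shop" but is not a compound
}


def offset(compound, head, vectors):
    """Return the difference vector compound - head."""
    return [c - h for c, h in zip(vectors[compound], vectors[head])]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_score(compound, head, proto_compound, proto_head, vectors):
    """Score how well 'compound is to head' matches 'proto_compound is to proto_head'."""
    return cosine(
        offset(compound, head, vectors),
        offset(proto_compound, proto_head, vectors),
    )


# "bookshelf : shelf" should mirror the prototype pair "bookshop : shop".
good = analogy_score("bookshelf", "shelf", "bookshop", "shop", TOY_VECTORS)
# "bishop : shop" matches on the string level only, so its offset should differ.
bad = analogy_score("bishop", "shop", "bookshop", "shop", TOY_VECTORS)

print(f"bookshelf/shelf vs. bookshop/shop: {good:.3f}")
print(f"bishop/shop     vs. bookshop/shop: {bad:.3f}")
```

With these toy vectors the genuine compound scores higher than the string-level false match, which is the kind of signal a purely string-based splitter cannot use.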
Subjects
free text keywords: Computer Science - Computation and Language
Funded by
QT21: Quality Translation 21
  • Funder: European Commission (EC)
  • Project Code: 645452
  • Funding stream: H2020 | RIA
19 references, page 1 of 2

[Cap et al.2014] Fabienne Cap, Alexander Fraser, Marion Weller, and Aoife Cahill. 2014. How to produce unseen teddy bears: Improved morphological processing of compounds in SMT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

[Denkowski and Lavie2014] Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

[Federico et al.2008] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: An open source toolkit for handling large scale language models. In Proceedings of Interspeech 2008 - 9th Annual Conference of the International Speech Communication Association.

[Fraser et al.2012] Alexander Fraser, Marion Weller, Aoife Cahill, and Fabienne Cap. 2012. Modeling inflection and word-formation in SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

[Fritzinger and Fraser2010] Fabienne Fritzinger and Alexander Fraser. 2010. How to avoid burning ducks: Combining linguistic analysis and corpus statistics for German compound processing. In Proceedings of the ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics (MATR).

[Henrich and Hinrichs2011] Verena Henrich and Erhard W. Hinrichs. 2011. Determining immediate constituents of compounds in GermaNet. In Proceedings of the International Conference on Recent Advances in Natural Language Processing 2011.

[Koehn and Knight2003] Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

[Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL).

[Koehn2004] Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 9th Conference on Empirical Methods in Natural Language Processing (EMNLP).

[Lieber and Štekauer2009] Rochelle Lieber and Pavol Štekauer. 2009. The Oxford handbook of compounding. Oxford University Press.

[Mikolov et al.2013] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

[Nießen and Ney2000] Sonja Nießen and Hermann Ney. 2000. Improving SMT quality with morpho-syntactic analysis. In Proceedings of the 18th International Conference on Computational Linguistics (COLING).

[Och and Ney2003] Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

[Popović et al.2006] Maja Popović, Daniel Stein, and Hermann Ney. 2006. Statistical machine translation of German compound words. In Proceedings of FinTal - 5th International Conference on Natural Language Processing.
