publication . Other literature type . Preprint . Conference object . 2017

Character-based Neural Embeddings for Tweet Clustering.

Svitlana Vakulenko; Lyndon Nixon; Mihai Lupu
  • Published: 03 Apr 2017
  • Publisher: Association for Computational Linguistics (ACL)
Abstract
In this paper we show how the performance of tweet clustering can be improved by leveraging character-based neural networks. The proposed approach overcomes the limitations caused by vocabulary explosion in word-based models and allows for seamless processing of multilingual content. Our evaluation results and code are available online at https://github.com/vendi12/tweet2vec_clustering
Subjects
free text keywords: Story Detection, Tweet Clustering, Tweet2vec, Vector Space Model, Character-based Embedding, Computer Science - Information Retrieval, Computer Science - Computation and Language, Computer science, Pattern recognition, Cluster analysis, Artificial intelligence
Funded by
EC| InVID
Project
InVID
In Video Veritas – Verification of Social Media Video Content for the News Industry
  • Funder: European Commission (EC)
  • Project Code: 687786
  • Funding stream: H2020 | IA
FWF| Abstracting Domain-Specific Information Retrieval and Evaluation (ADmIRE)
Project
  • Funder: Austrian Science Fund (FWF)
  • Project Code: P 25905
  • Funding stream: Einzelprojekte
Download from
  • Zenodo (Other literature type, 2017; Provider: Datacite)
  • ZENODO (Conference object, 2017; Provider: ZENODO)