Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?

Preprint English OPEN
Zhang, Xiang; LeCun, Yann
  • Subject: Computer Science - Computation and Language | Computer Science - Learning

This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized wo...
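
To make the idea of encoding levels concrete, the following is a minimal sketch of how one text yields different input sequences at the byte, character and word levels. It is an illustration only, not the authors' pipeline: the function name `encoding_levels` is hypothetical, whitespace splitting is a crude stand-in for real word segmentation (which CJK text generally requires), and romanization is left out because it depends on language-specific tools.

```python
def encoding_levels(text: str) -> dict:
    """Return byte-, character- and word-level views of `text`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return {
        # UTF-8 byte level: each CJK character here expands to 3 bytes.
        "utf8_bytes": list(text.encode("utf-8")),
        # Character level: one item per Unicode code point.
        "characters": list(text),
        # Word level: whitespace split, assumed here as a stand-in for
        # a proper segmenter.
        "words": text.split(),
    }


levels = encoding_levels("你好 world")
print(len(levels["utf8_bytes"]))  # 12
print(len(levels["characters"]))  # 8
print(levels["words"])            # ['你好', 'world']
```

The same eight-character string becomes a sequence of 12, 8 or 2 items depending on the level, which is why the choice of encoding changes both the vocabulary size and the sequence length a classifier has to handle.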