KR-BERT: A Small-Scale Korean-Specific Language Model

Publication type: Article, Preprint

Date: 01 Jan 2020

Embargo end date: 01 Jan 2020

Publisher: arXiv

Journal: CoRR, volume abs/2008.03979

Authors: Sangah Lee; Hansol Jang; Yunmee Baik; Suzi Park; Hyopil Shin

doi: 10.48550/arxiv.2008.03979

arXiv: 2008.03979

Abstract

Since the appearance of BERT, recent works including XLNet and RoBERTa have used sentence embedding models with a large number of parameters, pre-trained on large corpora. Because such models require extensive hardware and a huge amount of data, they take a long time to pre-train. It is therefore important to attempt to build smaller models that perform comparably. In this paper, we train a Korean-specific model, KR-BERT, using a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin script, it is also important to capture language-specific linguistic phenomena that the Multilingual BERT model misses. We tested several tokenizers, including our BidirectionalWordPiece Tokenizer, and varied the minimal span of tokens from the sub-character level to the character level to construct a better vocabulary for our model. With those adjustments, our KR-BERT model performed comparably to, and even better than, other existing pre-trained models while using a corpus about 1/10 of the size.
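The difference between character-level and sub-character-level token spans mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "sub-character" refers to the jamo (consonant and vowel letters) that make up each Hangul syllable, and it uses Unicode NFD normalization to obtain them. The function names `char_units` and `sub_char_units` are hypothetical, and this is not the paper's tokenizer; the decomposition actually used for KR-BERT may differ.

```python
import unicodedata


def char_units(text):
    """Character-level units: one unit per precomposed Hangul syllable."""
    # NFC keeps (or recomposes) each syllable block as a single code point.
    return list(unicodedata.normalize("NFC", text))


def sub_char_units(text):
    """Sub-character units: each syllable split into conjoining jamo."""
    # NFD canonically decomposes a Hangul syllable into its leading
    # consonant, vowel, and optional trailing consonant.
    return list(unicodedata.normalize("NFD", text))


word = "한국어"  # "Korean language", three syllables
print(len(char_units(word)), len(sub_char_units(word)))  # 3 8
```

A vocabulary built over the jamo sequence starts from a much smaller base alphabet than one built over syllables, at the cost of longer input sequences. That is the kind of trade-off the minimal-span adjustment explores.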

7 pages

Related Organizations

Seoul National University
Korea (Republic of)

Keywords

FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)

Related software on GitHub: KoBART, finBERT, bert-japanese

Impact by BIP!

selected citations: 1 (derived from selected sources; an alternative to the "Influence" indicator)
popularity: Average (the "current" impact or attention of the article in the research community at large, based on the underlying citation network)
influence: Average (the overall impact of the article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering