Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys

This work describes the process of creating a corpus suitable for evaluating computer keyboard layouts optimised for typing English and computer program code. After suitable texts are sourced, sampled and cleaned, they are processed to extract bigrams, which are then used to generate sample input texts of a desired length. Although they look random, these texts have a character distribution and letter sequence closely matching either English or computer programs, which makes them well suited to evaluating keyboard layouts. Corpus analysis is included.
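The bigram step described above can be illustrated with a short sketch. This is not the author's code, only a minimal illustration of the general technique (a first-order character chain). The function names `count_bigrams` and `generate`, the `seed` parameter and the sample string are all assumptions made for this example, and the real corpus processing may differ in its details.

```python
import random
from collections import Counter, defaultdict


def count_bigrams(text):
    """Count, for each character, how often each other character follows it."""
    followers = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        followers[first][second] += 1
    return followers


def generate(followers, length, seed=None):
    """Walk the bigram chain to produce a text of `length` characters.

    Each next character is drawn with probability proportional to how often
    it followed the current character in the source text.
    """
    if length <= 0 or not followers:
        return ""
    rng = random.Random(seed)
    starts = sorted(followers)  # sorted so a fixed seed gives a fixed result
    current = rng.choice(starts)
    out = [current]
    while len(out) < length:
        options = followers.get(current)
        if not options:
            # Assumption: on a dead end (a character that never had a
            # follower in the source), restart from a random character.
            # This can introduce one pair that is not in the source.
            current = rng.choice(starts)
        else:
            chars = list(options)
            weights = [options[c] for c in chars]
            current = rng.choices(chars, weights=weights, k=1)[0]
        out.append(current)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical stand-in for a cleaned corpus.
    sample = "the quick brown fox jumps over the lazy dog the"
    table = count_bigrams(sample)
    print(generate(table, 60, seed=1))
```

The output looks like nonsense, yet every adjacent pair of characters in it occurs in the source text, and pairs appear roughly in proportion to their source frequency. That is the property that makes such generated texts usable as typing input for layout evaluation.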

Related data files are included, but the actual corpora are not, to avoid any copyright issues.

28 March 2021, 1.0.0: Initial version.
29 March 2021, 1.0.1: Added 4-, 5-, 6-, 7-, 8- and 9-grams; made tables more compact; added Appendix A.
12 September 2021, 1.0.2: Changed .csv files in dataset to .txt and .ods versions for better spreadsheet compatibility; included 7-, 8- and 9-grams in dataset.

Keywords

English text corpus, computer code corpus, English letter frequency, computer program character frequency, bigram frequency, letter follows letter probability, letter precedes letter probability, keyboard layout, keyboard layout evaluation.
