Abstract: Word embeddings have proven to be an effective method for capturing semantic relations among distinct terms within a large corpus. In this paper, we present a set of word embeddings learnt from three large Lebanese news archives, which collectively consist of 609,386 scanned newspaper images and span a total of 151 years, ranging from 1933 to 2011. To train the word embeddings, Google’s Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and archive-level as well as decade-level word embeddings were learnt. To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used.

Folder Navigation: The two zipped folders are models and evaluations.

The models folder contains three subdirectories: assafir_models, hayat_models, and nahar_models. Each directory corresponds to one news archive. The contents of these directories are decade-level and archive-level Word2Vec (CBOW) models, named [min year]_[max year].model for each archive. Each model has an accompanying [min year]_[max year].txt file, which lists the filenames of the transcribed documents used to train that model and ends with the set of years covered and the count of documents used.

The evaluations folder contains three xls files and three text files. Each xls file is a workbook containing several spreadsheets; each spreadsheet holds the evaluation of one trained model across all the relations of the benchmark file, together with a total accuracy. The spreadsheet names also take the form [min year]_[max year]. The three text files are the log files generated during evaluation and are named logger_[archive_name].txt.
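The naming convention described under Folder Navigation can be handled programmatically. The following is a minimal sketch, assuming the extracted layout is models/<archive>_models/[min year]_[max year].model; the helper names `parse_model_path` and `companion_txt` and the example year range 1980_1989 are hypothetical and chosen only for illustration.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple

# Archive directory names as listed in the folder description.
ARCHIVE_DIRS = ("assafir_models", "hayat_models", "nahar_models")

# Matches names such as "1980_1989.model" or "1980_1989.txt".
_NAME_RE = re.compile(r"^(\d{4})_(\d{4})\.(model|txt)$")


class ModelFile(NamedTuple):
    archive: str   # parent directory, e.g. "nahar_models"
    min_year: int  # first year covered by the model
    max_year: int  # last year covered by the model
    kind: str      # "model" or "txt"


def parse_model_path(path: str) -> ModelFile:
    """Split a path of the form <archive>_models/<min>_<max>.<ext> into its parts."""
    p = PurePosixPath(path)
    match = _NAME_RE.match(p.name)
    if match is None:
        raise ValueError(f"unexpected file name: {p.name}")
    archive = p.parent.name
    if archive not in ARCHIVE_DIRS:
        raise ValueError(f"unexpected archive directory: {archive}")
    min_year, max_year = int(match.group(1)), int(match.group(2))
    if min_year > max_year:
        raise ValueError(f"min year {min_year} is after max year {max_year}")
    return ModelFile(archive, min_year, max_year, match.group(3))


def companion_txt(model_path: str) -> str:
    """Return the path of the .txt document list that accompanies a .model file."""
    return str(PurePosixPath(model_path).with_suffix(".txt"))


if __name__ == "__main__":
    # Hypothetical example path; the actual year ranges depend on the archive.
    example = "models/nahar_models/1980_1989.model"
    print(parse_model_path(example))
    print(companion_txt(example))
```

The description does not name the library used to save the models. If they were saved with gensim, `gensim.models.Word2Vec.load(path)` would be the usual way to open a .model file, but that is an assumption rather than something the record states.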
word embeddings, optical character recognition, lebanese news archives
| Indicator | Description | Value |
| --- | --- | --- |
| selected citations | These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 |
| popularity | This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| influence | This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| impulse | This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
| views | | 67 |
| downloads | | 27 |

Views provided by UsageCounts
Downloads provided by UsageCounts