Powered by OpenAIRE graph

Automatic classification of semi-structured documents used in research and education

Abstract

Educational institutions use many semi-structured documents every day in their research and teaching activities. One way to process such documents uniformly is to work with their metadata rather than with the documents themselves. With a large number of semi-structured documents, however, this approach is effective only if there is a mechanism for automatically extracting metadata from document content that is efficient in its use of computational resources. Such a mechanism can be divided into three stages: identifying the class of a document; clustering the documents whose class could not be identified; and extracting metadata from documents of an already known class. This paper addresses the first stage, the automatic classification of semi-structured documents. It introduces the notion of a semi-structured document, presents criteria for evaluating the effectiveness of classification methods, and compares methods against the first five criteria. To evaluate two additionally developed criteria, three methods were implemented: multilayer neural networks, the Rocchio algorithm, and k-nearest neighbors. The analysis shows that neural networks are the most effective for this task in terms of the accuracy-to-speed ratio, but classification accuracy on semi-structured documents remains insufficient. The authors hypothesize that the accuracy of the methods can be improved by using not only keywords but also the known structure of the document during classification.
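To make two of the methods named in the abstract concrete, here is a minimal sketch of keyword-based Rocchio (nearest-centroid) and k-nearest-neighbor classification. It is not the authors' implementation: the term-frequency vectors, the cosine similarity, the value of k, the function names, and the toy training set with its `syllabus` and `review` classes are all assumptions made for illustration. The multilayer neural network is left out.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def vectorize(text):
    """Unit-length term-frequency vector over whitespace-separated keywords."""
    return normalize(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rocchio_train(samples):
    """Build one centroid per class from (text, label) pairs."""
    sums = {}
    for text, label in samples:
        acc = sums.setdefault(label, Counter())
        for term, weight in vectorize(text).items():
            acc[term] += weight
    # Normalizing the sum gives the same direction as the class mean.
    return {label: normalize(acc) for label, acc in sums.items()}


def rocchio_classify(centroids, text):
    """Assign the class whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def knn_classify(samples, text, k=3):
    """Majority vote among the k training documents most similar to the document."""
    vec = vectorize(text)
    ranked = sorted(samples, key=lambda s: cosine(vec, vectorize(s[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set; the class names are illustrative only.
TRAIN = [
    ("course syllabus lecture hours credits semester", "syllabus"),
    ("syllabus module lecture seminar hours exam", "syllabus"),
    ("thesis review reviewer opinion defense grade", "review"),
    ("review of thesis reviewer remarks defense", "review"),
]

if __name__ == "__main__":
    centroids = rocchio_train(TRAIN)
    query = "lecture hours and exam for the semester"
    print("Rocchio:", rocchio_classify(centroids, query))
    print("k-NN:", knn_classify(TRAIN, query, k=3))
```

The sketch also hints at the speed side of the trade-off the abstract mentions: Rocchio compares a document with one centroid per class, while k-nearest neighbors compares it with every training document.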

Keywords

SEMI-STRUCTURED DOCUMENTS, METADATA, AUTOMATIC EXTRACTION, AUTOMATIC
