PatCluster: A Top-Down Log Parsing Method Based on Frequent Words

Publication: Article, Other literature type. 01 Jan 2023. English.
Publisher: Elsevier BV
Journal: SSRN Electronic Journal (eISSN: 1556-5068)

Authors: Yu Bai; Yongwei Chi; Dan Zhao

doi: 10.2139/ssrn.5144093, 10.1109/access.2023.3239012, 10.60692/et4kt-h6883, 10.60692/m6sya-z4n83

Abstract

Logs combine static message-type fields with dynamic variable fields, and the accuracy of log parsing affects the results of subsequent log analysis tasks. To address this, an offline log parsing method based on frequent words, PatCluster, is introduced. The method first generates root nodes by preprocessing. It then counts word frequencies and extracts the most frequent word as the segmentation condition to refine the template generated by the root node. This step is applied recursively: pattern nodes are formed for all elements of the nodes, and the corresponding templates are generated, which achieves log pattern mining. The log patterns are mined from coarse to fine, which relies on fewer assumptions, and the pattern fitting depth can be controlled by adjusting the termination condition. In the optimized algorithm model, we also consider the maximum extent to which the log template matches the tokens in the log message. The experimental results show that this method effectively improves log parsing quality, achieves higher log parsing accuracy than other methods, and is better suited to logs with complex structures.
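
The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the preprocessing, tie-breaking, or termination rules, so the choices below are assumptions. Root nodes are formed by grouping messages by token count, a word is eligible as a split condition only if it occurs in at least `min_support` messages but not in all of them, and recursion stops at `max_depth` or when no eligible word remains. The names `parse`, `split_node`, `make_template`, `min_support`, and `max_depth` are hypothetical.

```python
from collections import Counter, defaultdict

WILDCARD = "<*>"


def make_template(messages):
    """Keep tokens that every message shares at a position; mask the others."""
    return " ".join(
        column[0] if len(set(column)) == 1 else WILDCARD
        for column in zip(*messages)
    )


def split_node(messages, min_support, max_depth, depth=0):
    """Recursively split a node on its most frequent non-universal word."""
    # Count each word once per message, so frequency = number of messages.
    counts = Counter(word for msg in messages for word in set(msg))
    # Assumption: a split word must be frequent but must not cover the node.
    candidates = [
        (count, word)
        for word, count in counts.items()
        if min_support <= count < len(messages)
    ]
    # Assumed termination condition: depth limit or no usable split word.
    if depth >= max_depth or not candidates:
        return [(make_template(messages), len(messages))]
    # Highest count wins; ties are broken alphabetically for determinism.
    _, word = min(candidates, key=lambda c: (-c[0], c[1]))
    with_word = [m for m in messages if word in m]
    without_word = [m for m in messages if word not in m]
    return (
        split_node(with_word, min_support, max_depth, depth + 1)
        + split_node(without_word, min_support, max_depth, depth + 1)
    )


def parse(lines, min_support=2, max_depth=10):
    """Return (template, message_count) pairs for a list of raw log lines."""
    # Assumed preprocessing: whitespace tokenization, one root per length.
    roots = defaultdict(list)
    for line in lines:
        tokens = line.split()
        if tokens:
            roots[len(tokens)].append(tokens)
    templates = []
    for length in sorted(roots):
        templates.extend(split_node(roots[length], min_support, max_depth))
    return templates


if __name__ == "__main__":
    sample = [
        "Connection from 10.0.0.1 closed",
        "Connection from 10.0.0.2 closed",
        "Connection from 10.0.0.3 opened",
        "Connection from 10.0.0.4 opened",
        "Disk sda1 is full",
        "Disk sdb1 is full",
    ]
    for template, count in parse(sample):
        print(count, template)
```

Lowering `max_depth` in this sketch yields coarser templates, which mirrors the abstract's point that the fitting depth is controlled by the termination condition. With `max_depth=0` every root node is returned as a single template.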

Related Organizations

Zhejiang Ocean University
China (People's Republic of)
Hebei University
China (People's Republic of)
Peking University
China (People's Republic of)

Keywords

FOS: Computer and information sciences, Sequential Patterns, Artificial intelligence, Pattern recognition (psychology), Segmentation, Engineering, Computer security, offline algorithm, PatCluster, Pattern matching, Automated Software Testing Techniques, Statistics, FOS: Philosophy, ethics and religion, Algorithm, Frequent Patterns, frequent words, Log Analysis and System Performance Diagnosis, Physical Sciences, Matching (statistics), Electrical engineering. Electronics. Nuclear engineering, Information Systems, Text segmentation, System Logs, Computer Networks and Communications, Word (group theory), Geometry, Structural engineering, Node (physics), Binary logarithm, Mathematical analysis, Log parsing, Data Mining Techniques and Applications, Log-log plot, Root (linguistics), FOS: Mathematics, Data mining, Preprocessor, Log Analysis, Parsing, Linguistics, Computer science, TK1-9971, Process (computing), Philosophy, Operating system, Security token, Computer Science, FOS: Languages and literature, Software, Mathematics

Impact by BIP!

selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: gold

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Related to Research communities

Knowmad Institut