Clustering of Web Documents with Structure of Webpages based on the HTML Document Object Model

Publication: Article, Other literature type. Date: 01 Apr 2019. Publisher: IEEE. Journal: 2019 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS)

Authors: Manoj Kumar Sarma; Anjana Kakoti Mahanta

doi: 10.1109/incos45849.2019.8951405


Abstract

Web mining is an emerging area of data mining that uses various techniques to explore hidden patterns available within the WWW. Clustering has significant applications in Web mining, particularly in grouping Webpages based on their properties. The literature suggests that clustering of Webpages is generally based on page content, and therefore relies on text mining techniques only. However, unlike ordinary text documents, Webpages are structured documents, so there is scope to explore whether their structural properties have any impact on clustering. This paper applies clustering to Web documents based on the DOM structure of Webpages: the HTML DOM structure of each Webpage is represented as a string of characters, and K-means clustering is then applied to the string representations. The same algorithm has been applied with four different distance measures on four different datasets. The clustering output in each case has been evaluated and the results have been compared.
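The string representation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the encoding or name the four distance measures, so everything here is an assumption made for illustration: the helper names `dom_to_string` and `edit_distance` are hypothetical, each opening tag is mapped to one character in document order, and Levenshtein edit distance stands in as one plausible string distance. The clustering step itself is not shown.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order (a pre-order DOM walk)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def dom_to_string(html, alphabet):
    """Encode the tag sequence of `html` as a string, one character per tag.

    `alphabet` maps tag name -> character and is extended in place, so all
    pages encoded with the same dict share one mapping (an assumed scheme,
    not necessarily the one used in the paper).
    """
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    chars = []
    for tag in parser.tags:
        if tag not in alphabet:
            # Assign the next unused code point, starting from "A".
            alphabet[tag] = chr(ord("A") + len(alphabet))
        chars.append(alphabet[tag])
    return "".join(chars)


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Under this sketch, pages with similar tag layouts produce strings that are a small edit distance apart regardless of their text content, which is the property a structure-based clustering would rely on.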

Related Organizations

Gauhati University
India

Impact by BIP!

	selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
	influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
