Publication . Doctoral thesis . 2019

Mining complex data and biclustering using formal concept analysis

Juniarta, Nyoman
English
Published: 18 Dec 2019
Publisher: HAL CCSD
Country: France
Abstract

Knowledge discovery in databases (KDD) is a process applied to possibly large volumes of data to discover patterns that can be significant and useful. In this thesis, we are interested in two steps of this process, data transformation and data mining, applied to complex data, and we present several experiments involving different approaches and different data types.

The first part of this thesis focuses on the task of biclustering using formal concept analysis (FCA) and pattern structures. FCA is naturally related to biclustering, where the objective is to simultaneously group rows and columns that satisfy some regularities. Pattern structures are a generalization of FCA that works on more complex data. Partition pattern structures were proposed to discover constant-column biclusters, while interval pattern structures were studied for similar-column biclustering. Here we extend these approaches to enumerate other types of biclusters: additive, multiplicative, order-preserving, and coherent-sign-changes.

The second part of this thesis focuses on two experiments in mining complex data. First, we present a contribution related to the CrossCult project, where we analyze a dataset of visitor trajectories in a museum. We apply sequence clustering and FCA-based sequential pattern mining to discover patterns in the dataset and to classify these trajectories. This analysis can be used within the CrossCult project to build recommendation systems for future visitors. Second, we present our work related to the task of antibacterial drug discovery. The dataset for this task is generally a numerical matrix with molecules as rows and features/attributes as columns. The huge number of features makes molecule classification harder for any classifier. Here we study a feature selection approach based on log-linear analysis, which discovers associations among features.

As a synthesis, this thesis presents a series of different experiments in the mining of complex real-world data.
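The bicluster types named in the abstract can be made concrete with a small sketch. The code below is not taken from the thesis: it only checks whether a given submatrix (a list of rows restricted to the chosen columns) satisfies one common reading of each definition, and it does not enumerate biclusters with FCA or pattern structures. All function names and the tolerance parameter `theta` are hypothetical, and the coherent-sign-changes check assumes the definition "every row goes up or down between consecutive columns in the same way".

```python
# Illustrative checks for bicluster types on a small submatrix.
# Assumption: `rows` is a non-empty list of equal-length lists of numbers.

def columns(rows):
    """Return the columns of the submatrix as tuples."""
    return list(zip(*rows))

def is_constant_column(rows):
    """Every column holds a single repeated value."""
    return all(len(set(col)) == 1 for col in columns(rows))

def is_similar_column(rows, theta):
    """Within every column, values differ by at most theta (hypothetical tolerance)."""
    return all(max(col) - min(col) <= theta for col in columns(rows))

def is_additive(rows):
    """Each row equals the first row plus a row-specific constant."""
    first = rows[0]
    return all(len({x - y for x, y in zip(row, first)}) == 1 for row in rows)

def is_multiplicative(rows):
    """Each row equals the first row times a row-specific nonzero factor."""
    if any(v == 0 for row in rows for v in row):
        return False  # this sketch only handles nonzero entries
    first = rows[0]
    # Cross-multiplication avoids floating-point division.
    return all(
        all(x * first[0] == row[0] * y for x, y in zip(row, first))
        for row in rows
    )

def is_order_preserving(rows):
    """All rows induce the same ordering of the columns (ties broken by column index)."""
    def ranking(row):
        return tuple(sorted(range(len(row)), key=lambda j: row[j]))
    return len({ranking(row) for row in rows}) == 1

def is_coherent_sign_changes(rows):
    """All rows share the same up/down/equal pattern between consecutive columns."""
    def signs(row):
        return tuple((b > a) - (b < a) for a, b in zip(row, row[1:]))
    return len({signs(row) for row in rows}) == 1

# Small hand-made examples (not data from the thesis).
additive_example = [[1, 4, 2], [3, 6, 4], [0, 3, 1]]
multiplicative_example = [[1, 2, 4], [3, 6, 12], [2, 4, 8]]
similar_example = [[5, 1, 7], [6, 1, 8]]
```

In this reading, an additive bicluster is also order-preserving and has coherent sign changes, which is why the looser types can be seen as relaxations of the stricter ones.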

Subjects

sequential pattern mining, biclustering, feature selection, formal concept analysis, [INFO] Computer Science [cs]

Funded by
EC| CROSSCULT
Project
CROSSCULT
CrossCult: Empowering reuse of digital cultural heritage in context-aware crosscuts of European history
  • Funder: European Commission (EC)
  • Project Code: 693150
  • Funding stream: H2020 | IA