Powered by OpenAIRE graph
Found an issue? Give us feedback
Open Data LMUarrow_drop_down
Open Data LMU
Doctoral thesis . 2007
Data sources: Datacite
DBLP
Doctoral thesis
Data sources: DBLP
versions View all 2 versions
addClaim

Graph Kernels

Authors: Borgwardt, Karsten Michael;
Abstract

As new graph structured data is constantly being generated, learning and data mining on graphs have become a challenge in application areas such as molecular biology, telecommunications, chemoinformatics, and social network analysis. The central algorithmic problem in these areas, measuring similarity of graphs, has therefore received extensive attention in the recent past. Unfortunately, existing approaches are slow, lacking in expressivity, or hard to parameterize. Graph kernels have recently been proposed as a theoretically sound and promising approach to the problem of graph comparison. Their attractivity stems from the fact that by defining a kernel on graphs, a whole family of data mining and machine learning algorithms becomes applicable to graphs. These kernels on graphs must respect both the information represented by the topology and the node and edge labels of the graphs, while being efficient to compute. Existing methods fall woefully short; they miss out on important topological information, are plagued by runtime issues, and do not scale to large graphs. Hence the primary goal of this thesis is to make learning and data mining with graph kernels feasible. In the first half of this thesis, we review and analyze the shortcomings of state-of-the-art graph kernels. We then propose solutions to overcome these weaknesses. As highlights of our research, we - speed up the classic random walk graph kernel from O(n^6) to O(n^3), where n is the number of nodes in the larger graph, and by a factor of up to 1,000 in CPU runtime, by extending concepts from Linear Algebra to Reproducing Kernel Hilbert Spaces, - define novel graph kernels based on shortest paths that avoid tottering and outperform random walk kernels in accuracy, - define novel graph kernels that estimate the frequency of small subgraphs within a large graph and that work on large graphs hitherto not handled by existing graph kernels. In the second half of this thesis, we present algorithmic solutions to two novel problems in graph mining. First, we define a two-sample test on graphs. Given two sets of graphs, or a pair of graphs, this test lets us decide whether these graphs are likely to originate from the same underlying distribution. To solve this so-called two-sample-problem, we define the first kernel-based two-sample test. Combined with graph kernels, this results in the first two-sample test on graphs described in the literature. Second, we propose a principled approach to supervised feature selection on graphs. As in feature selection on vectors, feature selection on graphs aims at finding features that are correlated with the class membership of a graph. Towards this goal, we first define a family of supervised feature selection algorithms based on kernels and the Hilbert-Schmidt Independence Criterion. We then show how to extend this principle of feature selection to graphs, and how to combine it with gSpan, the state-of-the-art method for frequent subgraph mining. On several benchmark datasets, our novel procedure manages to select a small subset of dozens of informative features among thousands and millions of subgraphs detected by gSpan. In classification experiments, the features selected by our method outperform those chosen by other feature selectors in terms of classification accuracy. Along the way, we also solve several problems that can be deemed contributions in their own right: - We define a unifying framework for describing both variants of random walk graph kernels proposed in the literature. - We present the first theoretical connection between graph kernels and molecular descriptors from chemoinformatics. - We show how to determine sample sizes for estimating the frequency of certain subgraphs within a large graph with a given precision and confidence, which promises to be a key to the solution of important problems in data mining and bioinformatics. Three branches of computer science immediately benefit from our findings: data mining, machine learning, and bioinformatics. For data mining, our efficient graph kernels allow us to bring to bear the large family of kernel methods to mining problems on real-world graph data. For machine learning, we open the door to extend strong theoretical results on learning on graphs into useful practical applications. For bioinformatics, we make a number of principled kernel methods and efficient kernel functions available for biological network comparison, and structural comparisons of proteins. Apart from these three areas, other fields may also benefit from our findings, as our algorithms are general in nature and not restricted to a particular type of application.

Country
Germany
Keywords

ddc:510, ddc:500, Graph Kernels, Graph Comparison, Kernel Methods, Feature Selection, Two-Sample-Test, Informatik und Statistik, FOS: Mathematics, 500, Fakultät für Mathematik, 004, 510

  • BIP!
    Impact byBIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average
Powered by OpenAIRE graph
Found an issue? Give us feedback
selected citations
These citations are derived from selected sources.
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
0
Average
Average
Average
Upload OA version
Are you the author of this publication? Upload your Open Access version to Zenodo!
It’s fast and easy, just two clicks!