
doi: 10.7488/era/6544
This thesis aims to unify heterogeneous data management with a revised relational model that supports uniform querying and updating across data in different formats without performance degradation. To address this well-recognized challenge of extracting value from the variety of big data, the discussion begins with a philosophical reflection, often overlooked yet necessary, on this variety itself: the existence of numerous incompatible data models. We clarify that the various data models are all essentially conceptions of the task of database management, artificially developed under different considerations but with the identical goal of making raw data manageable in the same physical world, rather than Cartesian mirror-images of distinct objective realities. Ignoring this overlooks the identity behind the apparent opposition among the structures and operations these models impose, thereby undermining the ideal of developing a general method for database management. To revive this long-neglected ideal, we identify a pathway toward it through an investigation into the distinctiveness of the relational model that has enabled it to dominate database management for decades. We show that the key characteristics of the relational model directly address the model-independent challenges inherent in this task, enabling systems developed accordingly to perform better; these characteristics are thus indispensable for any data model to succeed in practice, which in turn renders such models also “relational”. This is not meant to establish the relational model as a “codex”, but rather to shed light on the principle of evolving it to keep pace with the continuously progressing task of database management, so that the overall landscape of this task, amid endless emerging challenges, can still be addressed by a single model rather than a combination of heterogeneous ones.
In light of this, as a preliminary step toward unified database management, we provide an overarching framework that uniformly accommodates different standpoints on this task without combining heterogeneous models. Specifically, around a proposed RG (Relational Generative) model, we develop components that make the options within the following aspects configurable: connections can be represented either explicitly or implicitly; consistency can be annotated on individual data items and operations, across distinct isolation levels; and data analysis can be programmed in both the declarative and imperative paradigms. Practitioners then no longer need to turn to an alternative, purpose-built model or system to adopt a particular method of managing their data, at least within these aspects, thereby paving the way toward the unification of heterogeneous data management. More technically, in the first part, we introduce the RG model, an extension of the relational model with logic-level pointers that connect different tuples, representing graphs in relations while preserving their topology. By incorporating graph exploration via these logic-level pointers as an operator, we generalize the relational query evaluation workflow to provide an enlarged, unified plan space for graph-relation hybrid queries. The optimal query plan can then be generated and executed seamlessly on a typical relational database system without a customized execution engine, retaining the battle-hardened optimizations embedded in relational databases while enabling hybrid queries over graphs and relations. In the second part, still concentrating on analytical tasks, we identify redundant computations in query evaluation and introduce a recursive execution paradigm to eliminate them.
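The core idea of the RG model described above can be sketched in miniature: tuples carry logic-level pointers to other tuples, so a graph's topology is preserved inside relations, and following those pointers becomes a query operator. This is only an illustrative sketch, not the thesis's implementation; the names `Tuple` and `explore` are hypothetical.

```python
# Illustrative sketch of the RG model's idea (names are hypothetical):
# rows are ordinary attribute tuples, but each also holds logic-level
# pointers to other tuples, embedding a graph inside a relation.

class Tuple:
    def __init__(self, **attrs):
        self.attrs = attrs
        self.ptrs = []  # logic-level pointers to other tuples

def explore(start, hops):
    """Graph-exploration operator: follow pointers `hops` times and
    return the tuples reached at that depth."""
    frontier = [start]
    for _ in range(hops):
        frontier = [t for u in frontier for t in u.ptrs]
    return frontier

# A tiny "person" relation whose rows also form a follows-graph.
alice = Tuple(name="Alice")
bob = Tuple(name="Bob")
carol = Tuple(name="Carol")
alice.ptrs.append(bob)   # Alice -> Bob
bob.ptrs.append(carol)   # Bob -> Carol

# A hybrid query: two hops of exploration, then relational projection.
names = [t.attrs["name"] for t in explore(alice, 2)]
print(names)  # ['Carol']
```

In the thesis's framing, such an exploration operator joins the ordinary relational plan space, so the optimizer can interleave it with joins and selections rather than delegating to a separate graph engine.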
Namely, we show that different subqueries may be automorphisms of each other, causing redundant computations that can only be eliminated by sharing results among subqueries. This sharing introduces recursive data passing, which breaks the separate execution of each join, even when the queries themselves can be described without recursion. We theoretically prove that this recursive execution paradigm allows query evaluation to reach a complexity lower bound in string pattern matching that was previously unachievable. In the third part, we concentrate on transactional tasks. Building upon the above approach to unifying relations and graphs, we develop a way of running transactions across both. In addition, we propose a fine-grained isolation level that provides data-driven isolation guarantees according to user-defined annotations in each transaction, reflecting the heterogeneity of the data. This allows the data items touched by a single transaction to be treated separately and differently, protecting only the critical logic with relatively high isolation while avoiding unnecessary isolation guarantees, thereby improving transaction throughput without impairing the correctness of application logic under concurrent execution. Finally, as a practical and accessible implementation of the proposed mechanisms, WhiteDB has been developed to assess the benefits they offer on real-world data. It allows the same piece of data to be viewed both as a “data model” and as an “object”, and to be accessed through both the declarative and imperative programming paradigms. Such versatility enables it to function both as a C++ library offering a variety of frequently used functionalities and as an embedded database, thereby supporting integrated multi-paradigm data analysis.
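The kind of redundancy targeted by result sharing among automorphic subqueries can be illustrated with a toy memoization scheme. The renaming-based canonicalization below is a deliberate simplification of a real automorphism analysis, and all names are hypothetical, not the thesis's API:

```python
# Sketch: conjunctive subqueries that are variable renamings of one
# another are mapped to a canonical form and evaluated only once.

evaluations = 0
cache = {}

def canonicalize(atoms):
    """Rename variables in order of first appearance, so subqueries
    differing only in variable names receive the same cache key."""
    mapping = {}
    out = []
    for rel, vs in atoms:
        out.append((rel, tuple(mapping.setdefault(v, f"v{len(mapping)}")
                               for v in vs)))
    return tuple(out)

def eval_shared(atoms):
    """Evaluate a subquery, sharing results across its renaming class."""
    global evaluations
    key = canonicalize(atoms)
    if key not in cache:
        evaluations += 1                     # count real evaluations
        cache[key] = f"answer for {key}"     # stand-in for evaluation
    return cache[key]

q1 = [("R", ("x", "y")), ("S", ("y", "z"))]
q2 = [("R", ("a", "b")), ("S", ("b", "c"))]  # a renaming of q1
q3 = [("R", ("x", "y")), ("T", ("y", "z"))]  # genuinely different

for q in (q1, q2, q3):
    eval_shared(q)
print(evaluations)  # 2: q1 and q2 share one evaluation
```

In the thesis the shared results are additionally passed recursively between joins, which is what breaks join-at-a-time execution; the cache above only shows why sharing pays off.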
heterogeneous data management, recursive query evaluation, analytical query evaluation, transactional query processing, database system
