The theta-join is a powerful operation that connects tuples of different relational tables based on arbitrary conditions. It is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is expensive, because virtually all database management systems (DBMSs) translate them into a Cartesian product followed by a post-filter that discards non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, Bloom filters, or sorting, do not work for theta-joins with combinations of inequality predicates (for example, those based on <, ≤, ≥, >, or ≠). In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is integrated into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs, including PostgreSQL, and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries.
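To make the cost argument concrete, the following minimal Python sketch contrasts the baseline described above (Cartesian product with a post-filter) with a sort-based shortcut that only applies to a single `<` predicate. It is an illustration under stated assumptions, not the algorithm proposed in the paper or implemented in A2DB: the function names `theta_join_naive` and `less_than_join_sorted`, the tuple layout, and the sample relations are all hypothetical.

```python
from bisect import bisect_right
from itertools import product


def theta_join_naive(left, right, predicate):
    """Baseline: enumerate the Cartesian product, then post-filter.

    Always inspects len(left) * len(right) pairs, regardless of how
    selective the predicate is.
    """
    return [(l, r) for l, r in product(left, right) if predicate(l, r)]


def less_than_join_sorted(left, right, key_left, key_right):
    """Illustrative shortcut for ONE predicate: key_left(l) < key_right(r).

    Sorting the right side lets a binary search skip every right tuple
    that cannot match. This trick does not carry over directly to
    combinations of several inequality predicates.
    """
    right_sorted = sorted(right, key=key_right)
    right_keys = [key_right(r) for r in right_sorted]
    result = []
    for l in left:
        # Index of the first right tuple whose key is strictly greater.
        start = bisect_right(right_keys, key_left(l))
        result.extend((l, r) for r in right_sorted[start:])
    return result


# Hypothetical sample relations: (id, value) tuples.
left_rel = [("a", 5), ("b", 12), ("c", 20)]
right_rel = [("x", 10), ("y", 15), ("z", 3)]

naive = theta_join_naive(left_rel, right_rel, lambda l, r: l[1] < r[1])
pruned = less_than_join_sorted(
    left_rel, right_rel, key_left=lambda l: l[1], key_right=lambda r: r[1]
)
print(sorted(naive) == sorted(pruned))
```

Both functions return the same pairs for this input, but the baseline evaluates the predicate on all nine candidate pairs, whereas the sorted variant only touches the three matching ones. For a highly selective join this gap grows with the product of the relation sizes, which is the motivation for pruning join candidates early.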

Keywords

distributed computing, actor programming, theta-join, query optimization

