Rank join queries in NoSQL databases
- Publisher: VLDB Endowment Inc.
Rank (i.e., top-k) join queries play a key role in modern analytics\ud tasks. However, despite their importance and unlike\ud centralized settings, they have been completely overlooked\ud in cloud NoSQL settings. We attempt to fill this gap: We\ud contribute a suite of solutions and study their performance\ud comprehensively. Baseline solutions are ordered using SQLlike\ud languages (like Hive and Pig), based on MapReduce\ud jobs. We first provide solutions that are based on specialized\ud indices, which may themselves be accessed using either\ud MapReduce or coordinator-based strategies. The first\ud index-based solution is based on inverted indices, which are\ud accessed with MapReduce jobs. The second index-based\ud solution adapts a popular centralized rank-join algorithm.\ud We further contribute a novel statistical structure comprising\ud histograms and Bloom filters, which forms the basis for\ud the third index-based solution. We provide (i) MapReduce\ud algorithms showing how to build these indices and statistical\ud structures, (ii) algorithms to allow for online updates to\ud these indices, and (iii) query processing algorithms utilizing\ud them. We implemented all algorithms in Hadoop (HDFS)\ud and HBase and tested them on TPC-H datasets of various\ud scales, utilizing different queries on tables of various sizes\ud and different score-attribute distributions. We ported our\ud implementations to Amazon EC2 and "in-house" lab clusters\ud of various scales. We provide performance results for\ud three metrics: query execution time, network bandwidth\ud consumption, and dollar-cost for query execution.