Rank join queries in NoSQL databases

Article English OPEN
Ntarmos, N. ; Patlakas, I. ; Triantafillou, P. (2014)
  • Publisher: VLDB Endowment Inc.

Rank (i.e., top-k) join queries play a key role in modern analytics\ud tasks. However, despite their importance and unlike\ud centralized settings, they have been completely overlooked\ud in cloud NoSQL settings. We attempt to fill this gap: We\ud contribute a suite of solutions and study their performance\ud comprehensively. Baseline solutions are ordered using SQLlike\ud languages (like Hive and Pig), based on MapReduce\ud jobs. We first provide solutions that are based on specialized\ud indices, which may themselves be accessed using either\ud MapReduce or coordinator-based strategies. The first\ud index-based solution is based on inverted indices, which are\ud accessed with MapReduce jobs. The second index-based\ud solution adapts a popular centralized rank-join algorithm.\ud We further contribute a novel statistical structure comprising\ud histograms and Bloom filters, which forms the basis for\ud the third index-based solution. We provide (i) MapReduce\ud algorithms showing how to build these indices and statistical\ud structures, (ii) algorithms to allow for online updates to\ud these indices, and (iii) query processing algorithms utilizing\ud them. We implemented all algorithms in Hadoop (HDFS)\ud and HBase and tested them on TPC-H datasets of various\ud scales, utilizing different queries on tables of various sizes\ud and different score-attribute distributions. We ported our\ud implementations to Amazon EC2 and "in-house" lab clusters\ud of various scales. We provide performance results for\ud three metrics: query execution time, network bandwidth\ud consumption, and dollar-cost for query execution.
  • References (29)
    29 references, page 1 of 3

    [1] A. Abouzeid, et al. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922{933, 2009.

    [2] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proc. EDBT, 2010.

    [3] B. H. Bloom. Space/time trade-o s in hash coding with allowable errors. Commun. ACM, 13(7):422{426.

    [4] C. Bohm and H.-P. Kriegel. A cost model and index architecture for the similarity join. In Proc. ICDE, 2001.

    [5] P. Cao and Z. Wang. E cient top-k query calculation in distributed networks. In Proc. ACM PODC, 2004.

    [6] S. Cohen and Y. Matias. Spectral Bloom lters. In Proc. ACM SIGMOD, 2003.

    [7] J. Dittrich, et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1-2):515{529, 2010.

    [8] C. Doulkeridis, et al. Processing of rank joins in highly distributed systems. In IEEE ICDE, 2012.

    [9] DynamoDB pricing scheme: http://aws.amazon.com/dynamodb/#pricing.

    [10] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proc. ACM PODS, 2001.

  • Metrics
    No metrics available
Share - Bookmark