publication . Preprint . 2020

Storyboard: Optimizing Precomputed Summaries for Aggregation

Gan, Edward; Bailis, Peter; Charikar, Moses;
Open Access English
  • Published: 07 Feb 2020
Abstract
An emerging class of data systems partition their data and precompute approximate summaries (i.e., sketches and samples) for each segment to reduce query costs. They can then aggregate and combine the segment summaries to estimate results without scanning the raw data. However, given limited storage space each summary introduces approximation errors that affect query accuracy. For instance, systems that use existing mergeable summaries cannot reduce query error below the error of an individual precomputed summary. We introduce Storyboard, a query system that optimizes item frequency and quantile summaries for accuracy when aggregating over multiple segments. Com...
Subjects
free text keywords: Computer Science - Databases
Download from
50 references, page 1 of 4

[1] 2020. DataSketches Quantiles Sketch module. https://druid.apache.org/ docs/latest/development/extensions- core/datasketches- quantiles. html.

[2] Firas Abuzaid, Peter Bailis, Jialin Ding, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2018. MacroBase: Prioritizing Attention in Fast Data. ACM Trans. Database Syst. 43, 4, Article 15 (2018), 45 pages.

[3] Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, Jef Naughton, Peter Bailis, and Matei Zaharia. 2018. DIFF: A Relational Interface for Large-scale Data Explanation. PVLDB 12, 4 (2018), 419- 432.

[4] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. 1999. The Aqua Approximate Query Answering System. In SIGMOD. 574-576. [OpenAIRE]

[5] Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jef Phillips, Zhewei Wei, and Ke Yi. 2012. Mergeable Summaries. In PODS.

[6] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys. 29-42.

[7] Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate Counts and Quantiles over Sliding Windows. In PODS. 286-296. [OpenAIRE]

[8] Ran Ben Basat, Roy Friedman, and Rana Shahout. 2018. Stream Frequency over Interval Queries. PVLDB 12, 4 (2018), 433-445.

[9] Mihai Budiu, Parikshit Gopalan, Lalith Suresh, Udi Wieder, Han Kruiger, and Marcos K. Aguilera. 2019. Hillview: A Trillion-cell Spreadsheet for Big Data. PVLDB 12, 11 (July 2019), 1442-1457.

[10] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 16, 5 (1995), 1190-1208.

[11] CAIDA. 2016. The CAIDA UCSD Anonymized Internet Traces. http: //www.caida.org/data/passive/passive_dataset.xml http://www.caida. org/data/passive/passive_dataset.xml.

[12] Surajit Chaudhuri, Gautam Das, and Vivek Narasayya. 2007. Optimized Stratified Sampling for Approximate Query Processing. ACM Trans. Database Syst. 32, 2 (2007). [OpenAIRE]

[13] Bernard Chazelle. 2000. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, New York, NY, USA. [OpenAIRE]

[14] Edith Cohen, Graham Cormode, and Nick G. Dufield. 2011. StructureAware Sampling: Flexible and Accurate Summarization. PVLDB 4, 11 (2011), 819-830.

[15] Graham Cormode, Minos Garofalakis, Peter J Haas, and Chris Jermaine. 2012. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1-294.

50 references, page 1 of 4
Abstract
An emerging class of data systems partition their data and precompute approximate summaries (i.e., sketches and samples) for each segment to reduce query costs. They can then aggregate and combine the segment summaries to estimate results without scanning the raw data. However, given limited storage space each summary introduces approximation errors that affect query accuracy. For instance, systems that use existing mergeable summaries cannot reduce query error below the error of an individual precomputed summary. We introduce Storyboard, a query system that optimizes item frequency and quantile summaries for accuracy when aggregating over multiple segments. Com...
Subjects
free text keywords: Computer Science - Databases
Download from
50 references, page 1 of 4

[1] 2020. DataSketches Quantiles Sketch module. https://druid.apache.org/ docs/latest/development/extensions- core/datasketches- quantiles. html.

[2] Firas Abuzaid, Peter Bailis, Jialin Ding, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. 2018. MacroBase: Prioritizing Attention in Fast Data. ACM Trans. Database Syst. 43, 4, Article 15 (2018), 45 pages.

[3] Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, Jef Naughton, Peter Bailis, and Matei Zaharia. 2018. DIFF: A Relational Interface for Large-scale Data Explanation. PVLDB 12, 4 (2018), 419- 432.

[4] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. 1999. The Aqua Approximate Query Answering System. In SIGMOD. 574-576. [OpenAIRE]

[5] Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jef Phillips, Zhewei Wei, and Ke Yi. 2012. Mergeable Summaries. In PODS.

[6] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys. 29-42.

[7] Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate Counts and Quantiles over Sliding Windows. In PODS. 286-296. [OpenAIRE]

[8] Ran Ben Basat, Roy Friedman, and Rana Shahout. 2018. Stream Frequency over Interval Queries. PVLDB 12, 4 (2018), 433-445.

[9] Mihai Budiu, Parikshit Gopalan, Lalith Suresh, Udi Wieder, Han Kruiger, and Marcos K. Aguilera. 2019. Hillview: A Trillion-cell Spreadsheet for Big Data. PVLDB 12, 11 (July 2019), 1442-1457.

[10] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 16, 5 (1995), 1190-1208.

[11] CAIDA. 2016. The CAIDA UCSD Anonymized Internet Traces. http: //www.caida.org/data/passive/passive_dataset.xml http://www.caida. org/data/passive/passive_dataset.xml.

[12] Surajit Chaudhuri, Gautam Das, and Vivek Narasayya. 2007. Optimized Stratified Sampling for Approximate Query Processing. ACM Trans. Database Syst. 32, 2 (2007). [OpenAIRE]

[13] Bernard Chazelle. 2000. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, New York, NY, USA. [OpenAIRE]

[14] Edith Cohen, Graham Cormode, and Nick G. Dufield. 2011. StructureAware Sampling: Flexible and Accurate Summarization. PVLDB 4, 11 (2011), 819-830.

[15] Graham Cormode, Minos Garofalakis, Peter J Haas, and Chris Jermaine. 2012. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1-294.

50 references, page 1 of 4
Powered by OpenAIRE Research Graph
Any information missing or wrong?Report an Issue