publication . Conference object . Other literature type . Preprint . 2018

Pilot-Streaming: A Stream Processing Framework for High-Performance Computing

Luckow, Andre; Chantzialexiou, George; Jha, Shantenu
Open Access
  • Published: 01 Oct 2018
  • Publisher: IEEE
Abstract
An increasing number of scientific applications rely on stream processing to generate timely insights from the data feeds of scientific instruments, simulations, and Internet-of-Things (IoT) sensors. Developing streaming applications is a complex task that requires integrating heterogeneous, distributed infrastructure, frameworks, middleware, and application components. Different application components are often written in different languages using different abstractions and frameworks. Often, additional components, such as a message broker (e.g. Kafka), are required to decouple data production and consumption and to avoid issues such as back-pressu...
Subjects
free text keywords: Spark (mathematics), Distributed computing, Data modeling, Test data generation, Stream processing, Data processing, Computer science, Middleware, Resource management, Use case, Computer Science - Distributed, Parallel, and Cluster Computing
Funded by
NSF| CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 1443054
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Advanced Cyberinfrastructure
NSF| SI2-SSE: RADICAL Cybertools: Scalable, Interoperable and Sustainable Tools for Science
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 1440677
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Advanced Cyberinfrastructure