Data Engineering for HPC with Python

Publication: Article, Preprint, Other literature type
01 Nov 2020
Embargo end date: 01 Jan 2020
Publisher: IEEE
Journal: 2020 IEEE/ACM 9th Workshop on Python for High-Performance and Scientific Computing (PyHPC)
Funded by: NSF | CIF21 DIBBs: Middleware a..., NSF | Network for Computational..., NSF | Collaborative Research: F...

Authors: Abeykoon, Vibhatha; Perera, Niranda; Widanage, Chathura; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Maithree, Hasara; Wickramasinghe, Pulasthi; +2 Authors

doi: 10.1109/pyhpc51966.2020.00007, 10.48550/arxiv.2010.06312

arXiv: 2010.06312

Abstract

Data engineering is becoming an increasingly important part of scientific discovery with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movement. One goal of data engineering is to transform data from its original form into the vector, matrix, or tensor formats accepted by deep learning and machine learning applications. Many structures, such as tables, graphs, and trees, can represent data in these data engineering phases. Among them, tables are a versatile and commonly used format for loading and processing data. In this paper, we present a distributed Python API based on a table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high-performance compute kernels in C++, with an in-memory table representation and Cython-based Python bindings. In the core system, we use MPI for distributed-memory computations with a data-parallel approach for processing large datasets on HPC clusters.

9 pages, 11 images, Accepted in 9th Workshop on Python for High-Performance and Scientific Computing (In conjunction with Supercomputing 20)
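
The data-parallel table processing described in the abstract can be illustrated with a minimal sketch. This is not the paper's API: every name below (`Table`, `partition`, `local_transform`, `to_matrix`) is hypothetical, the example columns are invented, and the workers are simulated in a single process rather than run as MPI ranks. It only shows the general idea of splitting a table's rows across workers, transforming each partition locally, and gathering the results into a matrix suitable for a learning application.

```python
# Illustrative only: a pure-Python stand-in for a data-parallel table pipeline.
# None of these names come from the paper's API, and the workers are simulated
# in one process instead of running as MPI ranks.
from typing import Dict, List

Table = Dict[str, list]  # column name -> column values (columnar layout)


def num_rows(table: Table) -> int:
    """Number of rows in a columnar table (0 for an empty table)."""
    return len(next(iter(table.values()))) if table else 0


def partition(table: Table, world_size: int) -> List[Table]:
    """Split rows into contiguous blocks, one block per simulated worker."""
    base, extra = divmod(num_rows(table), world_size)
    parts, start = [], 0
    for rank in range(world_size):
        stop = start + base + (1 if rank < extra else 0)
        parts.append({name: col[start:stop] for name, col in table.items()})
        start = stop
    return parts


def local_transform(table: Table) -> Table:
    """Per-worker step: drop rows with missing values, then derive a column."""
    keep = [i for i in range(num_rows(table))
            if all(col[i] is not None for col in table.values())]
    out = {name: [col[i] for i in keep] for name, col in table.items()}
    out["area"] = [w * h for w, h in zip(out["width"], out["height"])]
    return out


def to_matrix(table: Table, columns: List[str]) -> List[List[float]]:
    """Convert selected columns into a row-major numeric matrix."""
    return [[float(table[c][i]) for c in columns]
            for i in range(num_rows(table))]


raw = {
    "width": [1, 2, None, 4, 5],
    "height": [10, 20, 30, 40, None],
}

# Each simulated worker transforms only its own partition of the rows.
parts = [local_transform(p) for p in partition(raw, world_size=2)]

# Gather the per-worker matrices into one global matrix.
matrix = [row for p in parts
          for row in to_matrix(p, ["width", "height", "area"])]
print(matrix)
```

In a real distributed-memory setting each partition would live on a separate process and the final gather would be a collective communication step; the sketch keeps everything in one process so that it runs without an MPI installation.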

Related Organizations

Stanford University, United States
University of Moratuwa, Sri Lanka
Indiana University, United States
DePaul University, United States

Keywords

Performance (cs.PF), Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Computers and Society, Computer Science - Software Engineering, Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing, Computers and Society (cs.CY), Distributed, Parallel, and Cluster Computing (cs.DC)

Impact by BIP!

selected citations: 7. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.