
handle: 2117/340117
The de facto technological standard in data science is the notebook (e.g., Jupyter), which provides an integrated environment for executing data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, because most data science languages process data locally, i.e., on workstations with limited memory, and store data in files. The approach thus neglects the benefits of over 40 years of R&D in data engineering, i.e., advanced database technologies and data management techniques. In this paper, we advocate a standardized data engineering approach for data science and present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which in turn enables the semi-automation of their logical and physical design.
Peer Reviewed
Database management, Data processing pipeline, Data analytics, Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació, Data engineering, Data mining, Data management, Data science
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 8 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
