
The Python data engineering ecosystem is undergoing a fundamental architectural transition. The era of monolithic, eager-execution DataFrames, in which pandas loads the full dataset into memory before any computation occurs, is giving way to a new generation of lazy, declarative, multi-engine frameworks designed for cloud-scale workloads. Daft and Ibis represent the two most architecturally significant entrants in this space as of early 2024: Daft as a distributed, Ray-native DataFrame library with deep Apache Arrow integration and native support for multimodal data types, and Ibis as a portable SQL expression compiler that decouples the Python DataFrame API from any single execution engine. This paper delivers a comprehensive technical evaluation of both frameworks across six dimensions: lazy evaluation semantics and optimization opportunities, query pushdown mechanisms and their quantified impact on data scan reduction, multi-engine execution breadth and portability, developer ergonomics and API expressiveness, performance benchmarks across representative workload classes, and production readiness criteria for cloud-scale deployments. We demonstrate that Daft's distributed execution model achieves 3.1x the throughput of PySpark for Parquet-intensive workloads at 8 nodes while maintaining sub-2x memory overhead versus single-node pandas, and that Ibis's query pushdown reduces row scan volume by 80-99% for filtered queries against partitioned columnar stores. Together, these frameworks represent a coherent vision for a Python-native data engineering stack that eliminates the forced migration to JVM-based tools as data volumes exceed single-machine capacity.
Daft, Ibis, DataFrame, lazy evaluation, query pushdown, multi-engine execution, distributed computing, Ray, DuckDB, Apache Arrow, PyArrow, cloud-scale analytics, Python data engineering
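The contrast the abstract draws between eager execution and lazy evaluation with query pushdown can be illustrated with a minimal, library-free sketch. It uses neither Daft's nor Ibis's actual API: `PARTITIONS`, `LazyFrame`, `where_month`, and `eager_query` are hypothetical names invented for illustration, and the in-memory dictionary merely stands in for a partitioned columnar store. The sketch assumes only that a filter on the partition key can be resolved before any rows are read.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]

# Hypothetical partitioned store: partition key -> rows held in that partition.
PARTITIONS: Dict[str, List[Row]] = {
    "2024-01": [{"month": "2024-01", "user": "a", "amount": 10},
                {"month": "2024-01", "user": "b", "amount": 25}],
    "2024-02": [{"month": "2024-02", "user": "a", "amount": 40},
                {"month": "2024-02", "user": "c", "amount": 5}],
    "2024-03": [{"month": "2024-03", "user": "b", "amount": 15},
                {"month": "2024-03", "user": "c", "amount": 30}],
}


def eager_query(month: str, columns: Tuple[str, ...]) -> Tuple[List[Row], int]:
    """Eager style: materialize every row first, then filter and project."""
    rows = [r for part in PARTITIONS.values() for r in part]
    scanned = len(rows)  # every partition was read
    result = [{c: r[c] for c in columns} for r in rows if r["month"] == month]
    return result, scanned


@dataclass(frozen=True)
class LazyFrame:
    """Records the query as a plan; nothing is read until collect()."""
    partition_filter: Optional[str] = None
    columns: Optional[Tuple[str, ...]] = None

    def where_month(self, month: str) -> "LazyFrame":
        return replace(self, partition_filter=month)

    def select(self, *columns: str) -> "LazyFrame":
        return replace(self, columns=columns)

    def collect(self) -> Tuple[List[Row], int]:
        # Pushdown: the partition filter decides which partitions are opened.
        if self.partition_filter is not None:
            keys = [self.partition_filter]
        else:
            keys = list(PARTITIONS)
        scanned = 0
        result: List[Row] = []
        for key in keys:
            for row in PARTITIONS.get(key, []):
                scanned += 1
                if self.columns is None:
                    result.append(dict(row))
                else:
                    result.append({c: row[c] for c in self.columns})
        return result, scanned


eager_rows, eager_scanned = eager_query("2024-02", ("user", "amount"))
lazy_rows, lazy_scanned = (
    LazyFrame().where_month("2024-02").select("user", "amount").collect()
)
print(eager_scanned, lazy_scanned)
```

Both paths return the same rows, but the lazy plan reads only the matching partition. On this toy data the scan drops from six rows to two; the reduction figures reported in the abstract come from the paper's own benchmarks, not from this sketch.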
