Fast Heterogeneous Analytics in SmartDataLake

Complex analytical queries with multiple joins over vast amounts of heterogeneous data often take so long to execute that interactive analysis becomes impractical. In SmartDataLake (SDL), the EPFL team accelerates execution by taking advantage of the diverse hardware in modern servers. Unlike other database engines that parallelize execution exclusively for homogeneous multicore systems, EPFL opts for hybrid CPU-GPU execution. For this purpose, EPFL has designed and implemented a system with a new exchange operator that encapsulates heterogeneous parallelism across both CPUs and GPUs. The exchange operator provides a common interface for all hardware devices and maximizes utilization by elastically assigning tasks at runtime. Additionally, unlike the traditional exchange operator, which connects individual interpreted operators, the new operator connects pipelines in a just-in-time (JIT) compiled execution environment. By tightly integrating with JIT compilation, it eliminates the overhead of interpretation and enables the pipelining optimizations that JIT engines have pioneered.
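
The idea of elastic task assignment behind a common device interface can be illustrated with a minimal sketch. It is not the EPFL implementation: the names Worker, run_pipeline and exchange are hypothetical, plain Python threads stand in for CPU cores and GPUs, and a simple sum stands in for a JIT-compiled pipeline. It only shows how a shared task queue lets whichever device is free pick up the next block of data.

```python
import queue
import threading


class Worker:
    """Hypothetical common interface for any device (CPU core or GPU)."""

    def __init__(self, name):
        self.name = name
        self.processed = 0  # number of blocks this worker handled

    def run_pipeline(self, block):
        # Stand-in for a compiled pipeline: sum the values in the block.
        self.processed += 1
        return sum(block)


def exchange(blocks, workers):
    """Hand each block to whichever worker is free; return per-block results."""
    tasks = queue.Queue()
    for index, block in enumerate(blocks):
        tasks.put((index, block))
    results = [None] * len(blocks)

    def consume(worker):
        # Each worker pulls tasks until the shared queue is empty, so
        # faster workers naturally end up processing more blocks.
        while True:
            try:
                index, block = tasks.get_nowait()
            except queue.Empty:
                return
            results[index] = worker.run_pipeline(block)

    threads = [threading.Thread(target=consume, args=(w,)) for w in workers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


blocks = [list(range(i, i + 4)) for i in range(0, 32, 4)]
workers = [Worker("cpu0"), Worker("cpu1"), Worker("gpu0")]
results = exchange(blocks, workers)
total = sum(results)
```

How many blocks each worker processes varies from run to run, which is the point of assigning work at runtime rather than partitioning it statically up front. The combined result is the same either way.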

Two recent developments relate to the system’s ability to scale out and to execute queries over data that does not reside in memory:

Scale-out. Existing distributed analytical engines cannot fully exploit the accelerator-level parallelism that is available across multiple interconnected CPU-GPU servers. To remedy this, in SDL, EPFL built a framework that fully utilizes each cluster node and crosses node boundaries only when it is beneficial. The framework uses technologies such as RDMA and zero-copy to further reduce data transfer costs.

Storage management. On the one hand, the system relies on in-memory execution to maximize performance. On the other hand, data does not always fit in memory in real-life workloads, and frequently dropping and loading entire datasets incurs high and unnecessary costs. To deal with this problem, EPFL has devised a fine-grained tiering mechanism that takes advantage of the storage hierarchy in modern servers. The storage manager supports pluggable tiering policies that adhere to a common interface, and it minimizes data accesses and transfers.
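
To make the notion of a pluggable tiering policy more concrete, here is a minimal sketch under stated assumptions. The names TieringPolicy, LRUPolicy and StorageManager are hypothetical and are not taken from the SDL code base. Least-recently-used eviction is used only as a familiar example policy, and the hierarchy is reduced to two tiers, memory and everything below it. The sketch moves individual blocks rather than entire datasets.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class TieringPolicy(ABC):
    """Hypothetical common interface that every tiering policy implements."""

    @abstractmethod
    def on_access(self, block_id):
        """Record that a block was accessed."""

    @abstractmethod
    def choose_victim(self):
        """Return the id of the block to move out of memory."""

    @abstractmethod
    def on_evict(self, block_id):
        """Forget a block that has left memory."""


class LRUPolicy(TieringPolicy):
    """Example policy: evict the least recently used block."""

    def __init__(self):
        self._order = OrderedDict()  # oldest access first

    def on_access(self, block_id):
        self._order[block_id] = True
        self._order.move_to_end(block_id)

    def choose_victim(self):
        return next(iter(self._order))

    def on_evict(self, block_id):
        del self._order[block_id]


class StorageManager:
    """Keeps at most `capacity` blocks in memory and counts block loads."""

    def __init__(self, capacity, policy):
        self.capacity = capacity
        self.policy = policy
        self.in_memory = set()
        self.loads = 0  # transfers from the lower tier into memory

    def access(self, block_id):
        if block_id not in self.in_memory:
            if len(self.in_memory) >= self.capacity:
                # Memory is full: evict a single block, not a whole dataset.
                victim = self.policy.choose_victim()
                self.policy.on_evict(victim)
                self.in_memory.remove(victim)
            self.in_memory.add(block_id)
            self.loads += 1
        self.policy.on_access(block_id)


manager = StorageManager(capacity=2, policy=LRUPolicy())
for block_id in ["a", "b", "a", "c", "a", "b"]:
    manager.access(block_id)
```

Because StorageManager depends only on the TieringPolicy interface, a different policy can be swapped in without changing the manager itself.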

Stay tuned for more information and for the first open-source release in a few months!