Proteus-RAW integration

Connecting two distinct systems is a legitimate approach when building a standalone software solution that combines features of both. But handy as standard APIs and serialization formats can be, the combination will not always perform satisfactorily. Integration often requires ad hoc redesign and refactoring on both sides.

Integrating two complex systems can therefore take months of engineering time. It is valuable when these efforts also benefit both original products.

Two standalone query engines make up SmartDataLake’s data virtualization layer: RAW and Proteus, developed respectively by RAW Labs SA and the DIAS laboratory at EPFL.

– In RAW’s engine, queries execute against external datasets of various formats, located in storage systems of various kinds (HDFS, S3, RDBMS, etc.), as if they were tables. Because it is meant to query datasets that are presumably raw and unstructured, RAW’s query language, RQL, offers rich data processing and scripting capabilities.

– Proteus is an SQL engine that combines an efficient plan optimizer with sophisticated code generation logic, which makes it an extremely efficient execution system.

SmartDataLake’s data virtualization layer is implemented by stacking both systems: in a SmartDataLake Proteus session, tables may internally refer to RAW views, which RAW in turn knows as RQL queries over datasets.

In the first part of the project, the focus was on specific Proteus plan optimizations that let its query planner automatically offload certain algebra computation steps to RAW. This helps reduce the amount of data retrieved from RAW during query execution. However, intermediate results of the offloaded subquery were still sent to Proteus through RAW’s standard channel: its REST API, where rows are serialized as JSON in a single data stream.
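The effect of offloading a computation step can be illustrated with a small sketch. It is not Proteus or RAW code: `REMOTE_ROWS`, `remote_scan`, and the two query functions are hypothetical names, and the "remote engine" is simulated by an in-memory list. The sketch only shows that pushing a filter to the source returns the same answer while transferring fewer rows.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]

# Hypothetical stand-in for a dataset exposed by the remote engine.
REMOTE_ROWS: List[Row] = [
    {"id": i, "country": "CH" if i % 4 == 0 else "FR", "amount": i * 10}
    for i in range(1000)
]


def remote_scan(predicate: Optional[Callable[[Row], bool]] = None) -> List[Row]:
    """Simulate the remote engine: apply the filter at the source if one is given."""
    if predicate is None:
        return list(REMOTE_ROWS)
    return [row for row in REMOTE_ROWS if predicate(row)]


def query_without_pushdown() -> Tuple[List[Row], int]:
    """Fetch every row, then filter locally. Returns (result, rows transferred)."""
    transferred = remote_scan()
    result = [row for row in transferred if row["country"] == "CH"]
    return result, len(transferred)


def query_with_pushdown() -> Tuple[List[Row], int]:
    """Offload the filter to the remote side. Returns (result, rows transferred)."""
    transferred = remote_scan(lambda row: row["country"] == "CH")
    return transferred, len(transferred)
```

Both functions return the same rows; only the second element of the returned tuple, the number of rows that crossed the boundary, differs.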

In the second half of the project, EPFL developed an efficient storage system, which could eventually serve as a better internal communication channel between RAW and Proteus.

The RAW Labs engineering team expected that integrating their engine with both Proteus and the new SmartDataLake storage system would require major development and testing efforts. It was key to take this as an opportunity to augment RAW’s query engine with significant capabilities. RAW Labs SA therefore enhanced its query engine with support for:

– Apache Arrow for serializing results. Compared to JSON, a binary columnar format encodes the result sets sent to Proteus more compactly. Support for Arrow was also added to Proteus as part of SmartDataLake.

– Parallel writing to storage backends. Query results, which are generally read through RAW’s existing REST API, can optionally be written to a storage system instead. This happens in parallel from RAW’s computing cluster. Once SmartDataLake’s new storage service has received RAW’s results, Proteus can take over the rows and complete its part of the plan. The storage service therefore plays the role of an efficient communication channel between RAW and Proteus.
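To give a rough feel for why a binary columnar layout is more compact than row-oriented JSON, here is a toy sketch that uses only the Python standard library. It is not Apache Arrow and does not follow the Arrow format: the layout (a row count followed by one packed array per column) and the function names are assumptions made for illustration.

```python
import json
import struct
from typing import Dict, List

Row = Dict[str, object]

# Sample result set with one integer column and one float column.
rows: List[Row] = [{"id": i, "price": i * 0.5} for i in range(1000)]


def encode_rows_json(data: List[Row]) -> bytes:
    """Row-oriented text encoding: field names are repeated for every row."""
    return json.dumps(data).encode("utf-8")


def encode_columnar(data: List[Row]) -> bytes:
    """Toy columnar binary encoding: row count, then each column packed contiguously."""
    n = len(data)
    ids = [r["id"] for r in data]
    prices = [r["price"] for r in data]
    header = struct.pack("<I", n)                # 4-byte unsigned row count
    id_col = struct.pack(f"<{n}q", *ids)         # n little-endian 64-bit integers
    price_col = struct.pack(f"<{n}d", *prices)   # n little-endian 64-bit floats
    return header + id_col + price_col


def decode_columnar(buf: bytes) -> List[Row]:
    """Inverse of encode_columnar."""
    (n,) = struct.unpack_from("<I", buf, 0)
    ids = struct.unpack_from(f"<{n}q", buf, 4)
    prices = struct.unpack_from(f"<{n}d", buf, 4 + 8 * n)
    return [{"id": i, "price": p} for i, p in zip(ids, prices)]
```

In this sketch the columnar buffer stores 16 bytes per row plus a 4-byte header, whereas the JSON text repeats the field names and punctuation in every row. A real Arrow implementation additionally carries a schema, handles nulls and nested types, and allows zero-copy reads, none of which is modeled here.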

By agreeing on a de facto standard format (Arrow) and jointly designing interfaces such that SmartDataLake’s storage system could conveniently be abstracted as any other existing storage backend in RAW, two important features were added to RAW Labs’ engine, neither of which was available prior to its participation in SmartDataLake:

– Apache Arrow is now one of the formats supported by RAW. That format has been gaining popularity in recent years, especially as a serialization format for interoperability.

– Parallel export of query results to storage systems (HDFS, S3 and eventually RDBMS) is now implemented in RAW. This had been among the features requested by RAW Labs’ customers.
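The idea behind parallel export can be sketched as follows. This is a minimal illustration under stated assumptions, not RAW's implementation: a local directory stands in for the storage backend, threads stand in for cluster workers, and `write_partition`, `export_parallel`, and `read_all` are hypothetical names. Each worker writes its own partition as a separate file, and the consumer (Proteus in the text) later reads all the parts back.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List

Row = Dict[str, object]


def write_partition(out_dir: str, index: int, part: List[Row]) -> Path:
    """Write one partition of the result set to its own file."""
    path = Path(out_dir) / f"part-{index:05d}.json"
    path.write_text(json.dumps(part))
    return path


def export_parallel(partitions: List[List[Row]], out_dir: str,
                    max_workers: int = 4) -> List[Path]:
    """Write all partitions concurrently; return the file paths in partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_partition, out_dir, i, part)
            for i, part in enumerate(partitions)
        ]
        # result() re-raises any exception that occurred in a worker.
        return [future.result() for future in futures]


def read_all(out_dir: str) -> List[Row]:
    """Consumer side: read every part file back, in file-name order."""
    collected: List[Row] = []
    for path in sorted(Path(out_dir).glob("part-*.json")):
        collected.extend(json.loads(path.read_text()))
    return collected
```

Because each partition goes to a separate file, the writers need no coordination with each other, and the reader only has to list the parts once all writes have completed.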